How to extract JSON from LLM response

JSON Simplified: Unlocking Data Treasures at Your Fingertips!

I have been working with LLMs (Large Language Models) for the past 8-9 months now. I started with OpenAI's GPT models. Using GPT models is seamless (most of the time): if you need to give some instruction to improve the response, a few changes are usually enough, and boom, you get the expected output.
But when it comes to open-source Large Language Models, it becomes nerve-racking. I recently worked on a chatbot and wanted to extract a particular JSON output. It was a fine-tuned model, so it struggled to follow instructions given through the system prompt. After pouring my heart and soul into it, I finally got the expected output (4 days 😬).

I have shared both approaches below, so without wasting any more time, let's get into it.
I won't be able to share the exact output because it was a company task during my internship, but I will make sure to give you enough insight to easily understand it.

How to get JSON in OpenAI models

It's quite easy. Just write in the prompt "Give a JSON response only", or simply use response_format={ "type": "json_object" }.

Here is an example:

from openai import OpenAI

# Create a client; the key can also come from the OPENAI_API_KEY environment variable
client = OpenAI(api_key="your-api-key")

# text_to_convert holds the raw text you want converted to JSON
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to output JSON. Convert the given text into JSON format."},
        {"role": "user", "content": text_to_convert + " Store them in a component JSON object."},
    ],
)
components = response.choices[0].message.content
print(components)

How to get JSON response in an Open-source Large Language Model

It's not really a straight path when it comes to getting JSON out of an open-source LLM's response. Open-source models need a different level of prompt engineering, and each model may respond differently to the same prompt. It gets a little tricky to extract the JSON from a response so that json.loads() executes without throwing a decode error.
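The failure mode is easy to reproduce: json.loads() expects the whole string to be JSON, so any prose the model wraps around the object makes parsing fail (the response string below is a made-up example).

```python
import json

# A typical open-source LLM reply: prose wrapped around the JSON we actually want
response = (
    "Sure! Here is the JSON you asked for:\n"
    '{"key1": "value1"}\n'
    "Let me know if you need anything else."
)

try:
    json.loads(response)
except json.JSONDecodeError as err:
    # json.loads refuses the string because of the leading prose
    print(f"Failed to parse: {err.msg}")
```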

Available options

There are a few available options that will help you with JSON schema enforcement:

  1. lm-format-enforcer

  2. guidance

  3. jsonformer

  4. outlines
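If you'd rather avoid a third-party dependency, Python's standard library also offers a lightweight option: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. A minimal sketch (the reply string is hypothetical, and this assumes the first "{" in the text actually opens the JSON object):

```python
import json

def extract_first_json(text: str) -> dict:
    """Parse the first JSON object embedded in arbitrary surrounding text."""
    decoder = json.JSONDecoder()
    start = text.find("{")                        # position of the first opening brace
    obj, _end = decoder.raw_decode(text, start)   # parse from there, ignore trailing prose
    return obj

reply = 'Model says: {"key1": "value1", "key2": "value2"} hope that helps!'
print(extract_first_json(reply))  # {'key1': 'value1', 'key2': 'value2'}
```

Note that this only works when the JSON uses double quotes; Python-style single quotes (covered later in this post) still need cleanup first.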

Doing it from scratch

So during this project, I was not aware of the above-mentioned solutions, and json.loads() kept throwing a decode error on my extractions.

After lots and lots of struggle, I finally figured it out. This is what I had to extract from the LLM response, and it had to execute successfully through json.loads():

{
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": "value4",
    "key5": {
        "key5.1": "value5.1",
        "key5.2": "value5.2",
        "key5.3": [
            "value5.3"
        ]
    },
    "key6": "value6",
    "key7": "value7"
}

In the LLM response, this JSON schema was generated right after the marker text': . So here is the logic that I built:

# Find the starting index of the substring "text':"
start_index = text_string.find("text':") + len("text':")

# Find the index of the substring "key7" starting from the start_index
key7_index = text_string.find("key7", start_index)

# Find the index of the closing curly brace '}' starting from the key7_index
end_index = text_string.find('}', key7_index)

# Extract the substring from the start_index up to the end_index (inclusive), and remove any leading or trailing whitespace
extracted_text = text_string[start_index:end_index + 1].strip()
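Running that slicing logic on a made-up response string (the real one isn't shareable) shows what it produces:

```python
# Hypothetical LLM response in the shape described above:
# the schema appears after the marker "text':" and its last key is "key7"
text_string = (
    "Some model chatter... text': {\n"
    "    'key1': 'value1',\n"
    "    'key7': 'value7'\n"
    "} and some trailing explanation."
)

start_index = text_string.find("text':") + len("text':")
key7_index = text_string.find("key7", start_index)
end_index = text_string.find('}', key7_index)
extracted_text = text_string[start_index:end_index + 1].strip()

# Prints the block from '{' through the first '}' after "key7"
print(extracted_text)
```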

Now we have the first and last curly braces. You might be thinking that's it, we extracted the JSON, work finished. But a real problem arises here: when we pass this to json.loads(), it throws a decode error, because the LLM may generate the JSON schema with stray characters or indentation around the braces. To solve this, I wrote the following code:

# Find the index of the first opening curly brace '{' within the extracted_text
start_index = extracted_text.find('{') + 1

# Extract text from extracted_text starting from the position after the first '{'
text_after_first_brace = extracted_text[start_index:]

# Find the index of the last closing curly brace '}' within the text_after_first_brace
end_index = text_after_first_brace.rfind('}')

# Extract text from text_after_first_brace up to the position of the last '}'
text_before_last_brace = text_after_first_brace[:end_index]

# Concatenate the extracted text with braces to ensure a valid JSON format
result = "{" + text_before_last_brace + "}"
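To see why the re-bracing step matters, feed it an extraction that still carries junk outside the braces (a made-up example with a stray language tag before the opening brace):

```python
import json

# An extraction that still has debris before the first '{'
extracted_text = 'json\n{\n    "key1": "value1"\n}'

start_index = extracted_text.find('{') + 1
text_after_first_brace = extracted_text[start_index:]
end_index = text_after_first_brace.rfind('}')
text_before_last_brace = text_after_first_brace[:end_index]

# Everything outside the outermost braces is gone; json.loads now accepts it
result = "{" + text_before_last_brace + "}"
print(json.loads(result))  # {'key1': 'value1'}
```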

That's it, you get a JSON string that can be converted into a dict using json.loads(). In my case, in some places, the LLM used to write single quotes instead of double quotes. That was another cause of not getting a dict.

import json
import re

# Replace single quotes "'" with double quotes '"' in the result
output_string = result.replace("'", '"')

# Use regular expression substitution to remove backslashes '\'
output_string2 = re.sub(r'\\', '', output_string)
json_data = json.loads(output_string2)
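Putting the pieces together, here is the whole pipeline as one helper. The marker string and sample reply are hypothetical; I use rfind for the closing brace, which assumes no braces appear after the JSON, and note that the blanket quote replacement will corrupt any value that contains an apostrophe, so treat it as a last resort:

```python
import json
import re

def extract_json(text_string: str, marker: str = "text':") -> dict:
    """Slice out the JSON block after `marker` and coerce it into a dict."""
    start_index = text_string.find(marker) + len(marker)
    end_index = text_string.rfind('}')            # last closing brace in the response
    extracted = text_string[start_index:end_index + 1]

    # Keep only the outermost brace-to-brace span
    inner = extracted[extracted.find('{') + 1:extracted.rfind('}')]
    result = "{" + inner + "}"

    # Normalise Python-style quoting and stray escapes, then parse
    result = re.sub(r'\\', '', result.replace("'", '"'))
    return json.loads(result)

reply = "blah text': {'key1': 'value1', 'key7': 'value7'} done"
print(extract_json(reply))  # {'key1': 'value1', 'key7': 'value7'}
```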

I hope this blog helps you finally extract your JSON schema 😉.

Thank you for reading 😁.

I do welcome constructive criticism and alternative viewpoints. If you have any thoughts or feedback on this approach, please feel free to share them in the comments section below.
