I’m working on a project where I’m trying to harvest lists of data from ChatGPT via its API. This post applies to the gpt-3.5-turbo model called through the API. I don’t have access to the GPT-4 API yet, but I’m on the waitlist.
When I first looked into it, I was excited to learn that ChatGPT can return JSON (as opposed to regular text). This is really cool when it works. You can even tell it the specific format you want and it will follow it.
An example prompt for getting a list of piano pieces:
Tell me the top 10 classical piano pieces played today. Provide the response in JSON only, following this format: [{ "name": {name of the piece}, "composer": {composer}, "difficulty_level": {difficulty level} }, ...]
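For context, here is a minimal sketch of what a call with that prompt might look like using the openai Python library and the gpt-3.5-turbo model. The call_open_ai helper referenced later in this post wraps something like this; the wrapper shown here is my own illustration, not the project’s actual code, and the API key is a placeholder.

import openai

openai.api_key = 'YOUR_API_KEY'  # placeholder, not a real key

def call_open_ai(prompt):
    # Send a single-turn chat request to gpt-3.5-turbo and return the raw response
    return openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'user', 'content': prompt}],
    )

prompt = (
    'Tell me the top 10 classical piano pieces played today. '
    'Provide the response in JSON only, following this format: '
    '[{ "name": {name of the piece}, "composer": {composer}, '
    '"difficulty_level": {difficulty level} }, ...]'
)
result = call_open_ai(prompt)
print(result['choices'][0]['message']['content'])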
However, in practice the API calls return JSON about 80% of the time; the other 20% of the time the model says something like:
“Sorry, as an AI language model, I am not able to provide a JSON response only.”
This is in spite of the fact that I TOLD IT to return JSON and NOTHING else. So it acts a bit difficult at times, and you never know which response you will get… which seems very non-computer-like, but the model is designed to vary its output to seem more human-like.
For a script like this that makes repeated calls, having the output format alternate between text and JSON is unacceptable.
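One way to detect the bad case is to try parsing the reply and treat anything that fails as a refusal. A minimal sketch, assuming the reply text has already been pulled out of the API response:

import json

def try_parse_json(raw_answer):
    # Attempt to parse the model's reply as JSON; return None if it refused
    # or wrapped the data in extra commentary.
    try:
        return json.loads(raw_answer)
    except (json.JSONDecodeError, ValueError):
        return None

refusal = "Sorry, as an AI language model, I am not able to provide a JSON response only."
print(try_parse_json(refusal))  # None -> retry, or fall back to the plain-list prompt below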
So the solution was to tell it to return a plain list.
Tell me the top 10 classical piano pieces played today. Provide the response with one answer per line, no leading numbers or symbols, and no other commentary
This approach reliably returns a simple list, so the format is consistent, but the data still needs cleansing. Sometimes it numbers the items (1., 2., 3.…) or adds a hyphen at the start of each line. Sometimes it adds a helpful Clippy-like message before the list (“Sure, I can help you with that”), which needs to be filtered out.
Here is a simplified version of code I used, written in Python:
import unicodedata


def get_items(opening_conversation):
    """
    Call the ChatGPT API and parse the reply into a list of items.
    :param opening_conversation: the prompt to send
    :return: list of dicts like {'name': ...}
    """
    try:
        # this code calls the ChatGPT API
        result = call_open_ai(opening_conversation)
    except Exception as e:
        print('Unable to get initial data from OpenAI')
        raise e

    items = list()
    try:
        # get the content of the response from ChatGPT
        raw_answer = result['choices'][0]['message']['content']
        for line in raw_answer.splitlines():
            line = clean_line(line)
            if skip_line(line):
                continue
            items.append({'name': line})
        if not items:
            print('Nothing came back?')
            print(raw_answer)
        return items
    except Exception as e:
        print('Unable to parse the response')
        print('Raw result was ' + str(result))
        raise e


def clean_line(line):
    """
    Cleanup string data from ChatGPT.
    :param line: string from ChatGPT
    :return: cleaned string
    """
    line = line.strip()

    # clean up the 'bullets' it adds to lists
    if line.startswith('- '):
        line = line[2:]

    # clean up the numbering even though WE TOLD IT NOT TO!
    for i in range(1, 100):
        if line.startswith(str(i) + '. '):
            x = len(str(i)) + 2
            line = line[x:]

    # if the last character is a ' or ", remove it
    if line.endswith('"') or line.endswith("'"):
        line = line[:-1]

    # replace all accents with regular ASCII characters to smooth out variations
    str_normalized = unicodedata.normalize('NFKD', line)
    str_bytes = str_normalized.encode('ASCII', 'ignore')
    line = str_bytes.decode('ASCII')

    return line


def skip_line(line):
    """
    Helper to tell if this line is chatter from ChatGPT that can be ignored.
    :param line: string returned from ChatGPT
    :return: True if the line should be skipped
    """
    # skip empty lines
    if not line:
        return True

    # it is trying to be friendly, but we don't want this garbage in the data
    if line.startswith('Sure, I') or \
            line.startswith('Sure I') or \
            line.startswith('Sure, here') or \
            line.startswith('Sure here') or \
            line.startswith('Here are') or \
            line.startswith('Okay, here') or \
            line.startswith('Okay here'):
        return True

    return False
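A quick illustration of what the cleanup does. The input lines here are made-up examples of the kind of formatting ChatGPT adds, not actual API output:

# Hypothetical raw lines showing the numbering, bullets, and chatter we asked it to omit
raw_lines = [
    'Sure, I can help you with that!',
    '1. Clair de Lune',
    '- Moonlight Sonata',
    '',
]

for raw in raw_lines:
    cleaned = clean_line(raw)
    if skip_line(cleaned):
        continue
    print(cleaned)

# Prints:
# Clair de Lune
# Moonlight Sonata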
One thing I found is that if you ask it for too much data, it won’t return a result. You have to trick it into paginating the output, getting a chunk at a time.
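A rough sketch of that pagination idea, asking for the list in smaller numbered ranges and looping. The exact prompt wording and chunk size here are my own guesses rather than a tested recipe, and get_items() is the parsing routine shown above (assumed to accept a prompt string):

all_items = []
chunk_size = 10

for start in range(1, 51, chunk_size):
    end = start + chunk_size - 1
    prompt = (
        f'Tell me classical piano pieces ranked {start} through {end} by popularity today. '
        'Provide the response with one answer per line, no leading numbers or symbols, '
        'and no other commentary'
    )
    # accumulate the cleaned items from each chunk
    all_items.extend(get_items(prompt))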
This is part of an open source project I started called Visualizing Relationships in ChatGPT. Stay tuned for more posts about my findings.