Venkat Ram Rao

Question Answering with GPT-3

In this article I will walk through an example of a Question Answering system using GPT-3. I have previously experimented with Question Answering models using BERT, some of which I have detailed in my post here. I felt trying the same use case with GPT-3 would be interesting.

[Figure: Example input and output of the Question Answering system. Source: Self]

Differences between BERT and GPT-3

BERT and GPT-3 are both highly respected models that have done well in a multitude of NLP use cases. Variations of both models do well on commonly accepted NLP benchmarks such as the General Language Understanding Evaluation (GLUE). GPT-3 is a LOT larger at 175 billion parameters vs. 110–330 million parameters for BERT (depending on which variant you are using). However, size isn't everything, and there are differences which impact usage and training. BERT is a language encoding model, i.e. it is just an encoder. You typically need to add layers to serve as the decoder and then train the model for a specific use case (e.g. classification, entity recognition, question answering, etc.). GPT-3 models such as text-davinci are more complete text generation models and can easily be used for specific use cases without any training or modeling required (as we shall see below). They can, however, be tweaked and fine-tuned (again, as we shall see below).

In this article I will go over the following:

  1. Using Prompt Engineering to fit GPT-3 to various tasks.
  2. Fine-tuning the output of GPT-3 for Question Answering.
  3. Using Encoding with GPT-3 to find context for a question within a large body of text.

Some notes on using GPT-3…

I will be using the OpenAI API, which is available here: OpenAI API. It is available to everybody at this point, but you will need to set up an account, and it does have a cost to use (apart from a limited free trial).

You can use the API in a couple of ways. From the UI, you can go to the Playground, choose the model you want, and enter your text. This is good for a quick test of behavior and for experimenting with various prompts. Programmatically, you can access the API through the openai library. Conveniently, the Playground link above has an option to generate Python code for you.

Here is some sample code to get the sentiment of a text. (The API key can be created from your OpenAI account.)

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
  model="text-davinci-003",
  prompt="What is the sentiment of the below Text?\n\nSally went to a doctor and he prescribed Ibuprofen for her headache. It cured her headache, however, after 2 days she noticed an rash on her left shoulder. after a week, she experienced servere nausea. Not sure if this is related to the Ibuprofen but thought you should know. The headache is gone. She did have some suspicious looking sushi before the rash. However, the doctor was not available for followup calls\n\nSENTIMENT:",
  temperature=0.7,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

print(response["choices"][0]["text"])
## Concerned

Parameters like temperature and top_p can be tweaked to control output quality and variability.

OpenAI provides pretrained models like davinci, curie, babbage, and ada. I will not go into the differences between these as they are well documented on the OpenAI site.

Prompt Engineering

This, to me, is the biggest plus point of using GPT-3 versus building models the traditional way. You can get a lot done simply by crafting an appropriate prompt.

Below are some examples. All of these were generated using the text-davinci-003 model from OpenAI.

Example 1: Extracting Sentiment from Text

This is something that would require a trained model if you were using something like BERT or scikit-learn. With GPT-3, you pass a prompt prefix such as the one below, followed by the text:

What is the sentiment of the below Text?

For example:

[Figure: sentiment extraction prompt and model output. Source: Self]

Example 2: Extracting Adverse Events from a Text

This is a use case which has value in industries like Life Sciences.

[Figure: adverse event extraction prompt and model output. Source: Self]
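The screenshot above shows the prompt and output. As a rough sketch of what such a call could look like programmatically (the sample note and the prompt wording here are my own illustrative assumptions, not the exact ones from the screenshot):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# Hypothetical clinical note, reusing the scenario from the sentiment example above.
text = ("Sally was prescribed Ibuprofen for her headache. After 2 days she noticed "
        "a rash on her left shoulder, and after a week she experienced severe nausea.")

response = openai.Completion.create(
  model="text-davinci-003",
  prompt=f"List all the adverse events mentioned in the below Text.\n\n{text}\n\nADVERSE EVENTS:",
  temperature=0,
  max_tokens=256
)

print(response["choices"][0]["text"])
# Something along the lines of: rash on left shoulder, severe nausea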

Example 3: Extracting Named Entities from a Text

Again, this is something that would normally require a trained model. This text:

Extract all the named entities in the below Text.

The men’s high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics.

Barshim in particular was heard to ask a competition official “Can we have two golds?” in response to being offered a ‘jump off’. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men’s high jump for Italy and

Belarus, the first gold in the men’s high jump for Italy and Qatar, and the third consecutive medal in the men’s high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).

NAMED ENTITIES:

Gives an output of…

Summer Olympics, Olympic Stadium, Gianmarco Tamberi, Qatari, Mutaz Essa Barshim, Italy, Belarus, Maksim Nedasekau, Qatar, Patrik Sjöberg, Sweden

Example 4: Question Answering from a Text

…an example from the SQuAD Dataset

[Figure: a SQuAD context, question, and the model's answer. Source: Self]
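Since the screenshot may not come through everywhere, here is a rough sketch of the kind of prompt involved. The prompt prefix wording is my own assumption; the context placeholder would hold the SQuAD paragraph for the question (this particular question and its answer appear again later in this article):

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

# The context would be the SQuAD paragraph the question was asked against.
context = "..."  # e.g. the Normans paragraph used later in this article
question = "In what country is Normandy located?"

response = openai.Completion.create(
  model="text-davinci-003",
  prompt=f"Answer the question based on the below Text.\n\n{context}\n\nQUESTION: {question}\nANSWER:",
  temperature=0,
  max_tokens=64
)

print(response["choices"][0]["text"].strip())
# Expected to be along the lines of: France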

Fine-Tuning a Question Answering Model

As you can see from the examples above, you can do a lot with the models AS-IS by fairly simple Prompt Engineering.

I tested davinci and curie against the SQuAD v2 dataset. While the models performed well with no training, I did see misses:

  1. Part of SQuAD v2 is a set of unanswerable questions, and part of the challenge is to identify those questions. I noticed a success rate of less than 50% on these in my runs.
  2. In some cases, the model returned a lot more text than I wanted. For example, the correct answer to:

When did Beyonce leave Destiny’s Child and become a solo singer?

is (according to the SQuAD dataset, no idea if this is actually the case)

2003

The model gave me an answer of

Beyonce left Destiny’s Child and became a solo singer in 2003 with the release of her debut album, Dangerously in Love.

… which is technically correct but more text than I wanted.

The goal of my fine-tuning was to fix these issues.

According to OpenAI here:

To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to linearly increase with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.

I take that to mean a training set of a few hundred records is the recommended minimum, with better results (and presumably a risk of overfitting) as you add more data.

Interestingly, in the sample code on their GitHub, OpenAI also recommends that, for Question Answering use cases where you extract the answer from the body of a context, you use embeddings. I will cover that in the next section.

To fine-tune the model you will need to construct train and test JSONL files such as the below.
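Each line of the JSONL file is a single JSON object with a prompt and a completion. Based on the code that follows, a record looks roughly like this (the angle-bracket placeholders are mine):

{"prompt": "<context text>\nQuestion: <question text>\nAnswer:", "completion": " <answer text> --END--\n"}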

The following code works…

import pandas as pd

# I had preloaded my contexts, questions and answers into variables
# (valid_* hold answerable SQuAD records, invalid_* hold the unanswerable ones).

def create_fine_tuning_dataset_train():
    rows = []
    seen_questions = []
    # Answerable questions: the completion is the answer text.
    for i in range(0, 400):
        c = valid_contexts[i]
        q = valid_questions[i]
        a = valid_answers[i]['text']
        if q not in seen_questions:  # skip duplicate questions
            seen_questions.append(q)
            rows.append({"prompt": f"{c}\nQuestion: {q}\nAnswer:",
                         "completion": f" {a} --END--\n"})
    # Unanswerable questions: the completion is a fixed "not found" response.
    for i in range(0, 400):
        c = invalid_contexts[i]
        q = invalid_questions[i]
        if q not in seen_questions:
            seen_questions.append(q)
            rows.append({"prompt": f"{c}\nQuestion: {q}\nAnswer:",
                         "completion": " No appropriate context found --END--\n"})

    return pd.DataFrame(rows)

ft = create_fine_tuning_dataset_train()
ft.to_json('fine_tuned_qna_train.jsonl', orient='records', lines=True)

Note the " --END--\n" at the end of each completion. That indicates when to stop reading the output. GPT-3 is a text generation model and will keep generating text when you invoke it; a stop sequence helps identify where to cut it off.

Also note that the prompt does not contain a prefix. Essentially, the model will try to answer the question if the information is there and return "No appropriate context found" if it cannot.

You can then submit the files to the OpenAI API for training using a command such as the one below.

!openai api fine_tunes.create -t "fine_tuned_qna_train.jsonl" -v "fine_tuned_qna_test.jsonl" -m "davinci" --batch_size 8

The above will fine-tune davinci based on the examples. There are hyperparameters to control things like epochs and batch sizes, documented on the OpenAI site.

This typically takes a while to run as the jobs get queued on OpenAI. The following command keeps track of the status of the job (substitute the fine-tune job ID returned by the create command):

!openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>

Here is the code to invoke the tuned model. Note the stop=[" --END--\n"].

response = openai.Completion.create(
  model="FINE-TUNED-MODEL",
  prompt="The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\\\"Norman\\\" comes from \\\"Norseman\\\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries. \n\nQuestion: In what country is Normandy located?\nAnswer:",
  temperature=0,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  stop=[" --END--\n"]
)


print(response["choices"][0]["text"])
#France

ONE NOTE: it is possible to fine-tune discriminator models to determine whether a piece of text holds the answer to a question. I did not try that out for this exercise, as this article is already fairly long, but it is worth exploring.

Question Answering using Embeddings

Essentially, the process is as follows. My full code is HERE. I repurposed the code provided by OpenAI, available here, with a few changes to fix bugs found during testing.

  1. Use one of the OpenAI embedding models (e.g. text-embedding-ada-002) to create embeddings for the various sections of your text. This gives you an n-dimensional vector representing each text section (ada-002 appears to produce a 1536-dimension vector). In my case, since I was using the SQuAD dataset, each question comes with an associated context, so I created embeddings for each context.
  2. At invocation time, use the same model to create an embedding for the input question.
  3. Calculate the cosine similarity between the input question and every text section. This should identify the text section(s) most likely to contain the answer.
  4. Use a model such as vanilla davinci or curie, or your fine-tuned model. Pass the context(s) with the highest cosine similarity along with the input question and an appropriate prefix, and get the answer.

You will notice a couple of points:

  1. Step 4 is unrelated to the previous steps. If you are not comfortable with GPT-3, you could substitute a different model here while still using GPT-3 for embeddings.
  2. Steps 1–3 are not exactly groundbreaking. Finding the cosine similarity between embeddings has been done for years. The benefit here is that you are using the power of an LLM to generate what are hopefully better embeddings.

A couple of callouts…

I saved my embeddings as a Pickle file:

import pickle

import openai
from tqdm import tqdm

# The embedding model mentioned above.
EMBEDDING_MODEL = "text-embedding-ada-002"

def get_embedding(text):
    result = openai.Embedding.create(
      model=EMBEDDING_MODEL,
      input=text
    )
    return result["data"][0]["embedding"]

#create embeddings for all the contexts
tot_contexts = len(unique_contexts)
doc_embeddings = {}
for i in tqdm(range(0, tot_contexts), total=tot_contexts, desc="status:"):
    doc_embeddings[i] = get_embedding(unique_contexts[i])

#save embeddings so you don't have to run more than once
with open('saved_embeddings.pkl', 'wb') as f:
    pickle.dump(doc_embeddings, f)

#load the embeddings for use
with open('saved_embeddings.pkl', 'rb') as f:
    loaded_embeddings = pickle.load(f)
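The construct_prompt function below relies on a helper, order_document_sections_by_query_similarity, and a SEPARATOR constant that are not shown in full in this article. Here is a minimal sketch of how they could look, assuming the loaded_embeddings and unique_contexts from above and a simple dot product as the similarity score (OpenAI embeddings are normalized to unit length, so the dot product behaves like cosine similarity):

import numpy as np

SEPARATOR = "\n* "  # assumed separator placed in front of each chosen context

def vector_similarity(x, y):
    # Dot product of the two embedding vectors; equivalent to cosine
    # similarity when the vectors are unit length.
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(question):
    # Embed the question, score it against every stored context embedding,
    # and return (similarity, context_text) pairs sorted best-first.
    query_embedding = get_embedding(question)
    return sorted(
        [(vector_similarity(query_embedding, emb), unique_contexts[idx])
         for idx, emb in loaded_embeddings.items()],
        reverse=True
    )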

I found that I had to be careful constructing the prompt, as formatting mattered (this was an issue with the OpenAI sample code):

def construct_prompt(question) -> str:
    """
    Fetch the most relevant contexts for the question and build the prompt.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question)

    chosen_sections = []

    for _, document_section in most_relevant_document_sections[:3]:
        # Add the top 3 contexts
        chosen_sections.append(SEPARATOR + document_section)

    context = "".join(chosen_sections)

    # The template is left-aligned so that no stray indentation leaks into
    # the text sent to the model.
    out_prompt = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:
{context}

QUESTION: {question}
ANSWER:
""".format(context=context, question=question)

    return out_prompt

Also, depending on whether you are using a trial account, you might hit a rate limit on some of the calls.

The end result of the above is that, once you have created embeddings for all your text sections, you no longer need to pass a context explicitly when you ask a question; the most relevant contexts are looked up for you.
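The get_answer helper used in the example below ties these pieces together. It is not shown in full in this article, but a minimal sketch, assuming the construct_prompt function above and a completion model of your choice (the model name and parameters here are illustrative), could look like this:

COMPLETION_MODEL = "text-davinci-003"  # or your fine-tuned model

def get_answer(question):
    # Build a prompt from the most similar contexts, then ask the model.
    prompt = construct_prompt(question)
    response = openai.Completion.create(
        model=COMPLETION_MODEL,
        prompt=prompt,
        temperature=0,
        max_tokens=256
    )
    return response["choices"][0]["text"].strip()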

For example:

get_answer("Whe was the lead singer of destiny's child?")
#'Beyoncé Giselle Knowles-Carter'

Conclusion

Throughout this exercise, I kept comparing what I was doing to my previous experience Fine-Tuning BERT for Question Answering.

GPT-3 is a lot easier to use than some alternatives, and a lot of use cases can be met using Prompt Engineering alone. On the other hand, I also feel I have more control over my model with BERT. With BERT, I kept playing around with things like the loss function, dropout layers, etc. to get what I wanted. Fine-tuning GPT-3 was a LOT easier but feels like a bit of a black box to me. I did not run the entire SQuAD dataset through GPT-3, since each invocation costs money; however, after testing a few thousand records, GPT-3 seems more accurate.

In terms of deployment, you can deploy BERT as a self-contained Docker container in a multitude of ways. The implementation of GPT-3 I have described above relies on the OpenAI API endpoint. However, there are alternative GPT-3 deployments, such as on Azure, that are worth exploring.

Food for thought…
