Question Answering with GPT-3
In this article I will walk through an example of a Question Answering system using GPT-3. I have previously experimented with Question Answering models using BERT, some of which I have detailed in my post here, so I thought it would be interesting to try the same use case with GPT-3.
[Figure: example input and output]
Differences between BERT and GPT-3
BERT and GPT-3 are both highly respected models that have done well in a multitude of NLP use cases. Variations of both do well on commonly accepted NLP benchmarks such as the General Language Understanding Evaluation (GLUE). GPT-3 is a LOT larger, at 175 billion parameters vs 110–330 million parameters for BERT (depending on which variant you are using). However, size isn't everything, and there are differences which impact usage and training. BERT is a language encoding model, i.e. it is just an encoder. You typically need to add layers to serve as the decoder and train the model to use it for specific use cases (e.g. classification, entity recognition, question answering, etc.). GPT-3 models such as text-davinci are more complete text generation models and can easily be used for specific use cases without any kind of training or modeling required (as we shall see below). They can, however, be tweaked and fine-tuned (again, as we shall see below).
In this article I will go over the following:
- Using Prompt Engineering to fit GPT-3 to various tasks.
- Fine-tuning the output of GPT-3 for Question Answering.
- Using Encoding with GPT-3 to find context for a question within a large body of text.
Some notes on using GPT-3…
I will be using the OpenAI API, which is available here: OpenAI API. It is available to everybody at this point, but you will need to set up an account, and it does have a cost to use (apart from a limited free trial).
You can use the API in a couple of ways. From the UI, you can go to the Playground, choose the model you want and enter your text. This is good for a quick test of behavior and for experimenting with various prompts. Programmatically, you can access the API through the openai Python library. Conveniently, the playground link above has an option to generate Python code for you.
Here is some sample code to get the sentiment of a text. (The API key can be created from your OpenAI account.)
import os
import openai

# Read the API key from an environment variable rather than hard-coding it
openai.api_key = os.getenv("OPENAI_API_KEY")

# Ask the completions endpoint to classify the sentiment of the passage
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="What is the sentiment of the below Text?\n\nSally went to a doctor and he prescribed Ibuprofen for her headache. It cured her headache, however, after 2 days she noticed a rash on her left shoulder. After a week, she experienced severe nausea. Not sure if this is related to the Ibuprofen but thought you should know. The headache is gone. She did have some suspicious looking sushi before the rash. However, the doctor was not available for follow-up calls\n\nSENTIMENT:",
    temperature=0.7,
    max_tokens=256,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)

# The generated label is in the first (and only) choice
print(response["choices"][0]["text"])
## Concerned
Parameters like temperature and top_p can be tweaked to trade off consistency against output variability.
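As a small illustration (a sketch, not the exact settings from my runs), setting temperature to 0 makes extraction-style prompts effectively deterministic, which helps when you parse the output programmatically:

# A sketch: temperature=0 removes sampling randomness, so the same prompt
# returns the same answer on every call.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="What is the sentiment of the below Text?\n\n<your text here>\n\nSENTIMENT:",
    temperature=0,   # deterministic output for extraction-style tasks
    max_tokens=10,   # sentiment labels are short, so a small cap is enough
    top_p=1
)
print(response["choices"][0]["text"].strip())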
From OpenAI you get pretrained models like davinci, curie, babbage and ada. I will not go into the differences between these as they are well documented on the OpenAI site.
Prompt Engineering
This, to me, is the biggest plus point of using GPT-3 vs building models the traditional way. You can get a lot done simply by creating an appropriate prompt.
Below are some examples. All of these were generated using the text-davinci-003 model from OpenAI.
Example 1: Extracting Sentiment from Text
This is something which would require a model to be trained if you were using something like BERT or Scikit-learn. With GPT-3 you pass a prompt prefix such as the one below, followed by the text:
What is the sentiment of the below Text?
For example, the sample code shown earlier uses exactly this prompt.
Example 2: Extracting Adverse Events from a Text
This is a use case which has value in industries like Life Sciences.
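The prompt follows the same pattern as the sentiment example. Here is a sketch (the wording is illustrative, not the exact prompt from my runs), reusing the Ibuprofen text from earlier:

prompt = (
    "List the Adverse Events mentioned in the below Text.\n\n"
    "Sally went to a doctor and he prescribed Ibuprofen for her headache. "
    "It cured her headache, however, after 2 days she noticed a rash on her "
    "left shoulder. After a week, she experienced severe nausea.\n\n"
    "ADVERSE EVENTS:"
)

# Same completion call as before; temperature=0 keeps the extraction deterministic
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,
    max_tokens=64
)
print(response["choices"][0]["text"])  # expect the model to list the rash and the nausea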
Example 3: Extracting Named Entities from a Text
Again, something which would normally require a trained model. This text:
Extract all the named entities in the below Text.
The men’s high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics.
Barshim in particular was heard to ask a competition official “Can we have two golds?” in response to being offered a ‘jump off’. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men’s high jump for Italy and
Belarus, the first gold in the men’s high jump for Italy and Qatar, and the third consecutive medal in the men’s high jump for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).
NAMED ENTITIES:
Gives an output of…
Summer Olympics, Olympic Stadium, Gianmarco Tamberi, Qatari, Mutaz Essa Barshim, Italy, Belarus, Maksim Nedasekau, Qatar, Patrik Sjöberg, Sweden
Example 4: Question Answering from a Text
…an example from the SQuAD Dataset
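The prompt again follows the same shape: the context paragraph, then the question, then an answer cue. Here is a sketch (the context is a shortened paraphrase of the SQuAD Beyoncé passage, and the exact wording of my prompt differed):

context = (
    "Beyonce Giselle Knowles-Carter rose to fame in the late 1990s as lead "
    "singer of Destiny's Child. The group's hiatus saw the release of her "
    "debut album, Dangerously in Love (2003)."
)
question = "When did Beyonce leave Destiny's Child and become a solo singer?"

# Context first, then the question, then a cue for the answer
prompt = (
    "Answer the Question based on the below Text.\n\n"
    f"Text: {context}\n\n"
    f"Question: {question}\n\n"
    "Answer:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,
    max_tokens=64
)
print(response["choices"][0]["text"].strip())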
Fine-Tuning a Question Answering Model
As you can see from the examples above, you can do a lot with the models AS-IS by fairly simple Prompt Engineering.
I tested out davinci and curie against the SQuAD v2 dataset. While the models performed well with no training, I did see misses:
- Part of SQuAD v2 is a set of unanswerable questions, and part of the challenge is to identify those questions. I noticed a success rate of <50% with these in my runs.
- In some cases, the model was returning a lot more text than I wanted. For example, the correct answer to:
When did Beyonce leave Destiny’s Child and become a solo singer?
is (according to the SQuAD dataset, no idea if this is actually the case)
2003
The model gave me an answer of
Beyonce left Destiny’s Child and became a solo singer in 2003 with the release of her debut album, Dangerously in Love.
… which is technically correct but is more text than I wanted.
The goal of my fine-tuning was to fix these issues.
According to OpenAI here:
To fine-tune a model that performs better than using a high-quality prompt with our base models, you should provide at least a few hundred high-quality examples, ideally vetted by human experts. From there, performance tends to linearly increase with every doubling of the number of examples. Increasing the number of examples is usually the best and most reliable way of improving performance.
I take that to mean a training set of a few hundred records is the recommendation, with better results (and presumably a risk of overfitting) as you add more data.
Interestingly, in the sample code on their GitHub, OpenAI also recommends that, for Question Answering use cases where you extract text from the body of the context, you use embeddings. I will cover that in the next section.
To fine-tune the model you will need to construct train and test JSONL files such as the below.
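As a sketch of the format (the records here are illustrative, not my actual training data), each line is a JSON object pairing a prompt with the short completion you want the model to learn to return:

import json

# Illustrative records: each prompt ends with the same "Answer:" cue used at
# inference time, and each completion is just the short answer plus a newline
# acting as a stop marker.
records = [
    {
        "prompt": "Text: <context paragraph>\n\nQuestion: When did Beyonce leave Destiny's Child and become a solo singer?\n\nAnswer:",
        "completion": " 2003\n"
    }
]

with open("squad_train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

The (now legacy) openai command line tool also includes a helper, openai tools fine_tunes.prepare_data, that validates and cleans files in this format.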