Using ChatGPT to Evaluate Generated Text
How good is ChatGPT for evaluating automatic summarization, story generation, and data-to-text generation?

ChatGPT has an impressive ability to perform natural language processing (NLP) tasks with simple instructions.
In a previous article, I presented and discussed the research work by Jiao et al. (2023), who evaluated the ability of ChatGPT to translate. The results are impressive and almost comparable to those of standard machine translation systems.
But ChatGPT can do much more than translate. The challenge is to find the applications for which ChatGPT is actually good, or even better than existing systems.
In this article, I review the work by Wang et al. (2023), who studied the ability of ChatGPT to evaluate natural language generation (NLG):
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
The main objective is to find out whether a system like ChatGPT can be used to judge the quality of text generated by NLG systems such as automatic summarizers, story generators, and data-to-text generators.
We also want to know how ChatGPT compares with other existing evaluation tools for NLG.
Prompt Engineering
When using large language models such as GPT-3 or ChatGPT, prompt engineering is a critical step to get the best answers for your particular use cases.
For NLG evaluation, we must indicate to ChatGPT what we want to evaluate and with what criteria.
Keep in mind that ChatGPT is not a metric and does not implement metrics. If you ask it to evaluate a text with BLEU or BERTScore, two popular metrics for NLG, it may or may not compute the score correctly. Note: In my own experiments with ChatGPT, the behavior seemed random: sometimes ChatGPT simply refuses to compute a score, saying that it cannot do it; other times it produces a score and even an example implementation of the metric written in Python.
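If you actually need a score from an established metric such as BLEU, it is safer to compute it with a dedicated library rather than asking ChatGPT for it. Below is a minimal sketch using the sacrebleu package; the package choice and the example sentences are my own, not from the paper.

```python
# A minimal sketch of computing BLEU directly with the sacrebleu library
# instead of asking ChatGPT for it (library choice and example strings are mine).
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale
```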
To use ChatGPT as an evaluator, first we need to define the task and the aspect to evaluate.
For instance, if we want to evaluate the fluency of a generated text, we should indicate it in the prompt along with a scale. The authors propose the following template:
Score the following [task-ins] with respect to [aspect] with one to five stars, where one star means “[ant-aspect]” and five stars means “perfect [aspect]”. Note that [aspect] measures [aspect-ins].
[Conditioned Text]
[Generated Text]
Stars:
Where:
- [task-ins]: The NLG task that generated the text to evaluate.
- [aspect]: The aspect of the generated text that you want to evaluate.
- [ant-aspect]: The opposite or minimum of the aspect you want to evaluate.
- [aspect-ins]: Details about the aspect to evaluate.
- [Conditioned Text]: The input of your NLG task.
- [Generated Text]: The output of your NLG task.
So for the evaluation of news summarization, we would get:
Score the following news summarization given the corresponding news with respect to fluency with one to five stars, where one star means “disfluency” and five stars means “perfect fluency”. Note that fluency measures the quality of individual sentences, are they well-written and grammatically correct. Consider the quality of individual sentences.
News: [a news article]
Summary: [one generated summary]
Stars:
ChatGPT would reply with a number, for instance “5” to indicate a very good summary.
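To make the workflow concrete, here is a minimal sketch in Python of how the template could be filled and sent to ChatGPT. The helper functions, the use of the legacy openai ChatCompletion API, and the star-parsing logic are my own assumptions, not the exact setup used by Wang et al. (2023).

```python
# A minimal sketch of filling the evaluation template and asking ChatGPT for a score.
# Helper names and the legacy openai ChatCompletion call are my own assumptions.
import re
import openai  # requires openai<1.0 for this legacy call; set openai.api_key first

TEMPLATE = (
    'Score the following {task_ins} with respect to {aspect} with one to five '
    'stars, where one star means "{ant_aspect}" and five stars means '
    '"perfect {aspect}". Note that {aspect} measures {aspect_ins}.\n\n'
    '{conditioned_text}\n{generated_text}\nStars:'
)

def build_prompt(task_ins, aspect, ant_aspect, aspect_ins,
                 conditioned_text, generated_text):
    """Fill the evaluation template for a given task and aspect."""
    return TEMPLATE.format(
        task_ins=task_ins, aspect=aspect, ant_aspect=ant_aspect,
        aspect_ins=aspect_ins, conditioned_text=conditioned_text,
        generated_text=generated_text,
    )

def chatgpt_stars(prompt, model="gpt-3.5-turbo"):
    """Send the prompt to ChatGPT and return the first integer found in the reply."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce randomness in the scoring
    )
    reply = response["choices"][0]["message"]["content"]
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None

prompt = build_prompt(
    task_ins="news summarization given the corresponding news",
    aspect="fluency",
    ant_aspect="disfluency",
    aspect_ins="the quality of individual sentences, are they well-written "
               "and grammatically correct",
    conditioned_text="News: [a news article]",
    generated_text="Summary: [one generated summary]",
)
print(chatgpt_stars(prompt))  # e.g. 5
```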
I cannot think of an NLG task for which this template would not work. The authors don’t provide one either.
The template is also quite flexible: you can change the scale, for instance. Note: I wouldn't use a large scale, such as 0 to 100, without expecting some inconsistency in the evaluation, e.g., two similar summaries receiving very different scores. In other words, we don't know whether ChatGPT can be used for fine-grained evaluation.
Experiments and Results
The authors evaluated the ability of ChatGPT to evaluate text generated for the following tasks:
- Automatic summarization
- Story generation
- Data-to-text generation
They used existing datasets of human judgments as reference.
We want to measure the correlation between ChatGPT's scores and these human judgments. The authors of the study measured it with the standard Spearman, Pearson, and Kendall's Tau correlations, as illustrated below.
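Here is how these three correlations could be computed with SciPy once we have ChatGPT's scores and the human judgments for the same outputs; the numbers below are made up for the example, not data from the paper.

```python
# A minimal sketch of computing the three correlations with SciPy.
# The scores below are made-up examples, not data from the paper.
from scipy.stats import spearmanr, pearsonr, kendalltau

human_scores   = [4.5, 2.0, 3.5, 5.0, 1.5]  # human judgments for five outputs
chatgpt_scores = [5, 2, 4, 5, 1]            # star ratings returned by ChatGPT

print("Spearman:", spearmanr(human_scores, chatgpt_scores)[0])
print("Pearson: ", pearsonr(human_scores, chatgpt_scores)[0])
print("Kendall: ", kendalltau(human_scores, chatgpt_scores)[0])
```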
As baseline metrics, they used ROUGE, BERTScore, MoverScore, PRISM, and BARTScore.
These metrics are all commonly used or state-of-the-art for NLG evaluation. For instance, ROUGE is the most popular metric for automatic summarization.
You can find a description of BERTScore and PRISM in my previous article.
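For reference, here is a minimal sketch of running two of these baselines with commonly used Python packages (rouge-score and bert-score). The package choices and example texts are mine; they are not necessarily the implementations used in the paper.

```python
# A minimal sketch of two baseline metrics, ROUGE and BERTScore.
# Package choices (rouge-score, bert-score) and example texts are mine.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The prime minister announced a new climate plan on Monday."
summary = "A new climate plan was announced by the prime minister."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, summary))

# BERTScore: similarity of contextual token embeddings (downloads a model on first run)
P, R, F1 = bert_score([summary], [reference], lang="en")
print(F1.mean().item())
```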
How does ChatGPT compare to these metrics?
Let’s look at the results for automatic summarization:
Note: “Pear.” is the Pearson correlation, not “Pearman”. The authors made this error in all the tables.
The results are impressive. ChatGPT correlates better with human evaluation than the regular metrics for evaluating the coherence and the relevance of a summary.
For consistency and fluency, it is also better than all the other metrics, except BARTScore fine-tuned on CNN and paraphrase data.
For story generation, ChatGPT again outperforms all the other metrics:
The gap between ChatGPT and the other metrics is actually so large that I wonder whether ChatGPT was trained on this particular dataset.
And then they present the results for the data-to-text task:
The advantage of ChatGPT is much less clear on this last task. Yet, it performs among the best evaluation metrics.
The variation in ChatGPT's performance from one task to another may also be due to the prompt used: ChatGPT may need a different template or a more detailed description of the task.
Limitations
The authors also transparently discuss the limitations of their study.
The main limitation is that they evaluated only one template for the prompts. They may get much better results by using a more flexible template that could better adapt to particular NLG tasks.
Another limitation is that they evaluated ChatGPT on only three NLG tasks and three datasets. Many more standard NLG tasks, with datasets in different domains and languages, are available.
For instance, dialogue response generation, question answering, and paraphrasing are all well-studied tasks with large human-rating datasets available for evaluation.
Extending the evaluation to other NLG tasks is desirable to assess the generalizability of their conclusions.
Finally, the authors didn’t discuss the possibility that ChatGPT could have been trained on the data they used for evaluation. These datasets are publicly available, so OpenAI may well have used them to fine-tune ChatGPT to better evaluate NLG, i.e., we can expect data leakage.
Even though we can’t verify it, I think the possibility of data leakage should always be mentioned in papers and reports that use GPT models, to inform the reader. We don’t know what data the GPT models were trained on.
Conclusion
Evaluating NLG is an interesting application for ChatGPT and it seems to work very well for this purpose.
The results are very convincing thanks to the use of state-of-the-art metrics as baselines.
Can ChatGPT be used to evaluate generated text?
At this point, and given the evidence presented in this study, I would answer “yes”. But we definitely need more experiments with more diverse datasets to fully confirm it.
In the same line of work, there is also the study conducted by Microsoft (Kocmi and Federmann, 2023) on the ability of ChatGPT to evaluate machine translation quality. I’ll investigate it in depth in one of my next articles.
If you like this article and would be interested to read the next ones, the best way to support my work is to become a Medium member using this link:
If you are already a member and want to support this work, just follow me on Medium.