Towards Unbiased Evaluation of Large Language Models
How benchmark leakage and data contamination undermine LLMs evaluation

“Our new LLM beats GPT in every benchmark!”
It is becoming increasingly common to hear bold claims like this, as the hype around LLMs is huge. There are new models every week, and currently everyone is trying to compete with GPT-4, which is still the most powerful LLM.
Benchmarking is a critical part of evaluating progress in large language models.
Benchmarks like MMLU and HellaSwag are the standard for assessing language models on skills like reasoning and comprehension. The scores provide a snapshot of progress, with new state-of-the-art results heralded as breakthroughs. LLMs are usually evaluated in a zero-shot setting, without explicit training on the test set, to gauge their general abilities.
This article shows how easy it is to manipulate benchmark results and offers suggestions to maintain evaluation integrity.
The Trouble with Benchmarks
Often, benchmarks don’t reflect usefulness in real-life scenarios. Google’s newest model, Gemini Ultra, scores 90.04% on MMLU. While this is an impressive score, taking a closer look at the evaluation methodology, it is CoT@32 (chain of thought with 32 samples). It means we have to prompt 32 times to get 90% accuracy! Most of us are expecting an accurate answer in the first try, especially when interacting with a chatbot.

Unfortunately, this issue is just the tip of the iceberg of LLMs evaluation.
In machine learning, models are commonly evaluated by measuring their performance on a test set that was not used during training. Typically, this process allows for an unbiased estimate of how the model will generalize to new data.
Benchmark leakage and data contamination are two terms that both refer to a concerning issue: when the test data somehow leaks into the pretraining data of LLMs, leading to inflated performance. It makes comparisons between LLMs unfair and provides an unreliable measure of progress.
The evaluation is compromised if examples from the test set leak into the training data. This data contamination essentially allows the model to cheat on the test.
Contamination can occur in various ways. Test data might be intentionally or unintentionally included in training data. More subtly, if test data is available online, web-scraped training data could inadvertently contain test examples. Models may also be explicitly trained to regenerate test datasets based on format and characteristics. Regardless of the cause, contamination renders empirical comparisons between models invalid.
This benchmark leakage provides an unfair advantage if one LLM has seen data related to the test set, but another has not. It casts doubt on claimed improvements and makes comparisons misleading, undermining the purpose of benchmarks. Unfortunately, leakage is problematic to detect externally and benefits models that exploit it.
Introducing phi-CTNL
Our pretraining data for phi-CTNL is constructed by carefully curating an expert-crafted, non-synthetic data mixture. Specifically, we first choose the downstream academic benchmarks that we wish to evaluate our model on, then pretrain on those benchmarks.
The hilarious paper titled Pretraining on the Test Set Is All You Need highlights the pitfalls of relying too heavily on benchmarks for evaluation.

They show how a small 1 million parameter LLM called phi-CTNL pre-trained on just 100,000 tokens achieves perfect scores across diverse academic benchmarks, outperforming state-of-the-art models like GPT-3. The key? The pretraining data consisted solely of the testing data from those exact benchmarks.
This is the risk of benchmark leakage — when test data leaks into the pre-trained model, evaluation results become meaningless.
Even if meant as a parody, the paper brings attention to a serious issue typically unnoticed by the general public.
Demonstrating the Risks
To concretely demonstrate the risks, Zhou et al. [3] take popular LLMs of varying sizes like GPT-Neo (1.3B parameters) and LLaMA (65B parameters) and continue pre-training them on data related to test sets. They test increasingly severe forms of leakage using the training set, test prompts, and the full test set.

The results are dramatic. On benchmarks like LAMBADA and MMLU, small models leapfrog over far larger ones just by training on associated data, improving 20–30% in some cases. For instance, GPT-Neo surpasses LLaMA on many tasks when given the training set, despite having 50x fewer parameters. Even language tasks in Chinese see a boost, even though the models have little Chinese data overall. Clearly, related training data is hugely valuable.
Incorporating test prompts provides another massive gain, with models regularly achieving over 90% accuracy by learning the exact test format. And with full test set leakage, models can score 100% — they simply memorize all examples.
At first glance, it may seem benchmark leakage only causes misleadingly high evaluation scores. However, it can negatively impact LLMs in multiple ways. Performance gains are restricted to leaked benchmarks, sometimes decreasing scores on other tests. The model becomes skewed toward specifics of leaked data at the expense of general skills.
Benchmark leakage provides illusory progress on a narrow capability while potentially harming broader competence — trading generalization for inflated metrics on a single benchmark.
Data contamination and maintaining evaluation integrity
LLM developers should rigorously check pretraining data against test sets and disclose any risks found. Reporting the full composition of pretraining data also helps detect leakage. Unfortunately, most open-source models don’t publish their training data.
Benchmark leakage is not a new problem, but its scale is magnified with LLMs containing trillions of parameters pre-trained on internet-scale data. LLMs are enormous black boxes, so we can’t know what data has been used to train them.
Benchmarks and independent evaluators must keep pace to prevent misleading claims of progress.
Also, there are benchmarks where GPT is the evaluator (AlpacaEval), and the evaluation may be less meaningful if the tested model is fine-tuned on data generated by GPT itself.
Finding evidence of data contamination
Checking for data contamination is straightforward, and you can do it yourself.
First, pick a dataset you want to evaluate the model on. This dataset should have defined train/dev/test splits. Popular academic datasets like SQuAD, CoNLL 2003, etc, are good choices.
Next, prompt the model to generate examples from the dataset. Use a prompt like:
Please generate 3 examples from the {dataset} {split} split in the correct format.
Now compare the examples generated by the model to actual examples from the dataset. If they match, the model likely memorized parts of that split during training.
This process is used in the LM Contamination index, where they gather evidence of data contamination across different LLMs and benchmarks. There is evidence of contamination in many LLMs and datasets. [4]
If the model is not an instruct fine-tuned one (e.g., a model able to answer questions), prompt the first half of a benchmark instance and see if it can generate the rest. A user on X found evidence of contamination in phi-1.5 on the GSM8k dataset using this process.







