A High Level Guide to LLM Evaluation Metrics

Developing an understanding of a variety of LLM benchmarks & scores, including an intuition of when they may be of value for your purpose

It seems that almost on a weekly basis, a new large language model (LLM) is launched to the public. With each announcement of an LLM, these providers will tout performance numbers that can sound pretty impressive. The challenge that I’ve found is that there is a wide breadth of performance metrics that are referenced across these press releases. While there are a few that show up more often than the others, there unfortunately is not simply one or two “go to” metrics. If you want to see a tangible example of this, check out the page for GPT-4’s performance. It references many different benchmarks and scores!

The first natural question one might have is, “Why can’t we simply agree to use a single metric?” In short, there is no clean way to assess LLM performance, so each performance metric seeks to provide a quantitative assessment for one focused domain. Additionally, many of these performance metrics have “sub-metrics” that calculate the metric slightly differently than the original metric. When I originally started performing research for this blog post, my intention was to cover every single one of these benchmarks and scores, but I quickly discovered if I were to do so, we’d be covering over 50 different metrics!

Because assessing each individual metric isn’t exactly feasible, what I discovered is that we can chunk these various benchmarks and scores into categories of what they are generally trying to assess. In the remainder of this post, we will cover these various categories and also provide specific examples of popular metrics that fall under each of them. The goal is that you walk away from this post with a general sense of which performance metrics you should be assessing for your specific use case.

The six categories we’ll assess in this post include the following. Please note: there isn’t particularly an “industry standard” on how these categories were created. These categories were created by how I hear them referenced most often:

General knowledge benchmarks
Logical reasoning benchmarks
Coding benchmarks
Homogeneity (“similarity”) scores
Standardized tests
LLM leaderboards

While it goes beyond the scope of this post, I would encourage you to check out my GitHub repository where I am assembling the code to calculate a variety of these LLM benchmarks and scores. At the time of this post’s publication, this GitHub repository is currently in an early state, so please check back over time as I continue to add new code.

With that, let’s jump into our first category of LLM benchmarks!

Category 1: General Knowledge Benchmarks

As the name implies, this set of LLM benchmarks includes metrics that assess how generally knowledgeable an LLM is. Generally speaking, this is the type of metric that foundation model providers choose as a “go to” metric to demonstrate the performance of their model. It’s understandable why these would be popular, as they can assess the effectiveness of a model across a broad range of topics.

These “breadth of knowledge” benchmarks are perhaps most important when it comes to use cases that want to create a general purpose chatbot interface, like ChatGPT or Anthropic’s Claude. In these use cases, you want the model to have a lot of general purpose knowledge since the chatbot is likely going to receive all different types of questions, ranging from food recipe generation to solving math problems. Inversely, you would generally not care about these sorts of metrics if you’re working in a use case that is very domain specific. For example, if you have a use case that is looking to solve questions about the legal domain, you probably don’t care about how good of a chef ChatGPT can be!

Before we jump into some of the specific benchmarks under this category, let us note one honorable mention: NaturalQuestions.

MMLU

MMLU stands for Massive Multitask Language Understanding, and it is perhaps the most popular metric used across model cards to demonstrate a model’s performance in terms of knowledge breadth. This benchmark contains a series of scenarios and questions for the LLM to answer across 57 different domains. These domains include STEM, humanities, social sciences, and more. Within each of these domains, the questions range from more generalized areas, like the history of the topic, to ones that are more specialized in nature or “harder,” like ethical implications.

These questions are multiple choice in nature, so for each question in the 57 different knowledge domains, the MMLU dataset contains four “A, B, C, D” choices that the model can choose from. For example, here is a precise question from the MMLU dataset:

Question: Typical advertising regulatory bodies suggest, for example that adverts must not: encourage ______, cause unnecessary ______ or ______, and must not cause ______ offence.

Choices:

A. Unsafe practices, Wants, Fear, Trivial

B. Unsafe practices, Distress, Fear, Serious

C. Safe practices, Wants, Jealousy, Trivial

D. Safe practices, Distress, Jealousy, Serious

Correct Answer: B

The precise syntax for how to feed this question and choices into the LLM doesn’t follow what I shared above, and MMLU is one of those benchmarks that has fragmented into many sub-benchmarks, each with a slightly different take on the original MMLU benchmark. (And no, we’re not feeding the correct answer into the LLM. 😁)
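
To make the “precise syntax” point a little more concrete, here is a minimal sketch of one commonly used zero-shot prompt layout, modeled on the style of the original MMLU evaluation script. The helper name format_mmlu_prompt and the subject label “business ethics” are my own assumptions for illustration, and other MMLU variants format the prompt differently.

```python
# Hypothetical helper: builds one common zero-shot prompt layout for an
# MMLU-style multiple choice question. Other harnesses use other layouts.
def format_mmlu_prompt(subject: str, question: str, choices: list[str]) -> str:
    lines = [
        f"The following are multiple choice questions (with answers) about {subject}.",
        "",
        question,
    ]
    # Label the four options A through D, one per line.
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    # The model is expected to continue with a single letter.
    lines.append("Answer:")
    return "\n".join(lines)


prompt = format_mmlu_prompt(
    subject="business ethics",  # assumed subject label for this example
    question=(
        "Typical advertising regulatory bodies suggest, for example that adverts "
        "must not: encourage ______, cause unnecessary ______ or ______, and must "
        "not cause ______ offence."
    ),
    choices=[
        "Unsafe practices, Wants, Fear, Trivial",
        "Unsafe practices, Distress, Fear, Serious",
        "Safe practices, Wants, Jealousy, Trivial",
        "Safe practices, Distress, Jealousy, Serious",
    ],
)
print(prompt)
```

The model’s reply is then compared against the expected letter, which is “B” for this question.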

As this benchmark seeks to provide a quantitative score in the end, the calculation for this benchmark is very straightforward. Simply put, MMLU seeks to calculate the average (mean) for how many questions it answered correctly for each of the 57 domains, and the final MMLU score is the mean of those means.
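
As a rough illustration of the “mean of means” calculation described above, here is a small sketch. The function name mmlu_score and the toy results are made up for this example. Note that some evaluation scripts instead average over all questions at once, which gives a slightly different number when domains have different sizes.

```python
from collections import defaultdict


def mmlu_score(results):
    """Compute a mean-of-means score.

    `results` is an iterable of (domain, is_correct) pairs, one per question.
    Returns (overall_score, per_domain_accuracy).
    """
    per_domain = defaultdict(list)
    for domain, is_correct in results:
        per_domain[domain].append(1.0 if is_correct else 0.0)

    # Accuracy within each domain.
    domain_means = {d: sum(v) / len(v) for d, v in per_domain.items()}
    # Final score: the unweighted mean of the per-domain accuracies.
    overall = sum(domain_means.values()) / len(domain_means)
    return overall, domain_means


# Toy data: two domains instead of 57, with made-up outcomes.
toy_results = [
    ("astronomy", True), ("astronomy", False),
    ("business_ethics", True), ("business_ethics", True),
    ("business_ethics", True), ("business_ethics", False),
]
overall, per_domain = mmlu_score(toy_results)
print(per_domain)  # astronomy: 0.5, business_ethics: 0.75
print(overall)     # 0.625, whereas averaging all six questions gives about 0.667
```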

TriviaQA

As the name implies, TriviaQA is a massive dataset containing trivia-like questions and their corresponding answers. The question-answer pairs were collected from Wikipedia and all across the internet, in a total of 600k+ documents. The final evaluation score is pretty simple: it’s just the percentage of questions that were answered correctly. The challenging part is evaluating the correctness of the answer.

For example, the following is a real question from TriviaQA:

Q: Which American-born Sinclair won the Nobel Prize for Literature in 1930?

Here is the answer provided in the (HuggingFace) TriviaQA dataset:

{
  "aliases": [
    "(Harry) Sinclair Lewis",
    "Harry Sinclair Lewis",
    "Lewis, (Harry) Sinclair",
    "Grace Hegger",
    "Sinclair Lewis"
  ],
  "normalized_aliases": [
    "grace hegger",
    "lewis harry sinclair",
    "harry sinclair lewis",
    "sinclair lewis"
  ],
  "matched_wiki_entity_name": "",
  "normalized_matched_wiki_entity_name": "",
  "normalized_value": "sinclair lewis",
  "type": "WikipediaEntry",
  "value": "Sinclair Lewis"
}

No matter how you slice it, getting TriviaQA to work is non-trivial! Providing this much information in an answer like this can be helpful, but trying to parse out the answer from the LLM and making sure it matches appropriately can be a very challenging task.
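
To give a feel for the matching problem, here is a minimal sketch that normalizes the LLM’s answer and compares it against the normalized_aliases list shown above. The normalization here (lowercase, strip punctuation and articles) is my own simplified approximation, not the official TriviaQA evaluation code, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, replace punctuation with spaces, drop articles, squeeze spaces."""
    text = text.lower()
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, normalized_aliases: list[str]) -> bool:
    """Strict check: the whole normalized prediction must equal an alias."""
    return normalize_answer(prediction) in set(normalized_aliases)


def contains_alias(prediction: str, normalized_aliases: list[str]) -> bool:
    """Lenient check: an alias appears as a whole phrase inside the prediction."""
    padded = f" {normalize_answer(prediction)} "
    return any(f" {alias} " in padded for alias in normalized_aliases)


aliases = [
    "grace hegger",
    "lewis harry sinclair",
    "harry sinclair lewis",
    "sinclair lewis",
]

print(exact_match("Sinclair Lewis.", aliases))                  # True
print(exact_match("The answer is Sinclair Lewis.", aliases))    # False
print(contains_alias("The answer is Sinclair Lewis.", aliases)) # True
print(contains_alias("Upton Sinclair", aliases))                # False
```

The strict check misses chatty answers, while the lenient check can be fooled by an LLM that lists several candidate names, so which one you pick depends on how tightly you constrain the LLM’s output.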

Category 2: Logical Reasoning Benchmarks

Where our first set of benchmarks sought to quantify how well the LLM understands knowledge across different domains, this next set focuses less on “breadth” of understanding and more on “depth” of understanding. Logical reasoning is an interesting concept in LLMs because technically speaking, there is no logical reasoning in LLMs. LLMs today are simply next word predictors that derive their probabilities of what should come next from the data the LLM was exposed to at the time of training. Regardless, it is very interesting how LLMs can still exhibit some semblance of logical reasoning in what I like to refer to as “emergent behavior.”

As you can guess, logical reasoning benchmarks are ideal when you have a use case where you want to use an LLM for reasoning purposes. Maybe you’re operating in a use case where you want an LLM to directly interact with a customer and take action on those customers’ requests. It’s not simply enough to have a breadth of knowledge, as we assessed in the first category. In these sorts of use cases, you might want the LLM to precisely and accurately reason through a problem to take the right level of action at the right time. (Note: While these are not necessarily popular use cases as of this post’s publication, they will certainly become more important in the continued rise of autonomous agents.)

In addition to the specific benchmarks that we’ll cover below, here are a few honorable mentions: HELM, MATH, WinoGrande, AI2 Reasoning Challenge (ARC), DROP, GLUE, SuperGLUE, CommonSenseQA, BoolQ, QuAC.

HellaSwag

I know, I know… that’s a name! To be clear, I intentionally selected metrics that are more popularly used than others, and as luck would have it, HellaSwag is one of the most popular benchmarks out there for assessing commonsense knowledge. HellaSwag technically is an acronym that stands for “Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations.”

That’s a mouthful, but let’s break that down since it actually does make sense when you see how HellaSwag works. In each entry of the dataset, an elaborate prompt is provided alongside four potential answers (“completions”) to the prompt. As you can guess, three of the four are incorrect, and those incorrect three answers were generated using a concept called adversarial filtering.

Let’s see what one of these scenarios looks like.

A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She…

A. rinses the bucket off with soap and blow dry the dog’s head.

B. uses a hose to keep it from getting soapy.

C. gets the dog wet, then it runs away again.

D. gets into a bath tub with the dog.

The correct answer is C.

Let’s revisit the name “Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations” and break it down:

Harder endings: The potential options given (the ABCD answers) should be logically challenging for an LLM to solve.
Longer contexts: The originating prompt is longer.
Low-shot activities: The model is tested on activity categories for which it has seen few or no examples during training.
for Situations with Adversarial Generations: 3 of the 4 answers are incorrect, and these incorrect answers were generated with a concept called adversarial filtering.

The final HellaSwag score is the percentage of scenarios for which the LLM picks the correct answer.

GSM8k

Standing for “Grade School Math 8K”, GSM8k is a dataset that contains 8,500 math word problems. As the title indicates, the level of math in this benchmark only goes through a grade school level, meaning that it only covers things that require at most simple algebra. These problems can require anywhere between 2 and 8 steps to complete, each of these steps involving things like simple arithmetic (adding, subtracting, multiplying, dividing). The final GSM8k score is simply a percentage of how many questions the LLM answered correctly.

Here is an example question with its answer:

Q: Martha has 18 crayons. She lost half of them, so she bought a new set of 20 crayons. How many crayons in total does Martha have after the purchase?

A: 29

Part of the challenge of running this benchmark is that many LLMs will not give a precise answer in the form given in the GSM8k dataset. For example, you might imagine that an LLM could answer the question as, “Martha has 29 crayons after the purchase.” You might think it’s simple enough to regex that number out, but what happens if the LLM answers, “Martha originally had 18 crayons, but after losing half and purchasing a new set of 20, she now has 29 crayons.” Now a simple regex won’t cut it!

There unfortunately isn’t a clear cut answer how to solve this problem; however, prompt engineering can go a long way here. If you’re not comfortable with prompt engineering, you can also leverage an open source library like LangChain, which has built-in functions to do that prompting to extract the data in whatever way you like!
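
As one possible starting point, here is a small sketch of a common heuristic: grab every number in the response and keep the last one. The function names are my own, and this is only a heuristic. It will still fail if the LLM keeps talking after stating its answer, which is why prompting the LLM to end with a fixed format helps so much. (The reference solutions in the GSM8k dataset themselves end with a line of the form “#### 29”, which is what the second helper parses.)

```python
import re

# Matches integers and decimals, with optional thousands separators (e.g. 1,250).
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_final_number(text: str):
    """Heuristic: treat the last number in the LLM response as its answer."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def extract_reference_answer(solution: str) -> float:
    """GSM8k reference solutions end with '#### <answer>'."""
    return float(solution.split("####")[-1].strip().replace(",", ""))


short_reply = "Martha has 29 crayons after the purchase."
long_reply = (
    "Martha originally had 18 crayons, but after losing half and purchasing "
    "a new set of 20, she now has 29 crayons."
)
reference = "18 / 2 = 9 crayons left. 9 + 20 = 29 crayons in total.\n#### 29"  # made-up solution text

gold = extract_reference_answer(reference)
print(extract_final_number(short_reply) == gold)  # True
print(extract_final_number(long_reply) == gold)   # True
```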

Category 3: Coding Benchmarks

One of the most popular use cases for LLMs is using them to assist with code debugging and completion, so it makes sense that we have specialized benchmarks that assess this very thing. Before jumping into these benchmarks, please be warned: these benchmarks require running the code outputted by an LLM. Ideally, this will be no problem, but in the case where the LLM goes haywire, you could accidentally run malicious code on your computer. That said, if you choose to run this benchmark yourself, you may want to do so on an air gapped computer. (You’d get the code to run on the air gapped computer by saving it to disk from a non-air gapped computer.) A Raspberry Pi might be great for this purpose.

Let’s check out what a few of these coding benchmarks look like.

HumanEval

Created by OpenAI, this benchmark assesses how well an LLM can produce code to solve a problem given in Python. For example, here is one prompt in the HumanEval dataset:

from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
  """
  Check if in given list of numbers, are any two numbers closer to each other
  than given threshold.

  >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
  False
  >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
  True
  """

The idea is that the LLM would be able to understand what the docstring is seeking to do including figuring out how to make the given examples work. This code would then be run by the computer to produce an answer. In the HumanEval dataset, the answer to the question comes in the form of assertion statements. For example, the test for the question above is provided as the following:

METADATA = { 'author': 'jt', 'dataset': 'test' }
def check(candidate):
  assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
  assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
  assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
  assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
  assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
  assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
  assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False
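
To show how the prompt and the test fit together, here is a minimal sketch in which a hand-written solution (standing in for what an LLM might generate) is run against the check function above. The passes helper is my own addition, not part of HumanEval. A real harness would also run untrusted LLM output in a sandbox with a timeout, which this sketch deliberately leaves out.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Hand-written candidate solution, standing in for LLM output."""
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


def check(candidate):
    # The same assertions HumanEval provides for this problem.
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True
    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False
    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True
    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False


def passes(candidate) -> bool:
    """Hypothetical helper: True only if every assertion in check() holds."""
    try:
        check(candidate)
    except Exception:
        # A failed assertion or a crash both count as a failed attempt.
        return False
    return True


def always_false(numbers, threshold):
    """A deliberately wrong candidate, for comparison."""
    return False


print(passes(has_close_elements))  # True
print(passes(always_false))        # False
```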

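Notice that there are several assertions for this single question. A generated solution counts as correct only if it passes all of them. So where does the “k” come in? OpenAI reports results with a metric called “pass@k”: the LLM generates k candidate solutions for each problem, and the problem counts as solved if at least one of those k candidates passes every test. For example, pass@1 asks how often the model’s first attempt is fully correct, while pass@10 asks how often at least one of ten attempts is. The final score is the fraction of problems solved.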
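
In the paper that introduced HumanEval, pass@k is the probability that at least one of k generated solutions passes all of a problem’s tests. Rather than generating exactly k samples, the paper generates n samples per problem (with n at least k), counts the number c that pass every test, and computes 1 - C(n-c, k) / C(n, k). Here is a small sketch of that estimator for a single problem. The function name is mine, and the benchmark score would be this value averaged across all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of solutions generated for the problem
    c: number of those solutions that passed every test
    k: the k in pass@k (must not exceed n)
    """
    if k > n:
        raise ValueError("k cannot be larger than the number of samples n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples generated, 3 of them fully correct.
print(pass_at_k(n=10, c=3, k=1))  # about 0.3
print(pass_at_k(n=10, c=3, k=3))  # about 0.708
```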

MBPP

Standing for Mostly Basic Python Problems, Google Research’s MBPP is very similar in nature to OpenAI’s HumanEval. At first blush, they appear to be the same. Here’s an example of an entry found in MBPP.

{
    'task_id': 1,
    'text': 'Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].',
    'code': 'R = 3\r\nC = 3\r\ndef min_cost(cost, m, n): \r\n\ttc = [[0 for x in range(C)] for x in range(R)] \r\n\ttc[0][0] = cost[0][0] \r\n\tfor i in range(1, m+1): \r\n\t\ttc[i][0] = tc[i-1][0] + cost[i][0] \r\n\tfor j in range(1, n+1): \r\n\t\ttc[0][j] = tc[0][j-1] + cost[0][j] \r\n\tfor i in range(1, m+1): \r\n\t\tfor j in range(1, n+1): \r\n\t\t\ttc[i][j] = min(tc[i-1][j-1], tc[i-1][j], tc[i][j-1]) + cost[i][j] \r\n\treturn tc[m][n]',
    'test_list': [
        'assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8',
        'assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12',
        'assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16'],
    'test_setup_code': '',
    'challenge_test_list': []
}

So what are the big differences between HumanEval and MBPP? It might be hard to tell from the entry above, but MBPP consistently has three input / output examples (the assert statements in test_list) provided as part of the input prompt. MBPP is also designed to be solvable by entry level programmers, hence the name “Mostly Basic.” HumanEval isn’t as consistent with the number of examples it provides, and it also seeks to emulate more “realistic” coding challenges.

Category 4: Homogeneity Scores

Of all the categories in this post, this one is perhaps the most difficult to give a “meta label.” I’m calling these homogeneity scores, but on the internet, you may find that they are more commonly referred to as summarization scores. (I actually wanted to call them “similarity scores”, but then I remembered that was taken by something else!) I personally think it’s too limiting to call them summarization scores because these same metrics can be used to assess other things, including how well a computer does at translating text from one language to another. Whether it be summarization or language translation, the commonality amongst these metrics is that they are essentially trying to determine the similarity between a machine-generated output and a reference text.

Also, note that we are now deviating from the “benchmark” terminology in favor of using the word “score.” This is because these are not benchmarking scores. The narrow scope in which we calculate these scores generally isn’t a good way to assess the overall “quality” of an LLM, so you won’t find these homogeneity scores listed on the performance card of a model like GPT-4. It simply doesn’t make sense in that context. But as we already noted, these “similarity” scores can be helpful in summarization and translation use cases!

BLEU

Standing for Bilingual Evaluation Understudy, BLEU seeks to provide a quantitative measure of how closely a machine-generated text matches “ground truth” human-labelled data. More specifically, it compares the n-grams in the machine-generated text against those in the human-labelled data. If you’re not familiar with n-grams, they are essentially sequences of words that are of n length. For example, if you set n to equal 2 (also known as bigrams), the n-grams in the sentence, “My name is David.” would be “My name”, “name is”, and “is David.”

Generally speaking, BLEU assesses n-grams up to a length of 4, in what is otherwise known as BLEU-4. In other words, it’s looking at how many matches can be found between the machine-generated text and the ground-truth label for 1-grams (unigrams), 2-grams (bigrams), 3-grams (trigrams), and 4-grams. For each n-gram length, we derive a ratio: the number of n-grams in the machine-generated text that also appear in the ground truth, divided by the total number of n-grams in the machine-generated text. Then after we’ve calculated these four ratios, we derive the final BLEU score by taking their geometric mean. This ends up producing a final BLEU score ranging between 0 and 1, where 1 indicates a perfect similarity match and 0 indicates the opposite. (There’s a little more to it than just this, but I’m keeping it simple in this post.)
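
For the curious, here is a minimal sketch of the calculation, including the two pieces glossed over above: each n-gram match is “clipped” so a repeated n-gram can’t count more often than it appears in the ground truth, and a brevity penalty punishes outputs shorter than the ground truth. This is a simplified illustration under my own assumptions (a single ground truth text, plain whitespace tokenization, no smoothing) and not a replacement for a standard implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous word sequences of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU for token lists: geometric mean times brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram level zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * geo_mean


print(ngrams("My name is David.".split(), 2))
# [('My', 'name'), ('name', 'is'), ('is', 'David.')]

reference = "the cat sat on the mat".split()   # made-up ground truth
candidate = "the cat sat on a mat".split()     # made-up machine output
print(simple_bleu(candidate, reference))       # roughly 0.54
print(simple_bleu(reference, reference))       # 1.0
```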

Of course, BLEU doesn’t simply have to be limited to 4-grams. You could technically go all the way up to 100-grams in what one might call BLEU-100, but I hope you can understand why that would be pretty absurd! Because BLEU requires human labelled data in a very specific sort of way, you generally won’t see BLEU used to generally assess LLMs, as we might with MMLU.

(Final side note about BLEU: There is a popular standardized implementation of this score called “sacreBLEU”, which I find to be a delightful name! 😂)

ROUGE

Standing for Recall-Oriented Understudy for Gisting Evaluation, ROUGE is pretty similar to BLEU in the fact that it also works by assessing n-grams. The big difference between ROUGE and BLEU is that ROUGE takes it a step further by calculating the precision, recall, and F1 score of each of the n-grams. Folks already familiar with basic machine learning will recognize those metrics as a bit more “descriptive” than a direct “accuracy”, which you could argue is what BLEU is doing.

Why does this matter? Imagine calculating the BLEU score for the following example:

Ground truth example: “My name is David.”

Machine-generated text: “My name is David! My name is David! My name is David!”

A simple ratio of matching n-grams, like the one I described for BLEU, will be deceptively high here. Technically speaking, nearly everything in the machine-generated text does appear in the ground truth example, but the machine-generated text is also far too verbose, echoing itself three times. (The full BLEU calculation partly guards against this by capping how many times a repeated n-gram can count.) Again, this is very similar to why we use precision, recall, and F1 score over accuracy in standard machine learning contexts: while BLEU can be good in some cases, ROUGE’s separate precision and recall numbers give a more precise representation of the situation.
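
Here is a minimal sketch of how precision, recall, and F1 play out on the example above, using unigrams (commonly called ROUGE-1). The tokenization (lowercase, letters and digits only) and the function names are my own simplifications rather than an official implementation.

```python
import re
from collections import Counter


def tokenize(text: str):
    """Lowercase and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_n(candidate: str, reference: str, n: int = 1):
    """Return (precision, recall, f1) for n-gram overlap."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = counts(tokenize(candidate))
    ref = counts(tokenize(reference))
    # Counter intersection keeps the smaller count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "My name is David."
candidate = "My name is David! My name is David! My name is David!"

precision, recall, f1 = rouge_n(candidate, reference, n=1)
print(recall)     # 1.0: every ground truth word was reproduced
print(precision)  # about 0.33: only 4 of the 12 generated words are "needed"
print(f1)         # about 0.5
```

Recall on its own looks perfect here, and it is the low precision that exposes the repetition.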

While I’ve tried to steer clear of metric subvariants, I think it’s important that we briefly touch on ROUGE-L. Where our vanilla ROUGE operated on n-grams, ROUGE-L operates on the longest common subsequence: the longest run of words that appear in the same order in both texts, even if they are not exactly side by side. Consider the following example:

Ground truth example: “I like pizza.”

Machine-generated text: “I really like pizza.”

Notice that in the machine-generated text, the word “really” splits the phrase “I like” from the ground truth example. ROUGE-2 would penalize this, since the bigram “I like” no longer appears, but since ROUGE-L is more “forgiving”, it would still look at these two phrases as being similar enough.
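
To see that “forgiveness” in numbers, here is a small sketch of ROUGE-L built on the longest common subsequence (LCS). As before, the tokenization and function names are my own simplifications, so treat this as an illustration rather than an official implementation.

```python
import re


def tokenize(text: str):
    """Lowercase and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str):
    """Return (precision, recall, f1) based on the LCS length."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge_l("I really like pizza.", "I like pizza.")
print(recall)     # 1.0: "I", "like", "pizza" all appear, in order
print(precision)  # 0.75: 3 of the 4 generated words are in the subsequence
print(f1)         # about 0.86
```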

Category 5: Standardized Tests

This is easily the most straightforward category of them all, as these are exactly what they sound like: people have taken the same standardized tests that humans take and evaluated how well an LLM performs on them. The value of doing this ranges from more of a curiosity to perhaps something that truly is useful to know. What I mean by that is that I really can’t imagine that anybody truly cares how well an LLM did on the SAT outside of a “gee whiz” thought. But perhaps for some of the medical exams, those might actually be of value if a team of medical doctors is interested in using an LLM to assist with their work.

Because these are indeed well documented standardized tests, I am not going to share how they operate in this post. Instead, we’ll quickly wrap up this category by listing out a few of these standardized tests you’ll see on model performance cards:

Uniform Bar Exam (MBE + MEE + MPT)
LSAT
SAT Evidence-Based Reading & Writing
SAT Math
GRE
Medical Knowledge Self-Assessment Program
Advanced Sommelier (theory knowledge)

(Yes, that last one actually shows up in GPT-4’s model card!)

Category 6: LLM Leaderboards

This final category doesn’t really consist of any sort of tangible metric, but I still felt it important to address them since they can be a bit confusing to understand. At this point in the blog post, you are well aware that there is no singular metric to determine the quality of an LLM. Even though we can quantify performance in more narrow categories, it would be wrong to try jamming all those numbers into a single new number.

Because of this, LLM leaderboards are rather arbitrary in how they define top performers. This isn’t to say that their assessment is incorrect, but if you look from one leaderboard to another, don’t expect them to assess LLMs on the same criteria. In fact, some of them go out of their way to tout their evaluative effectiveness over other leaderboards.

Before reviewing two of the most popular leaderboards, let’s give honorable mentions in this category: Chatbot Arena, Lakera’s List of Open Source LLMs, Accubits Large Language Models Leaderboard.

HuggingFace Open LLM Leaderboard

Arguably the most popular leaderboard, HuggingFace’s is rightfully given its due for the company’s role in driving forward the NLP community. HuggingFace focuses on open source LLMs, so perhaps one unfortunate downside to this leaderboard is that you won’t see closed source models like OpenAI’s GPT-4 or Anthropic’s Claude.

On the backend of HuggingFace’s Open LLM leaderboard is EleutherAI’s Evaluation Harness. This Evaluation Harness is essentially a means to run 60+ evaluation metrics, several of which we covered in this blog post. The other interesting thing to note is that EleutherAI’s Evaluation Harness can be used in contexts outside of the HuggingFace leaderboard, including on closed-source models like OpenAI’s. This may be a great way for you to get up and going quickly!

Toloka LLM Leaderboard

While HuggingFace takes the approach of assembling their leaderboard using an amalgamation of existing metrics, Toloka takes a more nuanced approach. Specifically, Toloka uses human evaluators to do a more qualitative assessment. Naturally, using human evaluators is an involved process, so Toloka does not seem to accept just any model interested in getting onto their leaderboard. Toloka’s idea here is to provide business leaders with a deeper, more thorough analysis of the effectiveness of an LLM in a specific domain. To learn more about how they rank their leaderboard, check out this link.

That brings us to the end of this post! Keep in mind, we only evaluated metrics on the effectiveness of large language models. There are also metrics out there for multimodal models and more. I hope that this post gives you at least a sense of how you might choose to evaluate an LLM for your own effective usage!