Summary

This article explains ROUGE, a set of metrics for assessing the quality of automatic summarization and machine translation outputs by comparing their n-grams, longest common subsequences, and skip-grams against reference texts.

Abstract

The article covers the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, a widely used tool in Natural Language Processing (NLP) for evaluating summarization and translation models. It details the ROUGE-N, ROUGE-L, and ROUGE-S metrics, explaining how each compares model outputs with human-generated references through n-gram matching, longest common subsequences, and skip-gram co-occurrences. The article also discusses the pros and cons of ROUGE, highlighting its positive correlation with human judgment and its language independence, while noting its limitation in capturing semantic meaning. Additionally, it contrasts ROUGE with the BLEU metric, emphasizing ROUGE's focus on recall over precision. The article concludes by pointing readers to the Python rouge library for computing ROUGE and encourages further learning through NLPlanet's resources.

Opinions

The ROUGE metric is praised for its positive correlation with human evaluation and its cost-effectiveness in computational terms.
ROUGE is acknowledged for being language-independent, enhancing its applicability across different languages.
A limitation of ROUGE is identified in its inability to account for words with similar meanings, as it relies on syntactic matches.
The article suggests that ROUGE and BLEU metrics are complementary, each with its own focus on recall and precision, respectively.
The Python rouge library is recommended for its ease of use in implementing ROUGE metrics.
Readers are encouraged to follow NLPlanet for more insights and resources in the field of NLP.

Two minutes NLP — Learn the ROUGE metric by examples

ROUGE-N, ROUGE-L, ROUGE-S, pros and cons, and ROUGE vs BLEU

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package designed for evaluating automatic summarization, which can also be used for machine translation. The metrics compare an automatically produced summary or translation against reference (high-quality, human-produced) summaries or translations.

In this article, we cover the main metrics used in the ROUGE package.

ROUGE-N

ROUGE-N measures the number of matching n-grams between the model-generated text and a human-produced reference.

Consider the reference R and the candidate summary C:

R: The cat is on the mat.
C: The cat and the dog.

ROUGE-1

Using R and C, we are going to compute the precision, recall, and F1-score of the matching n-grams. Let’s start computing ROUGE-1 by considering 1-grams only.

ROUGE-1 precision can be computed as the number of unigrams in C that also appear in R (namely the words “the”, “cat”, and “the”), divided by the number of unigrams in C.

ROUGE-1 precision = 3/5 = 0.6

ROUGE-1 recall can be computed as the number of unigrams in R that also appear in C (namely the words “the”, “cat”, and “the”), divided by the number of unigrams in R.

ROUGE-1 recall = 3/6 = 0.5

Then, ROUGE-1 F1-score can be directly obtained from the ROUGE-1 precision and recall using the standard F1-score formula.

ROUGE-1 F1-score = 2 * (precision * recall) / (precision + recall) = 2 * (0.6 * 0.5) / (0.6 + 0.5) ≈ 0.55
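The ROUGE-1 arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch rather than the official ROUGE implementation: the helper names `tokenize` and `rouge_1` are made up for illustration, and the tokenization (lowercasing and dropping punctuation) is an assumption chosen to match the hand computation.

```python
from collections import Counter
import re


def tokenize(text):
    # Assumption: lowercase the text and keep only word characters,
    # so "The" and "the" match and the final "." is ignored.
    return re.findall(r"\w+", text.lower())


def rouge_1(candidate, reference):
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each word,
    # so "the" is matched twice because it appears twice in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f1 = rouge_1("The cat and the dog.", "The cat is on the mat.")
print(p, r, round(f1, 3))  # 0.6 0.5 0.545
```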

ROUGE-2

Let’s try computing the ROUGE-2 considering 2-grams.

Remember our reference R and candidate summary C:

R: The cat is on the mat.
C: The cat and the dog.

ROUGE-2 precision is the ratio of the number of 2-grams in C that appear also in R (only the 2-gram “the cat”), over the number of 2-grams in C.

ROUGE-2 precision = 1/4 = 0.25

ROUGE-2 recall is the ratio of the number of 2-grams in R that appear also in C (only the 2-gram “the cat”), over the number of 2-grams in R.

ROUGE-2 recall = 1/5 = 0.20

Therefore, the F1-score is:

ROUGE-2 F1-score = 2 * (precision * recall) / (precision + recall) = 0.22
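The same counting generalizes to any n. The sketch below is an illustrative helper (the names `ngrams` and `rouge_n` are hypothetical, and the lowercase, punctuation-free tokenization is an assumption), not the code of the ROUGE package.

```python
from collections import Counter
import re


def ngrams(tokens, n):
    # All contiguous windows of n tokens, as tuples so they can be counted.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, reference, n):
    # Assumption: lowercase word tokens, punctuation dropped.
    cand = Counter(ngrams(re.findall(r"\w+", candidate.lower()), n))
    ref = Counter(ngrams(re.findall(r"\w+", reference.lower()), n))
    # Number of n-grams shared by both texts, respecting repeats.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f1 = rouge_n("The cat and the dog.", "The cat is on the mat.", n=2)
print(p, r, round(f1, 2))  # 0.25 0.2 0.22
```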

ROUGE-L

ROUGE-L is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both. A longer shared sequence should indicate more similarity between the two sequences.

We can compute ROUGE-L recall, precision, and F1-score just like we did with ROUGE-N, but this time we replace the number of matching n-grams with the length of the LCS.

Remember our reference R and candidate summary C:

R: The cat is on the mat.
C: The cat and the dog.

The LCS is the three-word subsequence “the cat the” (remember that the words are not necessarily consecutive), which appears in order in both R and C.

ROUGE-L precision is the ratio of the length of the LCS, over the number of unigrams in C.

ROUGE-L precision = 3/5 = 0.6

ROUGE-L recall is the ratio of the length of the LCS, over the number of unigrams in R.

ROUGE-L recall = 3/6 = 0.5

Therefore, the F1-score is:

ROUGE-L F1-score = 2 * (precision * recall) / (precision + recall) = 0.55
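The LCS length can be found with the classic dynamic-programming table. The following is a sketch under the same tokenization assumption as the hand computation (lowercase words, no punctuation); `lcs_length` and `rouge_l` are illustrative names, and the plain F1 formula from this article is used rather than any weighted variant.

```python
import re


def lcs_length(a, b):
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l(candidate, reference):
    # Assumption: lowercase word tokens, punctuation dropped.
    cand = re.findall(r"\w+", candidate.lower())
    ref = re.findall(r"\w+", reference.lower())
    lcs = lcs_length(cand, ref)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


p, r, f1 = rouge_l("The cat and the dog.", "The cat is on the mat.")
print(p, r, round(f1, 3))  # 0.6 0.5 0.545
```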

ROUGE-S

ROUGE-S allows us to add a degree of leniency to the n-gram matching performed with ROUGE-N and ROUGE-L. ROUGE-S is a skip-gram co-occurrence metric: it matches pairs of words that are consecutive in the reference text and appear in the same order in the model output, even if they are separated there by one or more other words.

Consider the same reference R and a new candidate summary C:

R: The cat is on the mat.
C: The gray cat and the dog.

If we consider the 2-gram “the cat”, the ROUGE-2 metric would match it only if it appears in C exactly, but this is not the case since C contains “the gray cat”. However, using ROUGE-S with unigram skipping, “the cat” would match “the gray cat” too.

We can compute ROUGE-S precision, recall, and F1-score in the same way as the other ROUGE metrics.
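As a rough illustration, the sketch below counts skip-bigrams with a configurable maximum gap. It is an assumption-laden simplification, not the reference ROUGE-S implementation: the names `skip_bigrams`, `rouge_s`, and `max_skip` are hypothetical, the gap limit is applied to both texts, and tokens are lowercased with punctuation dropped.

```python
from collections import Counter
import re


def skip_bigrams(tokens, max_skip):
    # Ordered word pairs with at most `max_skip` words between them.
    pairs = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs[(tokens[i], tokens[j])] += 1
    return pairs


def rouge_s(candidate, reference, max_skip=1):
    # Assumption: lowercase word tokens, punctuation dropped.
    cand = skip_bigrams(re.findall(r"\w+", candidate.lower()), max_skip)
    ref = skip_bigrams(re.findall(r"\w+", reference.lower()), max_skip)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


R = "The cat is on the mat."
C = "The gray cat and the dog."
print(rouge_s(C, R, max_skip=0))  # no gap allowed: "the cat" is not matched
print(rouge_s(C, R, max_skip=1))  # one skipped word: "the cat" is matched
```

With `max_skip=0` the function reduces to plain bigram matching and finds no overlap, while `max_skip=1` lets “the cat” in R match “the gray cat” in C, giving one match out of nine skip-bigrams on each side.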

Pros and Cons of ROUGE

These are the tradeoffs to take into account when using ROUGE.

Pros: it correlates positively with human evaluation, it’s inexpensive to compute, and it’s language-independent.
Cons: ROUGE does not handle different words that have the same meaning, as it measures exact word matches rather than semantics.

ROUGE vs BLEU

In case you don’t know the BLEU metric already, I suggest that you read the companion article Learn the BLEU metric by examples to get a grasp on it.

In general:

BLEU focuses on precision: how much the words (and/or n-grams) in the candidate model outputs appear in the human reference.
ROUGE focuses on recall: how much the words (and/or n-grams) in the human references appear in the candidate model outputs.

The two metrics are complementary, as is often the case in the precision-recall tradeoff.

Computing ROUGE with Python

Implementing the ROUGE metrics in Python is easy thanks to the Python rouge library, where you can find ROUGE-1, ROUGE-2, and ROUGE-L. Although ROUGE-S is present in the ROUGE paper, it seems to have been used less and less over time.
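A minimal sketch of how the library is typically called is shown below, assuming the `rouge` package from PyPI is installed (`pip install rouge`). The wrapper name `rouge_scores` is made up for illustration; `Rouge().get_scores(hypothesis, reference)` is the library call.

```python
def rouge_scores(candidate, reference):
    # Imported inside the function so this snippet can be loaded even
    # when the package is not installed (pip install rouge).
    from rouge import Rouge

    scorer = Rouge()
    # get_scores takes the model output first and the reference second,
    # and returns a list with one result dict per pair.
    return scorer.get_scores(candidate, reference)[0]
```

Calling `rouge_scores("the cat and the dog", "the cat is on the mat")` should return a dictionary with the keys `rouge-1`, `rouge-2`, and `rouge-l`, each holding recall (`r`), precision (`p`), and F1 (`f`) values. The numbers may differ from the hand computations in this article, since the library applies its own tokenization and counting conventions.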

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
