Ms Aerin

Summary

This post explains perplexity in the context of Natural Language Processing (NLP), detailing its definition, derivation, and practical significance as a metric for evaluating language models.

Abstract

Perplexity is a key measure in NLP used to assess the performance of language models, representing how well a model predicts a given sample. It is mathematically defined as the exponentiation of entropy, indicating the weighted average number of choices a random variable has. The concept is derived from entropy, which is the average number of bits needed to encode information. The post delves into the intuition behind perplexity, its relationship with entropy and cross-entropy, and the reasons behind its use in language model evaluation. It clarifies the formula presented in NLP courses and addresses common misconceptions and practical considerations, such as the necessity of smoothing techniques to prevent infinite perplexity.

Opinions

  • Perplexity is considered easier to interpret than entropy when thought of as a 'branching factor', representing the weighted average number of choices.
  • The concept of perplexity is well-established in NLP, primarily due to historical reasons rather than being the sole reasonable metric.
  • Dr. Luke Zettlemoyer suggests that there are no profound reasons for the use of perplexity over other metrics; its widespread use is largely due to tradition and ease of interpretability.
  • The post emphasizes that a lower perplexity is preferable, indicating a more predictable and less disordered language model, which is considered favorable in NLP applications.
  • The author acknowledges that the derivations and assumptions in the NLP class slides are for illustrative purposes and may not reflect realistic scenarios, highlighting the importance of entropy and smoothing techniques in the context of NLP language models.

Perplexity Intuition (and its derivation)

Never be perplexed again by perplexity.

You might have seen something like this in an NLP class:

A slide from Dr. Luke Zettlemoyer’s NLP class

Or

A slide of CS 124 at Stanford (Dr. Dan Jurafsky)

In class, we don’t really spend time deriving perplexity. Maybe perplexity is a basic concept that you already know? This post is for those who don’t.

In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models.

But why is perplexity in NLP defined the way it is?

If you look up the perplexity of a discrete probability distribution on Wikipedia (https://en.wikipedia.org/wiki/Perplexity), you’ll find:

PP(p) = 2^H(p)

where H(p) is the entropy of the distribution p(x) and x is a random variable over all possible events.

In the previous post, we derived H(p) from scratch and intuitively showed why entropy is the average number of bits that we need to encode the information. If you don’t understand H(p), please read this before reading further.

Now we agree that H(p) = -Σ p(x) log₂ p(x).

Then, perplexity is just an exponentiation of the entropy!

Yes. Entropy is the average number of bits needed to encode the information contained in a random variable, so the exponentiation of the entropy should be the total number of possible outcomes, or more precisely, the weighted average number of choices a random variable has.

For example, if the average sentence in the test set could be coded in 100 bits, the model perplexity is 2¹⁰⁰ per sentence.
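
To make the relationship concrete, here is a minimal sketch in Python (mine, not from the post) that computes the entropy of a discrete distribution and confirms that perplexity is just 2 raised to that entropy:

    import math

    def entropy(p):
        # Entropy in bits of a discrete distribution given as a list of probabilities.
        return -sum(px * math.log2(px) for px in p if px > 0)

    def perplexity(p):
        # Perplexity = 2 ** entropy: the weighted average number of choices.
        return 2 ** entropy(p)

    print(entropy([0.25] * 4))               # 2.0 bits for a fair 4-way choice
    print(perplexity([0.25] * 4))            # 4.0, i.e. four equally likely options
    print(perplexity([0.7, 0.1, 0.1, 0.1]))  # ~2.56, a skewed choice has fewer "effective" options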

Let’s confirm that the definition in Wikipedia matches the one in the slides.

Where

p : The probability distribution that we want to model. A training sample is drawn from p, and p’s true distribution is unknown to us.

q : A proposed probability model. Our prediction.

We can evaluate our prediction q by testing it against samples drawn from p; this is basically calculating the cross-entropy H(p, q). In the derivation above, we assumed all words have the same probability (1 / # of words) in p.
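
The algebra behind that statement, sketched in LaTeX (my reconstruction of the step the slides take, writing the test set as W = w₁ … w_N and using log base 2 throughout):

    \[
    \begin{aligned}
    H(p, q) &= -\sum_{x} p(x)\, \log_2 q(x)
              && \text{cross-entropy of } q \text{ against } p \\
            &= -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(w_i)
              && \text{assuming } p(x) = 1/N \text{ for each of the } N \text{ words} \\[4pt]
    PP(W)   &= 2^{H(p, q)}
             = \Bigl( \prod_{i=1}^{N} q(w_i) \Bigr)^{-1/N}
              && \text{exponentiate the cross-entropy}
    \end{aligned}
    \]

With a unigram model, P(w₁ w₂ … w_N) factors into ∏ q(wᵢ), so the last line is exactly the PP(W) = P(w₁ w₂ … w_N)^(-1/N) formula on the slides.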

Remarks

  • When q(x) = 0, the perplexity becomes infinite. In fact, this is one of the reasons the concept of smoothing in NLP was introduced.
  • If we use a uniform probability model for q (simply 1/N for all words), the perplexity will be equal to the vocabulary size. (Both points are demonstrated in the sketch after this list.)
  • The derivation above is for illustration purposes only, in order to reach the formula in the UW/Stanford slides. Both slides assume that we are calculating the perplexity of the entire corpus with a unigram model and that there are no duplicated words, i.e., the total number of words (N) is the same as the number of unique words. They also assume all words have the same probability, 1/N. These are not realistic assumptions.
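
Both remarks can be reproduced with a small sketch; the toy vocabulary, the model dictionaries, and the perplexity helper below are my own illustrative assumptions, not code from the post:

    import math

    def perplexity(q, words):
        # 2 ** cross-entropy of model q (a dict of word -> probability) against the
        # empirical distribution of the test words (uniform, 1/N each).
        n = len(words)
        log_prob = 0.0
        for w in words:
            if q.get(w, 0) == 0:
                return math.inf  # a single zero-probability word makes the perplexity infinite
            log_prob += math.log2(q[w])
        return 2 ** (-log_prob / n)

    vocab = ["a", "b", "c", "d"]

    uniform_q = {w: 1 / len(vocab) for w in vocab}
    print(perplexity(uniform_q, vocab))      # 4.0, equal to the vocabulary size

    unsmoothed_q = {"a": 0.5, "b": 0.5, "c": 0.0, "d": 0.0}
    print(perplexity(unsmoothed_q, vocab))   # inf, which is why smoothing was introduced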

Takeaway

  • Less entropy (a less disordered system) is favorable over more entropy, because predictable results are preferred over randomness. This is why people say low perplexity is good and high perplexity is bad, since perplexity is the exponentiation of the entropy (and you can safely think of the concept of perplexity as entropy).
  • A language model is a probability distribution over sentences. And the best language model is one that best predicts an unseen test set.
  • Why do we use perplexity instead of entropy? If we think of perplexity as a branching factor (the weighted average number of choices a random variable has), then that number is easier to understand than the entropy (a quick numeric illustration follows this list). I found this surprising because I thought there would be more profound reasons. I asked Dr. Zettlemoyer whether there is any reason other than easy interpretability. His answer was: “I think that is it! It is largely historical since lots of other metrics would be reasonable to use as well!”
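
A quick numeric illustration of that last point (my numbers, not the author’s):

    # An entropy of, say, 7.3 bits is hard to picture, but its exponentiation reads
    # directly as "the model is, on average, choosing among about 158 words."
    print(2 ** 7.3)  # ~157.59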