Mina Ghashami

Do You Know About SpanBERT?

A variant of BERT for span-level masking

Image from [1]

We know that pre-training methods such as BERT achieve strong performance using a self-supervised training approach. However, many NLP tasks involve reasoning about relationships between two or more spans of text.

One of these tasks is “extractive question answering”. An example of this task mentioned in the SpanBERT [1] paper is the following:

  • example: “Which NFL team won Super Bowl 50”?

Answering this question requires the model to extract the relationship that “Denver Broncos“ is a type of “NFL team”. If we do token masking and mask individual tokens, the model will not learn the relation between “Denver Broncos“ and “NFL team”. We need to mask these phrases as whole units. That’s what span masking is, and that’s what SpanBERT [1] does!

SpanBERT Model

SpanBERT is a variant of BERT that is designed to better represent and predict spans of text. It consistently outperforms BERT, and it differs from BERT in two ways:

  1. Masking scheme
  2. Training objectives

The Masking Scheme

SpanBERT masks random contiguous spans, rather than random individual tokens.

Image from [1]

A span is defined by its starting token and its length.

  • Starting token: To decide the starting token, SpanBERT selects a token uniformly at random. This token has to be the beginning of a word, not a subword in the middle of a word. The reason for this is that SpanBERT always masks complete words.
  • Span length: The span length is sampled from a geometric distribution with p=0.2. Note that the mean of this distribution is 1/p=5. The distribution is also skewed towards shorter lengths (see the image below), so we are more likely to select shorter spans. In addition, SpanBERT clips the span length at 10, as we see in the image below.
This shows the geometric distribution is skewed towards shorter spans — Image from [1]

Note that p is a parameter of the geometric distribution and can be set to any value in (0,1). The value p=0.2 was chosen experimentally.

Also note that once the starting token of a span and its length are decided, we mask all tokens in the span with the [MASK] token.
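
To make the procedure above concrete, here is a small Python sketch of this span-masking scheme. The function name `sample_span_mask` and its inputs are illustrative assumptions rather than the official implementation, and details such as completing the last word of a span and BERT's 80%/10%/10% replacement rule are omitted:

```python
import numpy as np

def sample_span_mask(tokens, is_word_start, mask_ratio=0.15, p=0.2, max_span_len=10, rng=None):
    """Sketch of SpanBERT-style span masking (illustrative, not the official code).

    tokens:        list of subword tokens
    is_word_start: parallel list of bools, True where a subword starts a new word
    """
    rng = rng or np.random.default_rng()
    budget = int(round(mask_ratio * len(tokens)))  # mask ~15% of tokens in total, as in BERT
    masked = set()

    while len(masked) < budget:
        # Span length ~ Geometric(p=0.2), so the mean is 1/p = 5; clipped at 10.
        length = min(int(rng.geometric(p)), max_span_len)
        # Starting token chosen uniformly at random; it must begin a whole word.
        start = int(rng.integers(0, len(tokens)))
        if not is_word_start[start]:
            continue
        # Mask the contiguous span [start, start + length).
        masked.update(range(start, min(start + length, len(tokens))))

    # Replace every token in every selected span with [MASK].
    return ["[MASK]" if i in masked else tok for i, tok in enumerate(tokens)]
```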

The Training Objectives

SpanBERT introduces a novel span-boundary objective (SBO) so the model learns to predict the entire masked span from the observed tokens at its boundary.

The training loss consists of two terms:

  1. MLM loss = masked language modeling loss
  2. SBO loss = span boundary objective loss

The MLM loss is the loss function used in training BERT; i.e., predicting the masked token from its output embedding vector. It is a cross-entropy loss.

The SBO loss is the new loss term that predicts the masked token given three things:

  • the beginning of the span,
  • the ending of the span, and
  • the masked token’s position

For example, consider the following sentence:

sentence: “Super Bowl 50 was an American football game to determine the champion.”

If we mask the span of “an American football game”, then

Image from [1]

for the “football” word that is masked,

  • beginning of the span = x4; this is the embedding of “was”, the token just before “an”. This is the external boundary at the start.
  • ending of the span = x9; this is the embedding of “to”, the token just after “game”. This is the external boundary at the end.
  • the masked token position = P3; this is the position embedding relative to the starting token of the span. Note that “football” is the third word from the start of the span.

SpanBERT assumes a span can be represented using its boundaries. So if x4 and x9 are the boundaries of the “an American football game” span, then we should be able to build a good representation of this span using the embedding of “was”, which is x4, and the embedding of “to”, which is x9.

Next, we should be able to predict a token given the embedding of its span and the position of the token within that span. In other words, we should be able to predict “football” given its position embedding P3 and the embedding of the span.

So a token representation will be a function of its external boundary representations and its position within the span. SpanBERT implements this function as a two-layer fully connected network with GELU activations and layer normalization:

Image from [1]
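
In equation form, the function shown in that figure from [1] is:

h_0 = [x_{s-1}; x_{e+1}; p_{i-s+1}]
h_1 = LayerNorm(GeLU(W_1 h_0))
y_i = LayerNorm(GeLU(W_2 h_1))

Here x_{s-1} and x_{e+1} are the encoder outputs for the external boundary tokens (x4 and x9 in our example), p_{i-s+1} is the relative position embedding (P3 for “football”), and W_1, W_2 are learned weight matrices.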

We then use the vector representation y_i to predict the token x_i, using a cross-entropy loss, just like the MLM loss.
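
As a rough illustration, here is a minimal PyTorch sketch of such a prediction head. The class name, argument names, and 0-based position indexing are my own choices, not the authors’ released implementation:

```python
import torch
import torch.nn as nn

class SpanBoundaryObjective(nn.Module):
    """Sketch of an SBO head: predict a masked token from the two external
    boundary representations plus a relative position embedding."""

    def __init__(self, hidden_size, vocab_size, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)  # relative positions within a span
        self.mlp = nn.Sequential(                                # two layers, GELU + LayerNorm
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)        # scores over the vocabulary

    def forward(self, x_start, x_end, rel_pos):
        # x_start, x_end: encoder outputs for the boundary tokens (x4 and x9 in the example), shape (batch, hidden)
        # rel_pos: 0-based position of the target inside the span; the article's P3 corresponds to index 2 here
        h0 = torch.cat([x_start, x_end, self.pos_emb(rel_pos)], dim=-1)
        y = self.mlp(h0)                                          # y_i in the paper's notation
        return self.decoder(y)                                    # logits, trained with cross-entropy
```

For the running example, calling this head with the encoder outputs for “was” and “to” and relative position 2 would produce the logits used to predict “football”.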

The final loss is the sum of the MLM and SBO losses over all masked tokens. For the word “football” it is:

loss terms in spanBERT — image from [1]
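
Written out, with “football” being the token at position 7 (x_7) in this sentence:

L(football) = L_MLM(football) + L_SBO(football)
            = -log P(football | x_7) - log P(football | y_7)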

We compute this loss for every masked token in the sequence and sum it up as the final loss.

Conclusion

SpanBERT is a variant of BERT that masks whole contiguous spans rather than individual tokens. It trains the model using a new objective called the SBO loss and combines it with the regular MLM loss. Both losses (MLM and SBO) are cross-entropy losses.

SpanBERT predicts a token using its span embedding and the token’s position in the span. As a result, a token can have multiple representations depending on the span it is in: for each span a token is inside, it gets a different representation.

SpanBERT is shown to outperform BERT in various extractive question answering tasks.

If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/

References

  1. SpanBERT: Improving Pre-training by Representing and Predicting Spans