Fabio Chiusano

Two minutes NLP — Visualizing Global vs Local Attention

Seq2seq, Global Attention, Local Attention, Monotonic Alignment, and Predictive Alignment

Photo by Paul Skorupskas on Unsplash

Under the hood, seq2seq models are usually composed of an encoder and a decoder. Without an Attention mechanism, the encoder processes each item in the input sequence and compresses the information it captures into a single vector, called the context vector. After processing the entire input, the encoder sends the context vector to the decoder, which then produces the output sequence item by item.

Seq2seq with Attention

In practice, a fixed-length context vector struggles to represent long input sequences: it tends to forget the earlier parts of the sequence by the time the encoder reaches the end. The Attention mechanism was introduced to solve this problem.

Consider the example sentence “Where is Wally” which should be translated to its Italian counterpart “Dove è Wally”. Here is how the encoder processes the input word by word, producing three different hidden states.

Example of an encoder producing three hidden states from the input sequence “Where is Wally”. Image by the author.

With Attention, the encoder passes all its hidden states to the decoder instead of passing only the final hidden state.

Example of a decoder producing the first output token, taking into account all the hidden states of the encoder. Image by the author.
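To make the difference concrete, here is a minimal sketch of a toy recurrent encoder that returns one hidden state per input word. The embeddings, the weight matrix, and the `encode` function are hypothetical and randomly initialized for illustration only; a real encoder would use trained parameters and typically an LSTM or GRU cell.

```python
import math
import random


def encode(tokens, hidden_size=4, seed=0):
    """Toy recurrent encoder: returns one hidden state per input token."""
    rng = random.Random(seed)
    # Hypothetical random embeddings and recurrent weights (not trained).
    embed = {
        tok: [rng.uniform(-1, 1) for _ in range(hidden_size)]
        for tok in dict.fromkeys(tokens)
    }
    w = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]

    state = [0.0] * hidden_size
    hidden_states = []
    for tok in tokens:
        # The new state mixes the previous state (through w) with the current word.
        state = [
            math.tanh(sum(w[i][j] * state[j] for j in range(hidden_size)) + embed[tok][i])
            for i in range(hidden_size)
        ]
        hidden_states.append(state)
    return hidden_states


hidden_states = encode(["Where", "is", "Wally"])

# Without Attention, only the final hidden state reaches the decoder.
context_without_attention = hidden_states[-1]

# With Attention, the decoder receives every hidden state.
print(len(hidden_states))  # 3, one per input word
```

The only point of the sketch is the shape of the data: three words in, three hidden states out, and a choice between passing on the last one or all of them.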

But how can the decoder use the encoder hidden states, given that their number depends on the length of the input sequence? With an Attention mechanism.

Attention produces a single fixed-size context vector from all the encoder hidden states, typically as a weighted sum. The weight of each hidden state depends on the word the decoder is processing at the moment, and represents the “attention” that should be given to that hidden state at this step.

A detailed example of a decoder producing the first output token, taking into account all the hidden states of the encoder. Image by the author.
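The weighted sum can be sketched in a few lines. This is a minimal illustration, assuming the simplest possible scoring rule (a dot product between the decoder state and each encoder hidden state) followed by a softmax; the `attend` function and the example vectors are made up for this sketch.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, hidden_states):
    """Condense a variable number of hidden states into one context vector."""
    # Assumption: dot-product scoring, chosen here only because it is the simplest.
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in hidden_states]
    weights = softmax(scores)
    size = len(hidden_states[0])
    # Weighted sum of the hidden states, one component at a time.
    context = [sum(w * hs[i] for w, hs in zip(weights, hidden_states)) for i in range(size)]
    return context, weights


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three encoder states
decoder_state = [2.0, 0.0]                            # current decoder state

context, weights = attend(decoder_state, hidden_states)
print(weights)  # the second state gets the smallest weight
print(context)  # always 2 values, however many hidden states there are
```

Whatever the length of the input, the context vector keeps the same size as a single hidden state, which is what lets the decoder consume it.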

Abstracting from the encoder-decoder architecture, an Attention mechanism condenses a list of hidden states into a single context vector, taking into account which word the model is translating at the moment.

Attention mechanism inputs and outputs. Image by the author.

Seq2seq with Global Attention

Global Attention is an Attention mechanism that considers all the hidden states in creating the context vector.

It does so with a weighted sum, where each weight is computed by a feedforward neural network (NN) from the corresponding hidden state and the word the model is translating at the moment.

Visualization of how Global Attention works. Image by the author.
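Here is a rough sketch of Global Attention with a small feedforward scorer. It assumes one common form of scoring network: the hidden state and the decoder state are concatenated, passed through a single `tanh` layer, and reduced to a scalar. The weights are random and untrained, and `make_scorer` and `global_attention` are hypothetical names used only for this example.

```python
import math
import random


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_scorer(hidden_size, inner_size=8, seed=0):
    """Build a tiny feedforward scorer with random (untrained) weights."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(2 * hidden_size)] for _ in range(inner_size)]
    v = [rng.uniform(-1, 1) for _ in range(inner_size)]

    def score(hidden_state, decoder_state):
        # Concatenate the two vectors, apply one tanh layer, project to a scalar.
        x = hidden_state + decoder_state
        layer = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
        return sum(vi * li for vi, li in zip(v, layer))

    return score


def global_attention(decoder_state, hidden_states, score):
    # Every encoder hidden state is scored: this is what makes it "global".
    weights = softmax([score(hs, decoder_state) for hs in hidden_states])
    size = len(hidden_states[0])
    context = [sum(w * hs[i] for w, hs in zip(weights, hidden_states)) for i in range(size)]
    return context, weights


hidden_states = [[0.1, 0.3, -0.2], [0.7, -0.5, 0.4], [-0.6, 0.2, 0.9]]
decoder_state = [0.2, -0.1, 0.5]

score = make_scorer(hidden_size=3)
context, weights = global_attention(decoder_state, hidden_states, score)
print(weights)  # one weight per encoder hidden state
```

Note that the scorer runs once per hidden state at every decoding step, so the cost grows with the length of the input.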

Seq2seq with Local Attention

Global Attention requires a lot of computation, because at every translation step all the hidden states must be taken into consideration, concatenated into a matrix, and processed by an NN to compute their weights.

Can we reduce the number of computations, without sacrificing quality? Yes, with Local Attention!

Local Attention is an Attention mechanism that considers only a subset of all the hidden states in creating the context vector. The subset can be obtained in many different ways, such as with Monotonic Alignment and Predictive Alignment.

Seq2seq with Monotonic Alignment Local Attention

Monotonic Alignment Local Attention selects the subset by keeping only the hidden states closest to the current translation step, that is, a fixed-size window centered on the source position with the same index as the current output position.

For example, if we are translating a five-word sentence like “Where is the red car”, at the third translation step we may consider only the hidden states produced when the encoder processed words from the second to the fourth.

Visualization of how Local Attention with Monotonic Alignment works. Image by the author.
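The window selection from the example above can be sketched as follows. The `monotonic_window` function and its `half_width` parameter are names chosen for this illustration; positions are zero-based, so the third translation step is `step=2`.

```python
def monotonic_window(step, source_len, half_width=1):
    """Indices of the hidden states kept at a given translation step."""
    # The window is centered on the current step and clipped to the sentence.
    start = max(0, step - half_width)
    end = min(source_len - 1, step + half_width)
    return list(range(start, end + 1))


words = ["Where", "is", "the", "red", "car"]

# Third translation step (zero-based index 2), one word on each side.
kept = monotonic_window(step=2, source_len=len(words))
print([words[i] for i in kept])  # ['is', 'the', 'red']
```

Attention weights are then computed only over the hidden states in the window, exactly as in the global case but on a shorter list.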

Seq2seq with Predictive Alignment Local Attention

Predictive Alignment Local Attention selects the subset by keeping only the hidden states closest to a predicted position, where the prediction takes into account which word the model is translating at the moment.

For example, if we are translating a five-word sentence like “Where is the red car”, at the third translation step we may consider only the hidden states produced when the encoder processed words from the third to the fifth, because a NN predicted that the hidden states near the fourth may be important.

Visualization of how Local Attention with Predictive Alignment works. Image by the author.
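A possible sketch of the prediction step is shown below. It assumes one common formulation, in which a small network maps the decoder state to a value between 0 and 1 through a sigmoid, and that value is scaled to the length of the source sentence. The optional Gaussian bias that favors states near the predicted position is also an assumption, as are all function names and the random weights.

```python
import math
import random


def predict_position(decoder_state, source_len, w, v):
    """Predict a (real-valued) source position from the decoder state."""
    inner = [math.tanh(sum(wi * di for wi, di in zip(row, decoder_state))) for row in w]
    gate = 1.0 / (1.0 + math.exp(-sum(vi * x for vi, x in zip(v, inner))))
    # Scale the sigmoid output to a zero-based position in the source sentence.
    return (source_len - 1) * gate


def predictive_window(position, source_len, half_width=1):
    """Indices of the hidden states kept around the predicted position."""
    center = round(position)
    start = max(0, center - half_width)
    end = min(source_len - 1, center + half_width)
    return list(range(start, end + 1))


def gaussian_bias(indices, position, half_width=1):
    """Optional factor that favors states near the predicted position."""
    sigma = half_width / 2  # assumes half_width > 0
    return [math.exp(-((i - position) ** 2) / (2 * sigma ** 2)) for i in indices]


# Random, untrained weights: the predicted position here is arbitrary.
rng = random.Random(0)
w = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
v = [rng.uniform(-1, 1) for _ in range(4)]
position = predict_position([0.5, -0.2, 0.1, 0.9], source_len=5, w=w, v=v)
print(position)  # somewhere between 0 and 4

# The example from the text: the NN points near the fourth word (index 3).
kept = predictive_window(3.0, source_len=5)
print(kept)  # [2, 3, 4], the third to the fifth word
bias = gaussian_bias(kept, 3.0)
```

Unlike the monotonic case, the center of the window is an output of the model, so it can jump ahead or stay in place from one step to the next.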

Attention matrices of different types of Attention mechanisms

The hidden states considered at each translation step can be visualized with matrices, where a cell is colored if the hidden state produced when processing the word on the left has been used when decoding the word on the top.

Attention matrices of different types of Attention mechanisms. Image by the author.

Note that:

  • Global Attention considers all the hidden states at each translation step.
  • Local Attention with Monotonic Alignment produces a regular band along the diagonal, without irregularities.
  • Local Attention with Predictive Alignment often produces a similar band along the diagonal, but with some irregularities, because the important hidden states are predicted anew at each translation step.
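The three patterns can be reproduced with a small sketch that builds the matrices as 0/1 masks, with source words as rows and translation steps as columns. The `attention_mask` function is a hypothetical helper, and the predicted centers used for Predictive Alignment are invented for illustration.

```python
def attention_mask(source_len, target_len, centers=None, half_width=1):
    """Build a 0/1 matrix: mask[s][t] is 1 if hidden state s is used at step t.

    centers=None means Global Attention; otherwise centers[t] is the
    center of the local window at translation step t.
    """
    mask = [[0] * target_len for _ in range(source_len)]
    for t in range(target_len):
        for s in range(source_len):
            if centers is None or abs(s - centers[t]) <= half_width:
                mask[s][t] = 1
    return mask


n = 5
global_mask = attention_mask(n, n)
# Monotonic Alignment: the window center is the translation step itself.
monotonic_mask = attention_mask(n, n, centers=list(range(n)))
# Predictive Alignment: hypothetical centers, as if predicted by an NN.
predictive_mask = attention_mask(n, n, centers=[0, 1, 3, 3, 4])

for name, mask in [
    ("Global", global_mask),
    ("Monotonic", monotonic_mask),
    ("Predictive", predictive_mask),
]:
    print(name)
    for row in mask:
        print("".join("#" if cell else "." for cell in row))
```

Printing the masks shows a fully filled square for Global Attention, a clean band for Monotonic Alignment, and a band with a jump at the third step for Predictive Alignment.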

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
