Marvin Lanhenke

Summary

The article delves into the intricacies of the attention mechanism, particularly within transformer models, and builds upon the foundational concepts introduced in a previous installment.

Abstract

The article, titled "NLP-Day 20: You Better Pay Attention To Transformers (Part 2)," continues the exploration of attention mechanisms in NLP, focusing on the transformer attention model. It begins by recapping the general attention mechanism introduced in the previous episode, emphasizing its role in overcoming the bottleneck problem of fixed-length encoded vectors. The author then describes the computational steps of the attention mechanism, including the calculation of alignment scores, weights via softmax, and the creation of a context vector. The general attention mechanism is further detailed, introducing queries, keys, and values, and how they interact to produce attention weights and the final context vector. The transformer attention mechanism is presented as an evolution of this concept, utilizing self-attention to relate words within a sequence through query, key, and value matrices. The article explains single-head and multi-head attention, the latter allowing the model to capture information from different representation subspaces. The conclusion sets the stage for the next episode, which will dissect the transformer model further.

Opinions

  • The author suggests that the attention mechanism is crucial for allowing a decoder to dynamically select and utilize the most important parts of an input sequence.
  • The article conveys that the transformer attention mechanism is a significant advancement, as it eliminates the need for recurrence and convolutions, relying solely on self-attention.
  • The author implies that multi-head attention is superior to single-head attention due to its ability to extract diverse information from multiple representation subspaces.
  • The article positions the transformer model as a foundational component for understanding and implementing advanced NLP systems.
  • By encouraging readers to follow the ongoing series and become Medium members, the author indicates that continuous learning and engagement with the content are valuable for grasping complex NLP concepts.

#30DaysOfNLP

NLP-Day 20: You Better Pay Attention To Transformers (Part 2)

Understanding the general and transformer attention mechanism

Transformer-based architecture #30DaysOfNLP [Image by Author]

In the last episode, we gently introduced the concept of attention in general as well as in the context of machine learning. However, we purposely stayed on the surface to get a general overview.

Now it’s time to get into the weeds, dive deeper, and explore the attention mechanism in greater detail.

In the following sections, we’re going to uncover the inner workings and the main computational steps of the attention mechanism. We will learn about the general and the transformer-based attention mechanisms, understanding the meaning and purpose of things like queries, keys, and values.

So take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP: You Better Pay Attention To Transformers (Part 2)

The attention mechanism

Introduced by Bahdanau et al. (2014), the attention mechanism addresses the bottleneck problem of fixed-length encoded vectors: because the encoder has to compress the whole input sequence into a single vector, the decoder can only access a limited part of the information.

The core idea to overcome this problem is to allow the decoder to dynamically select and utilize the most important parts of the input sequence. This can be achieved by computing a context vector that represents the attention as a weighted combination of the encoded input vectors.

The attention mechanism consists of three computational steps.

Basic attention mechanism [Image By Author]

First, we calculate the alignment scores, which tell us how well the elements of the input sequence align with the output at the current time step. We can implement this scoring function with a feedforward network that takes the encoded hidden states and the decoder’s previous output as input.

The second step involves the computation of the weights, which can be achieved by applying a softmax function to the previously calculated alignment scores.

Finally, at each time step we create a context vector by taking the weighted sum of all encoder hidden states.
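The three steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the original implementation: the sizes, the randomly initialised parameters `W_enc`, `W_dec`, and `v`, and the single-layer `tanh` scoring network are all assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, attn_dim = 4, 6, 5, 8  # illustrative sizes (assumed)

encoder_states = rng.normal(size=(T, enc_dim))  # one hidden state per input step
decoder_state = rng.normal(size=dec_dim)        # decoder output of the previous step

# Hypothetical, randomly initialised parameters of the scoring network.
W_enc = rng.normal(size=(enc_dim, attn_dim))
W_dec = rng.normal(size=(dec_dim, attn_dim))
v = rng.normal(size=attn_dim)

# Step 1: alignment scores, one per input position.
scores = np.tanh(encoder_states @ W_enc + decoder_state @ W_dec) @ v  # shape (T,)

# Step 2: softmax turns the scores into weights that sum to one.
weights = softmax(scores)

# Step 3: the context vector is the weighted sum of the encoder hidden states.
context = weights @ encoder_states  # shape (enc_dim,)

print(weights)
print(context.shape)
```

In a trained model the parameters would be learned jointly with the encoder and decoder, and the three steps would be repeated for every decoder time step.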

These steps define the nuts and bolts of the attention mechanism. However, the procedure can be further generalized to settings where the information is not necessarily related in a sequential fashion.

The general attention mechanism

Like the basic attention mechanism, the general version is defined by three components: the queries, keys, and values.

The query contains the decoder’s output from the previous time step, whereas the values are defined by the encoded inputs. We can think of the whole process as a query against a database of key-value pairs, where each key is a vector that the query is matched against and each value is the encoded hidden state associated with that key.

Next, we follow the same computational steps as described earlier.

First, we match each query vector against each key to compute a score, which is defined by the dot product of the query and the corresponding key vector.

Once again, we apply a softmax function to the scores to calculate the weights.

The last step computes the generalized attention by taking the weighted sum of the value vectors, where each value is weighted by the attention weight of its associated key.

Let’s consider a machine translation task as an example.

We take a specific word from an input sequence and use its query vector to score it against each key in the database. Doing so allows us to capture how our specific word relates to the other words in the sequence. Next, we simply scale the values according to the attention weights, enabling us to retain focus on the most relevant words.
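The query-key-value steps can be sketched as follows. The function name `general_attention` and the toy keys, values, and query are made up for illustration; real models would use learned, higher-dimensional vectors.

```python
import numpy as np

def general_attention(queries, keys, values):
    """Dot-product attention without scaling or learned parameters."""
    # Step 1: score every query against every key with a dot product.
    scores = queries @ keys.T  # shape (n_queries, n_keys)
    # Step 2: a softmax over the keys turns the scores into weights.
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Step 3: each output is the weighted sum of the value vectors.
    return weights @ values, weights

# Toy "database" of three key-value pairs (assumed values).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query = np.array([[2.0, 0.0]])  # a single query vector

context, weights = general_attention(query, keys, values)
print(weights)  # the second key matches the query least and gets the smallest weight
print(context)
```

Because the query points in the same direction as the first key, the first and third values dominate the resulting context vector, while the second value contributes very little.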

Transformer attention

Now, the attention mechanism is taken even one step further.

The approach, described in Attention Is All You Need by Vaswani et al. (2017), revolutionized the attention mechanism by ditching recurrence and convolutions and instead relying solely on a self-attention mechanism.

Self-attention, sometimes called intra-attention, computes a representation of a sequence by relating different words in the same sequence.

The main components used by the Transformer attention are quite similar to the ones we already encountered in the earlier sections.

Single-head attention [Image by Author]

We also make use of query, key, and value vectors. However, this time we pack each set of vectors together and store them in three matrices Q, K, and V, respectively. Next, we have three projection matrices, allowing us to generate different subspace representations of the query, key, and value matrices. The last component is yet another projection matrix for the multi-head output.

The computational steps needed follow the general attention mechanism by implementing scaled dot-product attention.
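For reference, scaled dot-product attention as defined by Vaswani et al. (2017) can be written as follows, where d_k denotes the dimensionality of the key vectors:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```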

Since we store the queries, keys, and values in three matrices, we can apply the scaled dot-product attention to the entire set of queries simultaneously.

The scaling factor, 1/√d_k with d_k being the dimensionality of the key vectors, is used to deal with the problem of vanishing gradients: large dot products push the softmax function into regions where its gradients are extremely small.
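A minimal NumPy sketch of scaled dot-product attention might look like the following. The matrix sizes and the random inputs are assumptions for illustration only. The last lines compare the weights with and without scaling to show why the scaling factor matters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # All queries are scored against all keys in one matrix product.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 16  # illustrative sizes (assumed)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)

# Without the scaling factor the dot products are larger and the softmax
# saturates, so the largest weight in each row moves closer to one.
unscaled_weights = softmax(Q @ K.T, axis=-1)
print(weights.max(axis=-1))
print(unscaled_weights.max(axis=-1))
```

When the softmax saturates like this, almost all of the weight lands on a single key and the gradients with respect to the scores become very small, which is the effect the scaling is meant to reduce.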

Multi-head attention [Image by Author]

Building upon the single-head attention, the multi-head attention mechanism linearly projects the queries, keys, and values multiple times, each time using a different learned projection.

This allows us to apply the single attention mechanism to each projection in parallel and produce multiple outputs which are then concatenated and projected one last time in order to produce the final output.

The underlying idea here is to be able to extract information from different learned representation subspaces, which is difficult to achieve with a single attention head because its weighted averaging blends this information together.
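The project, attend, concatenate, and project-again procedure can be sketched as below. This is a simplified self-attention example under several assumptions: the projection matrices `W_q`, `W_k`, `W_v`, and `W_o` are random stand-ins for learned parameters, the sizes are arbitrary, and the heads are computed in a plain Python loop rather than in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v have shape (n_heads, d_model, d_head);
    # W_o has shape (n_heads * d_head, d_model).
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        # Each head projects the input into its own subspace and attends there.
        heads.append(attention(X @ Wq_i, X @ Wk_i, X @ Wv_i))
    # Concatenate the head outputs and project them one last time.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 32, 4  # illustrative sizes (assumed)
d_head = d_model // n_heads

X = rng.normal(size=(n, d_model))  # one row per word in the sequence
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)
```

Since queries, keys, and values are all derived from the same input `X`, this is self-attention: every word in the sequence attends to every other word, once per head.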

Conclusion

In this article, we covered the attention mechanism in general and in the context of a transformer-based application. We introduced the main components as well as the computational steps needed to compute the attention.

By encountering the concept of self-attention, we started to build the foundation needed to understand, apply, and implement a transformer-based neural network.

In the next episode, we continue building upon that foundation by dissecting the transformer model.

So take a seat, don’t go anywhere, make sure to follow, and never miss a single day of the ongoing series #30DaysOfNLP.
