Marvin Lanhenke

Summary

The article introduces the concept of attention in neural networks, particularly in the context of NLP, and discusses its importance for handling longer input sequences and focusing on the most relevant information.

Abstract

In the 19th installment of the #30DaysOfNLP series, the author delves into the attention mechanism within transformer-based architectures for natural language processing (NLP). The article builds upon the previous episode's sequence-to-sequence model for machine translation, emphasizing the limitations of traditional encoder-decoder structures when dealing with longer texts. It explains attention as a method to selectively concentrate on the most important parts of information, drawing parallels with human cognitive processes. The attention mechanism is described as having three main components: the conversion of raw data into vector representations, the creation of a memory list, and the dynamic highlighting of important information for a given task. The author illustrates how attention scores are computed and used to generate a context vector, which informs the final output of the model, such as a translated sentence. The article concludes by setting the stage for a deeper exploration of the transformer attention mechanism in the subsequent episode.

Opinions

  • The author suggests that traditional sequence-to-sequence models struggle with longer input sequences, highlighting the need for more efficient methods like attention mechanisms.
  • Attention is presented as a critical feature for NLP tasks, enabling models to mimic human-like focus and improve performance on complex tasks such as summarizing articles.
  • The article posits that memory and attention are closely linked, both in human cognition and in artificial neural networks, implying that the ability to focus on relevant information is key to effective processing.
  • The author expresses enthusiasm for the transformer attention mechanism, indicating its significance in the evolution of machine learning and NLP.
  • By comparing the attention mechanism to saliency maps in computer vision, the author implies that visual interpretation of attention in NLP models can be as insightful as in vision tasks.
  • The author encourages continued learning and engagement with the #30DaysOfNLP series, suggesting that understanding attention mechanisms is essential for staying at the forefront of NLP advancements.

#30DaysOfNLP

NLP-Day 19: You Better Pay Attention To Transformers (Part 1)

Introducing the concept of attention

Transformer-based architectures #30DaysOfNLP [Image by Author]

In the last episode, we implemented a sequence-to-sequence model for machine translation. We did this by creating a network in an encoder-decoder structure that made use of two LSTMs, mapping one sequence to another.

This approach worked very well for short phrases.

But what if we want to work with longer input sequences? What if we want to summarize a complete online article for example?

As we can imagine, trying to compress a complete document into a single thought vector and determining the most important information can become pretty difficult and computationally expensive. Thus, we need a way to focus.

We need a way to pay attention.

In the following sections, we’re going to learn what attention actually is, why it’s important, and how we can leverage the general concept of attention in the context of machine learning.

So take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP: You Better Pay Attention To Transformers (Part 1)

Focalization and concentration

Before we dive straight into the depths, the nitty-gritty of the attention mechanism, we should get an overall feel of the concept.

We should understand what attention actually means.

Attention can be described as an overall level of alertness, the ability to engage with one’s surroundings by selectively concentrating on the most important part of the information, while actively ignoring other parts.

A popular field of study incorporating the concept of attention is computer vision. The use of saliency maps provides a visual way to interpret and see which parts of the image the network pays the most attention to.

An example saliency map [Image by Mariusthart — Own work, CC BY-SA 3.0]

Memory and attention are tightly coupled.

It’s important to realize the dependencies between memory capacity and the ability to focus on the most salient information. The ability to pay attention.

Let’s consider our brain for example.

We as humans don’t have unlimited memory capacity. On the contrary, it is actually quite limited. Thus, the ability to actively select and decide which information is important enough to be stored and remembered becomes crucial.

Paying attention

Now, we have a basic idea of what attention is. But how does all of that fit into the context of machine learning?

In machine learning, attention describes the ability, and the mechanism, to dynamically highlight and use the most salient, most important part of the information at hand. In other words, we need a dynamic way to decide which information is important and to what degree.

An attention-based system generally has three components.

A rough overview of the 3 components of attention [Image by Author]

First, a process that reads raw data and converts it into a numerical vector representation, with one feature vector for each word position. Second, a list of feature vectors, created from the first component’s output, that represents some kind of memory. And third, a process that “pays attention”, that “exploits” the memory when performing a certain task.

We can think of the encoder-decoder framework as an example.

First of all, we process an input sequence of words and feed that sequence into an encoder. The encoder outputs a numerical representation, a vector for every element.
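To make the idea of “a vector for every element” a bit more tangible, here is a minimal sketch. It assumes a toy vocabulary and a random lookup table (both hypothetical) as a stand-in for the encoder. A real encoder, such as the LSTM from the last episode, would produce vectors that also depend on the surrounding words.

```python
import numpy as np

# Hypothetical toy vocabulary and random embedding table. This is only a
# stand-in for a trained encoder, which would output context-dependent vectors.
rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
embedding_dim = 4
embedding_table = rng.normal(size=(len(vocab), embedding_dim))


def encode(tokens):
    """Return one feature vector per word position."""
    ids = [vocab[token] for token in tokens]
    return embedding_table[ids]  # shape: (sequence_length, embedding_dim)


memory = encode(["the", "cat", "sat", "down"])
print(memory.shape)  # (4, 4): four word positions, four features each
```

The resulting array has one row per word position, which is exactly the kind of list we can treat as memory in the next step.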

Next, we create a list based on the encoder’s output and the decoder’s previous hidden states. This list represents some kind of memory that can be used to dynamically highlight which part of the information is most important. We can imagine the output in the form of a learned heat map.

Now, at each time step, we compute score values based on the memory and the decoder’s previous state. Each score tells us how well one position of the input sequence aligns with the current output. After some further processing (normalizing the scores into weights and using them to weight the memory entries), we generate a context vector, the output of the attention mechanism. The context vector indicates which part of the information, which context, is most important to the current output.

A simplified overview of the attention mechanism [Image by Author]
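The steps described above (score, normalize, weight) can be sketched in a few lines of NumPy. This is a simplified illustration, not the exact method of any particular model: it assumes a plain dot product as the scoring function and a softmax for normalization, and the names `attend`, `memory`, and `decoder_state` are made up for this example.

```python
import numpy as np


def softmax(x):
    """Turn raw scores into positive weights that sum to 1."""
    exp = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return exp / exp.sum()


def attend(memory, decoder_state):
    """Compute attention weights and a context vector.

    Assumption: dot-product scoring. Other scoring functions exist.
    """
    scores = memory @ decoder_state  # one score per input position
    weights = softmax(scores)        # normalized attention weights
    context = weights @ memory       # weighted sum of the memory entries
    return weights, context


# Toy memory: three input positions with two features each.
memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
decoder_state = np.array([2.0, 0.0])

weights, context = attend(memory, decoder_state)
print(weights)  # positions 0 and 2 receive the largest (and equal) weights
print(context)
```

In this toy setup, the first and third positions align best with the decoder state, so they dominate the context vector, while the second position is mostly ignored.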

We can also think of the complete process as some form of iterative re-weighting that allows the attention mechanism to flexibly and dynamically highlight the most salient parts needed for output generation.

Once the context vector is obtained, we feed it into the decoder to generate the final translated output.
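To see the re-weighting over several time steps, we can sketch a toy decoding loop. Everything here is an assumption for illustration: the memory is random, the decoder is a single `tanh` update with random (untrained) weight matrices `W_state` and `W_context`, and the initial state is simply the mean of the memory. A trained model would learn these parameters and also emit a word at each step.

```python
import numpy as np


def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()


rng = np.random.default_rng(42)
seq_len, hidden_dim = 5, 8

# Stand-in for the encoder outputs: one vector per input position.
memory = rng.normal(size=(seq_len, hidden_dim))

# Hypothetical, untrained decoder parameters.
W_state = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_context = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

state = memory.mean(axis=0)  # assumed initial decoder state
attention_history = []

for step in range(3):
    weights = softmax(memory @ state)  # re-score the memory at every step
    context = weights @ memory         # context vector for this step
    # Feed the context vector into the (toy) decoder to get the next state.
    state = np.tanh(W_state @ state + W_context @ context)
    attention_history.append(weights)

# One row per output step, one column per input position.
attention_map = np.stack(attention_history)
print(attention_map.shape)  # (3, 5)
```

Stacking the weights of all steps gives a matrix with one row per output step and one column per input position, which is the kind of heat map mentioned earlier.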

Conclusion

In this article, we gently introduced the concept of attention in general and in the context of machine learning. We not only developed an overall understanding but also got to know the different components of the attention mechanism.

Now it’s time to get to the nuts and bolts of the attention mechanism.

In the next episode, we cover the general and the transformer attention mechanisms, discovering the computational steps needed to create a context vector.

So don’t go anywhere, pay some attention, make sure to follow, and never miss a single day of the ongoing series #30DaysOfNLP.

