Riccardo Andreoni

Beam Search: the Most Used Algorithm in Sequence Models

Learn the working principles of the most famous algorithm for text translation and speech recognition.

Beam Search allows us to consider multiple streams of candidates. Image source: unsplash.com.

Imagine you’re an AI language model, like ChatGPT, completing a sentence. How do you choose the next word to make it not just grammatically correct but also contextually relevant? This is where Beam Search steps in.

By efficiently exploring multiple possibilities in parallel and maintaining the top candidates at each step, Beam Search plays a crucial role in predicting subsequent elements, ensuring the output aligns with grammatical constraints and context.

To understand Beam Search's impact, think of all the applications that require precise sequence generation, such as language translation, text completion, and chatbots: Beam Search is central to each of them.

In this article, I will introduce the theory and guide you through a practical step-by-step example of the Beam Search algorithm. I will also present several Beam Search variants, and detail all the pros and cons of this fundamental algorithm.

Understanding Beam Search

Imagine you need to translate the following sentence from Spanish to English:

Pablo estará en Nueva York la próxima semana.

We don’t just want a correct translation; we want the best one. For a language model, the best output coincides with the most probable one.
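In probabilistic terms, the model searches for the output sequence y that maximizes P(y | x), where x is the input sentence. By the chain rule, this probability factors into a product of per-word conditional probabilities:

P(y | x) = P(y₁ | x) · P(y₂ | x, y₁) · … · P(yₜ | x, y₁, …, yₜ₋₁)

Searching over every possible sequence is computationally intractable, which is why a heuristic is needed.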

To achieve this, most sequence-to-sequence models employ Beam Search. It is a heuristic algorithm that systematically explores multiple possibilities in parallel: at each step, it keeps a fixed number of top candidates, determined by a parameter called the “beam width”.

This approach mimics decision-making processes, with the model evaluating and selecting the most promising options.

Beam Search Step-by-Step

Consider a standard sequence-to-sequence model, represented by the simple network below:

Image by the author.

At each timestep t, the model outputs a single word, which is used to compose the final sequence. The final sequence is simply the translated sentence.

If you are not familiar with RNNs, I suggest checking the simple introduction I included in this article:

For text translation, we commonly use an encoder, the portion of the network dedicated to converting the input sequence into a vector representation. In this case, the input sequence is the Spanish sentence to be translated. The encoder is followed by a decoder which, at each timestep, returns a piece of the output sequence.

Image by the author.

Let’s examine the first step in detail.

The encoder (green part) receives the Spanish sentence “Pablo estará en Nueva York la próxima semana.” and provides a vector form of it to the decoder (blue part).

Image by the author.

At timestep 1, the decoder will output the first word of the translation. The key question is:

How does the decoder choose the word?

The decoder chooses the most probable word, given the input sequence x. In other words, for each word in the dictionary, the model computes the probability of that word being the first word of the output sequence.

In a greedy approach, the single word that maximizes this probability is chosen.

Here is where Beam Search enters the game: it introduces a parameter B, called beam width, which represents the number of words chosen by the model at each step.

By setting, for instance, B=3, the model will return not only one but three candidate words.
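As a minimal illustration of this first step, selecting the B most probable words amounts to a top-k operation over the decoder's output distribution. The vocabulary and probabilities below are made up for the example:

```python
import numpy as np

B = 3  # beam width

# Hypothetical probabilities the decoder assigns to each vocabulary
# word for the first position of the translation (illustrative only).
vocab = np.array(["Pablo", "Next", "During", "week", "will"])
probs = np.array([0.32, 0.25, 0.18, 0.15, 0.10])

# Keep the B most probable first words instead of just the argmax.
top_b = np.argsort(probs)[::-1][:B]
print(vocab[top_b])  # ['Pablo' 'Next' 'During']
```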

Let’s suppose the most probable first words of the translation are “Pablo”, “Next”, and “During”.

Image by the author.

For each of these 3 candidate words, the model proceeds by guessing the second word in the English translation.

Image by the author.

As always, the model picks the words that maximize a given probability. In this case, it is the joint probability of the first two words given the input sentence: P(y₁, y₂ | x) = P(y₁ | x) · P(y₂ | x, y₁).

You can clearly see the reason for the name Beam Search now. Starting from a narrow set of 3 candidates, the algorithm fans out at each step, considering the probabilities of all the resulting combinations before pruning the set back down to the beam width.

The output of step 2 consists of 9 candidate sequences: “Pablo will”, “Pablo is”, …, “During next”, and “During one”. The model associates each one of them with a probability. By setting the parameter B=3, the model will keep only the three candidate sequences with the highest probabilities.
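To make the pruning concrete, here is an illustrative calculation with made-up probabilities: if P(“Pablo” | x) = 0.4 and P(“will” | x, “Pablo”) = 0.5, the sequence “Pablo will” scores 0.4 × 0.5 = 0.20, while a candidate like “During one” with P(“During” | x) = 0.1 and P(“one” | x, “During”) = 0.2 scores only 0.1 × 0.2 = 0.02 and is pruned away.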

Suppose that the candidates with the highest probabilities are “Pablo will”, “Pablo is”, and “Next week”. During the third step, the decoder will guess the subsequent words for each of these 3 candidates.

Image by the author.

The same process is iterated until the model predicts the End of Sentence (EOS) token as the most likely word. The EOS token signals that the sequence has reached its end and is now considered complete.
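The whole procedure can be condensed into a short, framework-agnostic sketch. Here, step_fn is a hypothetical stand-in for the decoder: given the tokens generated so far, it returns the probability of each candidate next token. Log-probabilities are summed rather than multiplying raw probabilities, to avoid numerical underflow:

```python
import math

def beam_search(step_fn, bos, eos, beam_width=3, max_len=20):
    """Minimal beam search decoder (a sketch, not production code)."""
    beams = [(0.0, [bos])]  # (cumulative log-probability, token sequence)
    completed = []

    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:                 # finished: set it aside
                completed.append((score, seq))
                continue
            for token, prob in step_fn(seq).items():
                # Add log-probabilities instead of multiplying raw
                # probabilities, avoiding numerical underflow.
                candidates.append((score + math.log(prob), seq + [token]))
        if not candidates:                     # every beam has finished
            break
        # Prune: keep only the `beam_width` most probable partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]

    # Include any beams that hit max_len without emitting EOS.
    completed.extend(b for b in beams if b[1][-1] != eos)
    return max(completed, key=lambda c: c[0])[1]
```

In a real sequence-to-sequence model, step_fn would run the decoder for one step and return the softmax distribution over the vocabulary (often truncated to its top entries for efficiency). Note also that comparing raw log-probability sums favors shorter sequences, which is exactly the bias the Length-Normalized variant discussed below addresses.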

Beam Search Variants

Beam Search's adaptability can be enhanced by several variants of the algorithm, each better suited to some sequence generation tasks than to others. The most common variants are:

  • Length-Normalized Beam Search: It adjusts for sentence length disparities by normalizing the probability scores, mitigating the bias towards shorter sequences (a minimal scoring sketch follows this list).
  • Diverse Beam Search: It introduces diversity in candidate selection to prevent the algorithm from converging prematurely, promoting exploration of a broader solution space.
  • Top-k Beam Search: It restricts the number of candidates considered at each step to the top-k most probable, offering computational efficiency without compromising performance.
  • Alpha-Diverse Beam Search: It controls the trade-off between diversity and optimization by introducing an alpha parameter, enabling fine-tuning based on specific task requirements.
  • Nucleus Sampling: It defines a probability threshold (nucleus) and samples from the set of most probable candidates, enhancing the algorithm’s focus on high-quality sequences.
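To make one of these variants concrete, here is a minimal sketch of length-normalized scoring. The alpha exponent is a tunable hyperparameter (values around 0.6–0.7 are a common starting point), and the numbers in the demo are illustrative:

```python
import math

def length_normalized_score(log_probs, alpha=0.7):
    """Sum of per-token log-probabilities divided by length**alpha,
    reducing Beam Search's bias toward shorter sequences."""
    return sum(log_probs) / (len(log_probs) ** alpha)

# Illustrative candidates: a short sequence and a longer one whose
# per-token probabilities are higher.
short = [math.log(0.5), math.log(0.5)]   # raw log-prob sum ≈ -1.39
long_ = [math.log(0.6)] * 4              # raw log-prob sum ≈ -2.04

# Without normalization the short sequence wins (-1.39 > -2.04);
# with normalization the longer one prevails (≈ -0.77 > ≈ -0.85).
print(length_normalized_score(short), length_normalized_score(long_))
```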

Conclusion

In summary, in this post we saw why Beam Search is the most used algorithm in the field of sequence generation, particularly in the context of language translation models, but also for chatbot responses. Its systematic exploration of multiple possibilities in parallel, guided by a defined beam width, enhances the precision and contextual relevance of the generated sequences.

I am confident that the step-by-step example I provided will make the abstract concept of Beam Search more concrete.

As I always do in my posts, I will now recap some pros and cons of Beam Search. The main advantages of this algorithm are:

  • Precision: Beam Search excels in generating precise and contextually relevant sequences.
  • Flexibility: Variants like Diverse Beam Search and Top-k Beam Search offer adaptability to diverse task requirements.
  • Efficiency: It efficiently explores a solution space, balancing computational complexity with performance.

However, in Machine Learning there is no free lunch. It is therefore important to clarify the drawbacks of each algorithm. In the case of Beam Search, I identified these possible issues:

  • Computational Cost: Exploring multiple candidate sequences at each step incurs a computational overhead that grows with the beam width.
  • Premature Convergence: In standard configurations, Beam Search might converge prematurely to suboptimal solutions, since pruning can discard partial sequences that would have led to better completions.

Finally, I invite you to explore further resources to get a deeper understanding of this algorithm.

If you liked this story, consider following me to be notified of my upcoming projects and articles!
