Beam Search: the Most Used Algorithm in Sequence Models
Learn the working principles of the most famous algorithm for text translation and speech recognition.

Imagine you’re an AI language model, like ChatGPT, completing a sentence. How do you choose the next word to make it not just grammatically correct but also contextually relevant? This is where Beam Search steps in.
By efficiently exploring multiple candidate sequences in parallel and keeping only the top candidates at each step, Beam Search plays a crucial role in predicting the next elements of a sequence. This effective and powerful algorithm helps keep the output aligned with grammatical constraints and with the context.
To understand Beam Search's impact, think about all the applications that require precise sequence generation: language translation, text completion, and chatbots all rely on it.
In this article, I will introduce the theory and guide you through a practical, step-by-step example of the Beam Search algorithm. I will also present several Beam Search variants and discuss the pros and cons of this fundamental algorithm.
Understanding Beam Search
Imagine you need to translate the following sentence from Spanish to English:
Pablo estará en Nueva York la próxima semana.
We don’t just want a correct translation; we want the best one. For a language model, the best output is the most probable one.
To achieve this, most sequence-to-sequence models employ Beam Search, a heuristic algorithm that systematically explores multiple possibilities in parallel. At each step, a parameter called the beam width fixes the number of top candidates the algorithm keeps, allowing it to pursue several partial translations at once.
This approach mimics decision-making processes, with the model evaluating and selecting the most promising options.
Beam Search Step-by-Step
Consider a standard sequence-to-sequence model, represented by the simple network below:

At each timestep t, the model outputs a single word; together, these words compose the final sequence, which is the translated sentence we are looking for.
If you are not familiar with RNNs, I suggest checking the simple introduction I included in this article:
For text translation, we commonly use an encoder, the portion of the network dedicated to converting the input sequence into a vector representation. In this case, the input sequence is the Spanish sentence to be translated. The encoder is followed by a decoder which, at each timestep, returns a piece of the output sequence.

Let’s examine the first step in detail.
The encoder (green part) receives the Spanish sentence “Pablo estará en Nueva York la próxima semana.” and provides a vector form of it to the decoder (blue part).

At timestep 1, the decoder will output the first word of the translation. The key question is:
How does the decoder choose the word?
The decoder chooses the most probable word given the input sequence x. In other words, for each word in the vocabulary, the model computes the probability P(y¹ | x) that it is the first word of the output sequence.

The word that maximizes this probability is chosen.
Here is where Beam Search enters the game: it introduces a parameter B, called beam width, which represents the number of words chosen by the model at each step.
By setting, for instance, B=3, the model will return not only one but three candidate words.
Let’s suppose the most probable first words of the translation are “Pablo”, “Next”, and “During”.

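To make this first step concrete, here is a minimal NumPy sketch. The toy vocabulary, the probability values, and names such as `first_step_log_probs` are invented for illustration; a real decoder would score a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Toy vocabulary and made-up probabilities for the first output word,
# as if produced by the decoder at timestep 1 given the Spanish input x.
vocab = ["Pablo", "Next", "During", "will", "week", "in"]
first_step_log_probs = np.log([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

B = 3  # beam width

# Keep the B most probable first words: this is the initial beam.
top_indices = np.argsort(first_step_log_probs)[::-1][:B]
beam = [([vocab[i]], float(first_step_log_probs[i])) for i in top_indices]

for words, log_prob in beam:
    print(words, round(log_prob, 3))
# ['Pablo'] -0.916, ['Next'] -1.386, ['During'] -1.897
```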
For each of these 3 candidate words, the model proceeds by guessing the second word y² in the English translation.

As always, the model picks the words that maximize a probability. In this case, it is the joint probability of the first two words given the input:
P(y¹, y² | x) = P(y¹ | x) · P(y² | x, y¹)

You can clearly see the reason for the name Beam Search now. Starting from a narrow set of 3 candidates, the algorithm widens its view at each step, considering all possible continuations and the probabilities of the resulting combinations before narrowing the set back down.
The output of step 2 consists of 9 candidate sequences: “Pablo will”, “Pablo is”, …, “During next”, and “During one”. The model associates each one of them with a probability. Since the parameter is set to B=3, the model keeps only the three candidate sequences with the highest probabilities, as the sketch below illustrates.
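The snippet below shows how this expand-and-prune step could look in code: each of the 3 candidate first words is combined with a handful of possible second words, the log-probabilities are added, and only the 3 best joint scores survive. The reduced vocabulary and all numbers are made up for illustration.

```python
import numpy as np

B = 3
second_words = ["will", "is", "week"]

# Made-up log P(y² | x, y¹) for each candidate first word.
next_log_probs = {
    "Pablo":  np.log([0.5, 0.3, 0.2]),
    "Next":   np.log([0.1, 0.2, 0.7]),
    "During": np.log([0.2, 0.3, 0.5]),
}
# Running log-probabilities log P(y¹ | x) of the current beam (also made up).
beam_scores = {"Pablo": np.log(0.40), "Next": np.log(0.25), "During": np.log(0.15)}

# Expand: score all 3 x 3 = 9 two-word sequences with
# log P(y¹, y² | x) = log P(y¹ | x) + log P(y² | x, y¹).
expanded = [
    (first + " " + second_words[j], float(beam_scores[first] + next_log_probs[first][j]))
    for first in beam_scores
    for j in range(len(second_words))
]

# Prune: keep only the B sequences with the highest joint probability.
beam = sorted(expanded, key=lambda c: c[1], reverse=True)[:B]
print(beam)  # [('Pablo will', ...), ('Next week', ...), ('Pablo is', ...)]
```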
Suppose that the candidates with the highest probabilities are “Pablo will”, “Pablo is”, and “Next week”. During the third step, the decoder will guess the subsequent word for each of these 3 candidates.

The same process is repeated until the model predicts the End of Sentence (EOS) token as the most likely word. The EOS token indicates that the sequence has reached its end and is now considered complete.
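Putting the pieces together, here is a compact, self-contained sketch of the whole loop. The `step_log_probs` function is a made-up stand-in for a real decoder (which would condition on the encoded Spanish sentence); the expand-and-prune logic and the EOS handling follow the description above.

```python
import numpy as np

EOS = "<eos>"
VOCAB = ["Pablo", "will", "be", "in", "New", "York", "next", "week", EOS]

def step_log_probs(prefix):
    """Stand-in for the decoder: returns log P(next word | x, prefix).
    A real model would condition on the encoded input sentence x; here we
    just draw a reproducible toy distribution over the vocabulary."""
    rng = np.random.default_rng(len(prefix))
    return np.log(rng.dirichlet(np.ones(len(VOCAB))))

def beam_search(B=3, max_len=12):
    beam = [([], 0.0)]                    # (words so far, cumulative log-prob)
    completed = []
    for _ in range(max_len):
        # Expand: extend every candidate in the beam with every possible next word.
        expanded = [
            (words + [w], score + lp)
            for words, score in beam
            for w, lp in zip(VOCAB, step_log_probs(words))
        ]
        expanded.sort(key=lambda c: c[1], reverse=True)
        # Prune: keep the B best partial sequences, setting finished ones aside.
        beam = []
        for words, score in expanded:
            if words[-1] == EOS:
                completed.append((words, score))
            else:
                beam.append((words, score))
            if len(beam) == B:
                break
        if not beam:                      # every top candidate ended with EOS
            break
    return completed or beam

print(beam_search())
```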
Beam Search Variants
Beam Search’s adaptability can be enhanced by several variants of the algorithm, each more or less suited to different sequence generation tasks. The most common variants are:
- Length-Normalized Beam Search: It adjusts for sentence length disparities by normalizing the probability scores, mitigating the bias towards shorter sequences (see the scoring sketch after this list).
- Diverse Beam Search: It introduces diversity in candidate selection to prevent the algorithm from converging prematurely, promoting exploration of a broader solution space.
- Top-k Beam Search: It restricts the number of candidates considered at each step to the top-k most probable, offering computational efficiency without compromising performance.
- Alpha-Diverse Beam Search: It controls the trade-off between diversity and optimization by introducing an alpha parameter, enabling fine-tuning based on specific task requirements.
- Nucleus Sampling: It defines a probability threshold (nucleus) and samples from the set of most probable candidates, enhancing the algorithm’s focus on high-quality sequences.
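As an illustration of the first variant in this list, one common form of length normalization (the exact formula varies across implementations) divides the cumulative log-probability of a candidate by its length raised to a power alpha:

```python
import numpy as np

def length_normalized_score(log_probs, alpha=0.7):
    """Length-normalized score of a candidate sequence.

    log_probs: per-word log-probabilities log P(y_t | x, y_<t).
    alpha: 0 disables normalization, 1 averages the log-probability per word;
           values around 0.6-0.7 are a common middle ground.
    """
    return float(np.sum(log_probs)) / (len(log_probs) ** alpha)

# Raw sums of log-probabilities favor the shorter sequence, even though the
# longer one has a higher per-word probability; normalization corrects this.
short_seq = [np.log(0.5), np.log(0.5)]   # sum = log 0.25
long_seq = [np.log(0.6)] * 4             # sum ≈ log 0.13
print(length_normalized_score(short_seq), length_normalized_score(long_seq))
```

With alpha = 0 the score reduces to the plain sum of log-probabilities, while alpha = 1 turns it into the average log-probability per word.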

Conclusion
In summary, in this post we saw why Beam Search is the most used algorithm in the field of sequence generation, particularly in the context of language translation models, but also for chatbot responses. Its systematic exploration of multiple possibilities in parallel, guided by a defined beam width, enhances the precision and contextual relevance of the generated sequences.
I am confident that the step-by-step example I provided will make the abstract concept of Beam Search more concrete.
As I always do in my posts, I will now recap some pros and cons of Beam Search. The main advantages of this algorithm are:
- Precision: Beam Search excels in generating precise and contextually relevant sequences.
- Flexibility: Variants like Diverse Beam Search and Top-k Beam Search offer adaptability to diverse task requirements.
- Efficiency: It efficiently explores a solution space, balancing computational complexity with performance.
However, in Machine Learning there is no free lunch. It is therefore important to clarify the drawbacks of each algorithm. In the case of Beam Search, I identified these possible issues:
- Computational Cost: Exploring multiple candidates in parallel incurs a computational overhead that grows with the beam width.
- Premature Convergence: In standard configurations, Beam Search might converge prematurely to suboptimal solutions, especially with a small beam width.

Finally, I invite you to explore the provided resources to get a deeper understanding of this algorithm.
If you liked this story, consider following me to be notified of my upcoming projects and articles!
References
- Deep Learning for Natural Language Processing, Brownlee, J. (2019)
- Long short-term memory, Hochreiter, S., & Schmidhuber, J. (1997)
- Neural Turing Machines, Graves, A., et al. (2014)
- A Diversity-Promoting Objective Function for Neural Conversation Models, Li, J., et al. (2016)
- Neural Machine Translation by Jointly Learning to Align and Translate, Bahdanau, D., et al. (2015)