Eric S. Shi 舍予


Writing Your Own Python Code to Build a Machine Translator

Photo by Google DeepMind on Unsplash

1. Encoder-decoder Algorithm:

The encoder-decoder algorithm is a widely used machine learning approach for mapping an input sequence to an output sequence. The encoder takes the input sequence and converts it into a fixed-length vector representation, often referred to as a hidden or latent representation. The decoder then takes this hidden representation and generates the output sequence.

The encoder and decoder are typically implemented using a variety of neural networks, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), for natural language processing (NLP) tasks, such as machine translation, text summarization, and question answering.

The encoder is usually a recurrent neural network (RNN). RNN encoders are well-suited for sequence-to-sequence (Seq2Seq) tasks because they can learn long-range dependencies between the input and output sequences.

The decoder is typically an RNN or a CNN. RNN decoders are well-suited for generating text because they can learn the sequential relationships between words. CNN decoders are well-suited for tasks whose outputs have spatial structure, such as image generation; in image captioning, by contrast, the CNN usually sits on the encoder side and an RNN generates the caption.

The encoder and decoder are typically trained jointly to minimize the error between the predicted output sequence and the ground truth output sequence.

RNNs are good at processing sequential data, while CNNs are good at processing spatial data. Pairing an RNN encoder for the sequential data with a CNN decoder for the spatial data may therefore outperform both an RNN encoder with an RNN decoder and a CNN encoder with a CNN decoder, on the following grounds: (1) It can achieve better performance on tasks requiring both sequential and spatial data processing. (2) It may be more robust to noise and outliers. (3) It may be more efficient to train, since convolutions can be computed in parallel across positions, whereas an RNN must step through a sequence one element at a time.

In machine translation, the source-language text is sequential data, but the target-language text does not have to follow the same order: the same meaning can often be expressed with the words arranged differently. As a result, it might be beneficial to use an RNN encoder for the source-language text and a CNN decoder for the target-language text.

Likewise, the audio input in speech recognition is sequential data, yet the text output need not be produced strictly one token after another. Consequently, an RNN encoder for the audio input paired with a CNN decoder for the text output is a combination worth considering.

If the input data contains both sequential and spatial data, it is best to use both RNN and CNN in the encoder, in theory. However, in some cases, it is possible to use an RNN encoder alone to extract both the sequential and non-sequential information from the input since RNNs can learn to represent both sequential and spatial information.

A CNN decoder has an advantage over an RNN decoder in handling target language texts because CNNs are not sequential models. CNNs can learn to represent patterns in data without relying on learning the order of the data, whereas an RNN decoder will have to rely on the learned sequential order of the data.

If one would like the machine translation models to handle both sequential and non-sequential target language texts, one could achieve this by using a CNN decoder (as described above). Another way is to use a beam search decoder. Rather than committing to the single most probable word at each step, a beam search decoder keeps a fixed number of the highest-scoring partial translations (the beam width), extends each of them by one word, and repeats. At the end, it selects the complete translation with the highest overall probability. Because several candidate word orders are kept alive at once, this approach can be used to handle non-sequential target language texts.
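To make the beam search idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than a production decoder: `step_fn` is a hypothetical stand-in for a trained decoder, and the probabilities in `TOY_MODEL` are invented for the example.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=2, max_len=5):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(prefix) is a hypothetical stand-in for the decoder: it returns a
    dict mapping each candidate next token to its probability given the prefix.
    """
    beams = [([start_token], 0.0)]  # (tokens so far, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:
                candidates.append((tokens, score))  # finished hypotheses are carried over
                continue
            for token, prob in step_fn(tokens).items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the beam_width best partial translations.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == end_token for tokens, _ in beams):
            break
    return beams[0][0]

# Toy next-token distributions, invented for illustration only.
TOY_MODEL = {
    ('<s>',): {'comment': 0.6, 'allez-vous': 0.4},
    ('<s>', 'comment'): {'allez-vous': 0.9, '</s>': 0.1},
    ('<s>', 'allez-vous'): {'comment': 0.7, '</s>': 0.3},
    ('<s>', 'comment', 'allez-vous'): {'</s>': 1.0},
    ('<s>', 'allez-vous', 'comment'): {'</s>': 1.0},
}

def toy_step(prefix):
    return TOY_MODEL[tuple(prefix)]

best = beam_search(toy_step, '<s>', '</s>')
print(' '.join(best[1:-1]))
```

With a beam width of 2, both opening words stay in play after the first step, and the hypothesis with the highest total log-probability is returned at the end.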

Handling images differs greatly from language translation, as 2D or 3D images are spatial rather than sequential data. Intuitively, one might want to deploy a CNN encoder to address the spatial data.

However, suppose one wants to build a multi-functional encoder-decoder algorithm capable of handling both sequential and non-sequential input data. In that case, it might be advantageous to stick to the RNN-encoder and CNN-decoder configuration and to enable the RNN-encoder to process spatial inputs.

As it turns out, there is indeed a way to teach an RNN encoder to handle spatial data, although in a somewhat twisted manner. This is to teach the RNN-encoder to represent the spatial information of an image by attending to different parts of the image in sequence.

This can be done by first converting the image into a sequence of feature maps. Each feature map represents a different spatial location in the image. The RNN encoder then attends to each feature map and learns to associate each feature map with a set of tokens (e.g., words or phrases). This allows the RNN encoder to learn a representation of the image that is both spatial and semantic.

The feature maps can be arranged, for example, with the first feature map representing the top-left 1/3 of the image, the second representing the top-middle 1/3, the third representing the top-right 1/3, and so on. The sequence of feature maps can then be used to represent the spatial information in the image in a specific sequential order. The association between the spatial information in the image and the candidate words or phrases can be enhanced by using attention mechanisms. The attention mechanism allows the encoder to learn which feature maps are most relevant to each word or phrase. (Note: See Refs. [1], [2], and [3] for my earlier discussions on these topics.)
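The arrangement described above can be sketched in plain Python as follows. The function name `image_to_patch_sequence` is my own, and the mean pixel intensity of each region is only a stand-in for a learned feature map; the point is the reading order (left to right, top to bottom) that turns a 2D grid into a sequence.

```python
def image_to_patch_sequence(image, rows=3, cols=3):
    """Split a 2D image (list of lists) into rows*cols regions, read left to
    right and top to bottom, and summarize each region by its mean intensity."""
    height, width = len(image), len(image[0])
    patch_h, patch_w = height // rows, width // cols
    sequence = []
    for r in range(rows):
        for c in range(cols):
            pixels = [image[y][x]
                      for y in range(r * patch_h, (r + 1) * patch_h)
                      for x in range(c * patch_w, (c + 1) * patch_w)]
            sequence.append(sum(pixels) / len(pixels))
    return sequence

# A 6x6 toy "image" whose pixel value equals its row index.
image = [[row for _ in range(6)] for row in range(6)]
patch_sequence = image_to_patch_sequence(image)
print(patch_sequence)
```

The first three entries describe the top row of regions, the next three the middle row, and the last three the bottom row, so an RNN reading this list visits the image in a fixed spatial order.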

2. Realization of an Encoder-decoder Architecture:

The principles of encoder-decoder learning are based on the following two ideas:

  1. Neural networks (e.g., RNNs) can learn long-range dependencies between sequences.
  2. The input and output sequences can be handled by separate networks.

The first idea is based on the fact that neural networks can learn to represent sequences of data. E.g., RNNs do this by using a hidden state to store information about the sequence. This hidden state is then used to predict the next element in the sequence.

The second idea is based on the fact that the encoder and the decoder communicate only through the hidden representation. The encoder can therefore read an input of any length without knowing what the output will look like, and the decoder can generate an output of a different length without an explicit word-by-word alignment to the input. The relationship between the two sequences is still learned, but implicitly, through the joint training of the two networks.
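As a rough illustration of the first idea, the sketch below runs a one-unit RNN over a short sequence of numbers. The weights `w_in` and `w_rec` are hypothetical fixed values chosen for the example (a real network learns them), and `rnn_encode` is an illustrative name, not a library function.

```python
import math

def rnn_encode(inputs, w_in=0.5, w_rec=0.8):
    """Run a one-unit RNN over a sequence and return every hidden state.
    w_in and w_rec are fixed, hypothetical weights; a real model learns them."""
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

states = rnn_encode([1.0, 0.0, 0.0])
print([round(s, 4) for s in states])
```

Only the first input is non-zero, yet the later states remain non-zero: the hidden state carries information about earlier elements forward through the sequence.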

Encoder-decoder learning typically involves the following key techniques:

  • Encoder-decoder architecture: The encoder takes the input sequence as input and produces a hidden state that represents the sequence. The decoder then takes the hidden state as input and produces the output sequence.
  • Attention and/or Beam search: Attention is a technique that can be used to improve the performance of encoder-decoder models. Attention allows the model to focus on specific parts of the input sequence when generating the output sequence. Beam search is a decoding algorithm that can generate output sequences from an encoder-decoder model. Beam search works by considering a set of possible output sequences and then selecting the most likely correct sequence.

As discussed in my earlier articles [1] and [2], the attention mechanism first calculates a score for each feature map based on the similarity between the feature map and the word or phrase. The feature map with the highest score is then attended to. This process is repeated for each of the candidate words or phrases. The RNN encoder can learn to associate spatial information with specific words or phrases by attending to the most relevant feature maps. Learning to represent spatial information in images is essential for image captioning algorithms.
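The scoring step can be sketched in plain Python as follows. This is a simplified illustration, not the exact mechanism of Refs. [1] and [2]: it assumes dot-product similarity, turns the scores into weights with a softmax, and uses made-up two-component vectors for the feature maps and the query word. The name `attend` is my own.

```python
import math

def attend(query, feature_maps):
    """Score each feature vector against the query by dot product, normalize
    the scores with softmax, and return (weights, weighted average)."""
    scores = [sum(q * f for q, f in zip(query, fm)) for fm in feature_maps]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(feature_maps[0])
    context = [sum(w * fm[d] for w, fm in zip(weights, feature_maps)) for d in range(dim)]
    return weights, context

feature_maps = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three image regions, made up
query = [2.0, 0.0]                                   # vector for one candidate word, made up
weights, context = attend(query, feature_maps)
best_region = weights.index(max(weights))            # the feature map with the highest score
print(best_region, [round(w, 3) for w in weights])
```

Selecting `best_region` corresponds to attending to the single highest-scoring feature map; using the full `weights` list instead gives a soft version in which every region contributes in proportion to its score.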

The conceptually most straightforward algorithm that can be used for encoder-decoder learning consists of two RNNs. A short Python script is given below, written specifically for this article to illustrate the basic coding blocks needed to construct a machine translator using a simple algorithm of RNN encoder and RNN decoder.

The first step is to select a well-established platform to support the encoder-decoder architecture. TensorFlow is an open-source software library for numerical computation using data flow graphs. It is widely used for machine learning, data science, and artificial intelligence. TensorFlow can be imported as follows.

import tensorflow as tf

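The second step is to build dictionaries that map words to integers, since a neural network operates on numbers rather than text. The TensorFlow Keras dataset library offers ready-made word indices through functions such as imdb.get_word_index() and reuters.get_word_index(), but both of these cover English text only, so neither can supply a French vocabulary. For this illustration, the English and French vocabularies are instead built from a tiny parallel corpus of three sentence pairs, with the integer 0 reserved for padding. One can use the following lines of code to build the dictionaries.

# Toy parallel corpus; word ids start at 1 because 0 is reserved for padding
pairs = [('hello', 'Bonjour'), ('goodbye', 'Au revoir'), ('how are you', 'Comment allez-vous')]
english_vocab = {w: i + 1 for i, w in enumerate(sorted({w for en, _ in pairs for w in en.split()}))}
french_vocab = {w: i + 1 for i, w in enumerate(sorted({w for _, fr in pairs for w in fr.split()}))}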

2.1 The Embedding

The third step one can take is to use the models.Sequential() class from the TensorFlow Keras API to create sequential models, i.e., neural networks made up of a sequence of layers.

max_len = 3  # longest sentence, in words; shorter ones are padded with 0

# Encoder: word ids -> 128-component embeddings -> one 64-component summary vector
encoder = tf.keras.models.Sequential([
  tf.keras.layers.Embedding(len(english_vocab) + 1, 128),  # +1 for the padding id 0
  tf.keras.layers.LSTM(64),
])
# Decoder: summary vector, repeated once per output position -> French word probabilities
decoder = tf.keras.models.Sequential([
  tf.keras.layers.RepeatVector(max_len),
  tf.keras.layers.LSTM(64, return_sequences=True),
  tf.keras.layers.Dense(len(french_vocab) + 1, activation='softmax'),
])

In this block of code, layers.Embedding() produces a layer that embeds words into a lower-dimensional space. This is done by assigning each word a vector of numbers (i.e., components) that is learned during training.

The value 128 in the Embedding() call is the number of features used to represent each word. (I.e., the Embedding(len(english_vocab) + 1, 128) layer works by assigning each word in the English vocabulary, plus the padding id 0, a vector of 128 components.) This allows the model to learn the relationships between words and to make more accurate predictions. A higher number of features can capture finer distinctions between words but also requires more memory, computation, and training data. The decoder has no embedding layer of its own in this simple design: layers.RepeatVector(max_len) copies the encoder's summary vector once per output position, and the decoder LSTM (with return_sequences=True) turns those copies into one output per position.
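What an embedding layer does at lookup time can be sketched without TensorFlow. The table below is filled with random numbers purely for illustration (in a trained model these values are learned), and `make_embedding`, `embed`, and the four-word vocabulary are hypothetical names and data of my own.

```python
import random

def make_embedding(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random numbers.
    In a trained model these numbers are learned; here they are random."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Replace each integer id with its row of the embedding table."""
    return [table[i] for i in token_ids]

vocab = {'<pad>': 0, 'how': 1, 'are': 2, 'you': 3}   # toy vocabulary, assumed
table = make_embedding(vocab_size=len(vocab), dim=128)
vectors = embed([vocab[w] for w in 'how are you'.split()], table)
print(len(vectors), len(vectors[0]))   # number of words, components per word
```

Each word is simply an index into the table, so the memory cost of the layer grows with the vocabulary size times the number of components.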

2.2 The Neural Network

The layers.LSTM() is a layer from the TensorFlow Keras API that implements a long short-term memory (LSTM) recurrent neural network. LSTMs are a type of RNN well-suited to tasks with long-term dependencies, such as machine translation. The value 64 passed to LSTM() in both the encoder and the decoder is the number of neurons (i.e., units or nodes) in the LSTM layer.

The number of neurons sets the size of the LSTM's hidden state, i.e., how many numbers the layer uses to summarize what it has read so far. The more neurons in the LSTM layer, the more information the state can hold and the more complex the model can be, at the cost of more parameters to train and a greater risk of overfitting on small datasets.

More LSTM layers can be used for more sophisticated algorithms and applications. This is because each additional layer allows the model to learn more complex relationships between words. However, it is important to note that increasing the sophistication of the LSTM layer or adding additional LSTM layers will also increase the complexity of the encoder-decoder model and, consequently, the amount of computation required.

Not all LSTM layers need to have the same number of neurons. In fact, it can be beneficial to have different numbers of neurons in different layers. This is because different layers can learn different aspects of the problem. For example, the first layer could learn the basic relationships between words, while the second layer could learn more complex relationships.

The best way to select the number of neurons in each LSTM layer as well as the number of the LSTM layers is to experiment with different values and see what works best for your specific problem.

Dense() is a layer from the TensorFlow Keras API used to implement a fully connected layer, connecting all of the neurons in one layer to all of the neurons in the next layer.
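For readers who want to see the arithmetic, a fully connected layer followed by softmax can be sketched in plain Python. The weights and biases below are made-up numbers, and `dense_softmax` is an illustrative name of my own rather than a Keras function.

```python
import math

def dense_softmax(inputs, weights, biases):
    """Fully connected layer followed by softmax.
    weights[j][i] connects input i to output j."""
    logits = [sum(w * x for w, x in zip(row, inputs)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two inputs feeding three outputs (e.g., a three-word vocabulary); values are made up.
probs = dense_softmax([1.0, 2.0],
                      weights=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                      biases=[0.0, 0.0, 0.0])
print([round(p, 3) for p in probs])
```

Every output depends on every input, and the softmax rescales the outputs so that they are positive and sum to 1, which lets them be read as probabilities over the vocabulary.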

The fourth step one can take is to train the encoder-decoder model to enable it to perform the desired machine translations. Because the decoder can only learn from the vector the encoder hands it, the two are chained into a single model and trained jointly. One can use the compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) method from the TensorFlow Keras API to compile the model. In this compile() method, the optimizer argument specifies the optimizer used to train the model; the loss argument specifies the loss function to be minimized (the sparse variant accepts integer word ids as targets, whereas 'categorical_crossentropy' expects one-hot vectors); and the metrics argument specifies the metrics used to track the model's performance during training and evaluation.

One can then use the fit(x, y, epochs=500) method from the TensorFlow Keras API to run the training. The first argument, x, holds the training inputs (the English sentences as padded sequences of word ids); the second argument, y, holds the training targets (the corresponding French sentences, encoded the same way), not the validation data, which would be passed separately through the validation_data argument; the epochs argument specifies the number of passes made over the training data. In the code below, pairs is assumed to be the list of (English, French) sentence pairs and max_len the padded sentence length in words.

def to_ids(text, vocab):  # words -> integer ids, padded with 0 up to max_len
  ids = [vocab[w] for w in text.split()]
  return ids + [0] * (max_len - len(ids))
x = tf.constant([to_ids(en, english_vocab) for en, _ in pairs])
y = tf.constant([to_ids(fr, french_vocab) for _, fr in pairs])
model = tf.keras.models.Sequential([encoder, decoder])  # train both parts jointly
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x, y, epochs=500, verbose=0)

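After the above-specified training, the encoder-decoder model is ready for making machine translations (i.e., the translation predictions). One can then use the predict() method from the TensorFlow Keras API to make predictions with the encoder-decoder model. This method expects numbers rather than raw text, so the sentence is first lower-cased and converted to a padded list of word ids, here with the to_ids() helper that maps each word to its integer and pads with 0 up to max_len. encoder.predict() then returns the encoded sentence: a single vector of 64 numbers that summarizes the whole input.

The decoder.predict(encoded_sentence) call takes that vector and returns, for each output position, a list of probabilities, one for each word in the French vocabulary. Picking the most probable entry at each position and looking it up in the reversed French dictionary yields the translated words.

french_words = {i: w for w, i in french_vocab.items()}  # integer id -> French word

def translate(sentence):
  encoded_sentence = encoder.predict(tf.constant([to_ids(sentence.lower(), english_vocab)]), verbose=0)
  decoded_sentence = decoder.predict(encoded_sentence, verbose=0)
  ids = decoded_sentence[0].argmax(axis=-1)  # most probable word id at each position
  return ' '.join(french_words[i] for i in ids if i != 0)  # skip padding

for sentence in ['Hello', 'Goodbye', 'How are you']:
  print(sentence, '->', translate(sentence))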
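The step from probabilities to words can also be sketched without TensorFlow. The probability lists and the two-word dictionary below are invented for illustration, and `decode` is a hypothetical helper name; the sketch assumes that id 0 stands for padding.

```python
def decode(probabilities, id_to_word):
    """Turn per-position probability lists into words; id 0 is treated as padding."""
    words = []
    for dist in probabilities:
        best_id = max(range(len(dist)), key=lambda i: dist[i])  # index of the largest probability
        if best_id != 0:
            words.append(id_to_word[best_id])
    return ' '.join(words)

id_to_word = {1: 'Au', 2: 'revoir'}   # toy dictionary, assumed
probabilities = [
    [0.1, 0.8, 0.1],     # position 1: id 1 is most probable
    [0.2, 0.1, 0.7],     # position 2: id 2 is most probable
    [0.9, 0.05, 0.05],   # position 3: padding is most probable
]
print(decode(probabilities, id_to_word))
```

Taking the single most probable word at every position is the simplest decoding rule; beam search, discussed earlier, is a more careful alternative.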

2.3 Putting Everything Together

Putting all these code blocks together, one can have the following code.

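import tensorflow as tf

# Toy parallel corpus, for illustration only; a real translator needs far more data
pairs = [('hello', 'Bonjour'), ('goodbye', 'Au revoir'), ('how are you', 'Comment allez-vous')]

# Word -> integer dictionaries; ids start at 1 because 0 is reserved for padding
english_vocab = {w: i + 1 for i, w in enumerate(sorted({w for en, _ in pairs for w in en.split()}))}
french_vocab = {w: i + 1 for i, w in enumerate(sorted({w for _, fr in pairs for w in fr.split()}))}
french_words = {i: w for w, i in french_vocab.items()}  # integer id -> French word
max_len = 3  # longest sentence, in words

def to_ids(text, vocab):
  # Convert a sentence to word ids and pad with 0 up to max_len
  ids = [vocab[w] for w in text.split()]
  return ids + [0] * (max_len - len(ids))

x = tf.constant([to_ids(en, english_vocab) for en, _ in pairs])
y = tf.constant([to_ids(fr, french_vocab) for _, fr in pairs])

# Encoder: word ids -> embeddings -> one 64-component summary vector
encoder = tf.keras.models.Sequential([
  tf.keras.layers.Embedding(len(english_vocab) + 1, 128),
  tf.keras.layers.LSTM(64),
])
# Decoder: summary vector, repeated per output position -> French word probabilities
decoder = tf.keras.models.Sequential([
  tf.keras.layers.RepeatVector(max_len),
  tf.keras.layers.LSTM(64, return_sequences=True),
  tf.keras.layers.Dense(len(french_vocab) + 1, activation='softmax'),
])

# Chain encoder and decoder so that they are trained jointly
model = tf.keras.models.Sequential([encoder, decoder])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x, y, epochs=500, verbose=0)

def translate(sentence):
  encoded_sentence = encoder.predict(tf.constant([to_ids(sentence.lower(), english_vocab)]), verbose=0)
  decoded_sentence = decoder.predict(encoded_sentence, verbose=0)
  ids = decoded_sentence[0].argmax(axis=-1)  # most probable word id at each position
  return ' '.join(french_words[i] for i in ids if i != 0)  # skip padding

for sentence in ['Hello', 'Goodbye', 'How are you']:
  print(sentence, '->', translate(sentence))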

Comparing the encoder and decoder definitions in the code above, one can easily see that both are built around an LSTM network, i.e., an RNN.

Once the model has been trained successfully on English-French sentence pairs that include these phrases, executing the code should produce the following outputs:

  • Hello -> Bonjour
  • Goodbye -> Au revoir
  • How are you -> Comment allez-vous

On the left of the “->” sign are the English words “Hello,” “Goodbye,” and “How are you,” as written in the second last line of the script. On the right of the “->” sign are the corresponding (translated) French words “Bonjour,” “Au revoir,” and “Comment allez-vous.”

In such an algorithm, the encoder takes the input sequence as input and produces a hidden state that represents the sequence. The decoder then takes the hidden state as input and produces the output sequence.

If one replaces the English phrases listed in the second last line of the code with other English phrases, one should obtain the corresponding French phrases, provided the model has been trained on a parallel corpus that covers those words; a word missing from the vocabulary cannot be translated. Likewise, by swapping in a different parallel corpus, one could build a prototype English-German, French-Spanish, or Chinese-Japanese machine translator.

For more sophisticated algorithms/applications, (1) instead of using the models.Sequential() class, one could use the models.Model() class to create a custom model with a more complex architecture; (2) one could place a layers.TextVectorization() layer in front of the layers.Embedding() layer to handle tokenization and the mapping of raw text to integer ids inside the model; and (3) instead of using the layers.LSTM() layer, one could use attention-based layers such as layers.MultiHeadAttention() to build a Transformer, which is not an RNN but is better suited for long-range dependencies.

It is important to point out: the code above is provided here to illustrate the concepts; it is not meant to be a commercial product. Additional sophistication and tuning of the neural network as well as more targeted training with appropriate corpora will be required to create a commercially viable machine translator.

3. Logic Behind the Encoder-decoder Learning:

Conceptually, the goal of encoder-decoder learning is to map an input sequence,

$$x = (x_1, x_2, \ldots, x_n),$$

to an output sequence,

$$y = (y_1, y_2, \ldots, y_m).$$

The input and output sequences can have different lengths, and the mapping between them can be many-to-many, one-to-many, or many-to-one. The encoder-decoder learning can be achieved by treating the input sequence as the encoder input and the encoder output sequence as the decoder input. The encoder takes in the input sequence and produces a fixed-shape context vector, which summarizes the input sequence. The decoder then takes in the context vector and generates the output sequence token by token, conditioned on the previous output tokens and the context vector.

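Mathematically, the encoder-decoder learning can be expressed as follows, writing $x_1, \ldots, x_n$ for the input tokens, $y_1, \ldots, y_m$ for the output tokens, and $c$ for the context vector:

$$c = \mathrm{Encoder}(x_1, \ldots, x_n), \qquad y_t \sim P(y_t \mid y_1, \ldots, y_{t-1}, c).$$

The training objective of the model is to maximize the conditional probability of the output sequence given the input sequence:

$$P(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = \prod_{t=1}^{m} P(y_t \mid y_1, \ldots, y_{t-1}, c),$$

where $\prod_{t=1}^{m}$ is the mathematical notation for "product from 1 to m". The following expression,

$$P(y_t \mid y_1, \ldots, y_{t-1}, c),$$

is the probability of the $t$-th output token given the context vector and all the output tokens before it. For instance, if m = 3, then the product would be:

$$P(y_1 \mid c) \, P(y_2 \mid y_1, c) \, P(y_3 \mid y_1, y_2, c).$$

Note that,

$$-\log P(y_1, \ldots, y_m \mid x_1, \ldots, x_n) = -\sum_{t=1}^{m} \log P(y_t \mid y_1, \ldots, y_{t-1}, c),$$

i.e., the negative log-likelihood of the output sequence is equal to the sum of the negative log-probabilities of each of the output variables. In other words, each term measures the information still missing when we know the output variables up to a certain point but not the next output variable.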

So, suppose one wants to maximize the conditional probability of the output sequence given the input sequence. In that case, one can do so by minimizing the negative log-likelihood of the output sequence. This works because the logarithm is a monotonically increasing function: whichever parameters make the probability largest also make its negative logarithm smallest. In information terms, minimizing the negative log-likelihood is equivalent to minimizing the amount of information lost, which in turn is equivalent to maximizing the amount of information retained.
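A small numerical check may help here. The per-token probabilities below are hypothetical values standing in for what a decoder might assign to a three-token output; nothing about them comes from a trained model.

```python
import math

# Hypothetical per-token probabilities P(y_t | y_1..y_{t-1}, c) for a 3-token output.
good_model = [0.9, 0.8, 0.5]
poor_model = [0.5, 0.4, 0.2]

def sequence_probability(token_probs):
    """Probability of the whole sequence: the product of the per-token terms."""
    return math.prod(token_probs)

def negative_log_likelihood(token_probs):
    """Sum of the negative log-probabilities of the individual tokens."""
    return -sum(math.log(p) for p in token_probs)

for name, probs in [('good', good_model), ('poor', poor_model)]:
    print(name, round(sequence_probability(probs), 4), round(negative_log_likelihood(probs), 4))
```

The model that assigns the higher probability to the sequence has the lower negative log-likelihood, and in both cases the sum of the per-token terms equals the negative logarithm of the product.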

In practice, this can be done by using a technique called gradient descent. Gradient descent is an iterative optimization method that starts with a random guess for the model’s parameters and then updates the parameters in the direction of the negative gradient of the loss function. This process is repeated until the loss function converges to a minimum.

In the case of maximizing the conditional probability of the output sequence given the input sequence, the loss function is the negative log-likelihood of the output sequence. So, gradient descent will start with a random guess for the model’s parameters and then update the parameters in the direction of the negative gradient of the negative log-likelihood. This process will be repeated until the negative log-likelihood converges to a minimum.

When the negative log-likelihood converges to a minimum, the model will have learned parameters that maximize, at least locally, the conditional probability of the output sequence given the input sequence.
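The procedure can be illustrated on a deliberately tiny problem. The sketch below is not a translator: it assumes a model with a single parameter `theta` that predicts a binary outcome with probability sigmoid(theta), uses four invented observations, and starts from 0.0 rather than a random guess so that the run is reproducible.

```python
import math

def nll(theta, targets):
    """Negative log-likelihood of binary targets under p = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return -sum(math.log(p) if t == 1 else math.log(1.0 - p) for t in targets)

def nll_gradient(theta, targets):
    """Derivative of nll with respect to theta: the sum of (p - t) over the targets."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return sum(p - t for t in targets)

targets = [1, 1, 1, 0]   # toy observations, invented for illustration
theta = 0.0              # initial guess for the parameter
learning_rate = 0.1
losses = []
for _ in range(200):
    losses.append(nll(theta, targets))
    theta -= learning_rate * nll_gradient(theta, targets)  # step against the gradient

p_learned = 1.0 / (1.0 + math.exp(-theta))
print(round(losses[0], 3), round(losses[-1], 3), round(p_learned, 3))
```

The loss falls as the updates proceed, and the learned probability approaches the fraction of positive outcomes in the data (three out of four), which is the value that maximizes the likelihood of these observations.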

4. Closing Remarks:

In addition to machine translation, the encoder-decoder algorithm with attention mechanisms has also been applied to many other tasks, such as text summarization, speech recognition, image captioning, and dialogue generation.

One limitation of the encoder-decoder algorithm with attention mechanisms is that it can be computationally expensive and slow to train, especially for long input and output sequences. One can replace the RNN with a CNN in the encoder or decoder to address this or use multi-head attention instead of single-head attention. Interested readers, please stay tuned for my future articles in this series.

The encoder-decoder algorithm with attention mechanisms is a powerful tool for building machine-learning models that handle variable-length input and output sequences. The encoder converts the input sequence into a fixed-shape context variable, while the decoder generates the output sequence token by token, conditioned on the context variable. By incorporating attention mechanisms, the model can selectively focus on relevant information and perform better in various applications, such as machine translation and image captioning.

Example code was provided in the article to illustrate the principles and key processes to transform the encoder-decoder concepts into an algorithm for a prototype machine translator.

References:

[1] Eric S. Shi, Attention Mechanisms in Transformers: A Deep Dive into Queries, Keys, Values, and Pooling Techniques, Medium (2023). https://readmedium.com/attention-mechanisms-in-transformers-29c1768f83f3

[2] Eric S. Shi, Selective Attention: The Key to Unlocking the Full Potential of Deep Learning, Medium (2023). https://readmedium.com/selective-attention-the-key-to-unlocking-the-full-potential-of-deep-learning-5ee6bc55b419

[3] Eric S. Shi, How an AI Model Acquires Its Writing Capability, Medium (2023).

