The curious case of the vanishing & exploding gradient

Summary

The article discusses the challenges of vanishing and exploding gradients in RNN training and presents methods to mitigate these issues, including gradient clipping, identity RNN architecture, and Long Short-Term Memory (LSTM) networks.

Abstract

The article delves into the training challenges of Recurrent Neural Networks (RNNs), specifically focusing on the vanishing and exploding gradient problems. These issues arise during the backpropagation of errors through time in deep networks, affecting the learning process of long-term dependencies in sequence data. Vanishing gradients occur when gradients diminish exponentially as they are propagated back, hindering the model's ability to learn. Conversely, exploding gradients grow exponentially, leading to numerical instability and potential model failure. To address exploding gradients, the article suggests gradient clipping, a technique that constrains gradients within a threshold to maintain numerical stability without altering the gradient direction. For vanishing gradients, the article proposes the identity RNN architecture, which uses identity matrix initializations and ReLU activation functions to maintain constant error derivatives during backpropagation. Additionally, the article highlights the Long Short-Term Memory (LSTM) architecture as a popular and effective solution for capturing long-term dependencies, thanks to its design that allows for both short-term and long-term memory within the network.

Opinions

The author believes that vanishing and exploding gradients are significant obstacles in training RNNs effectively.
Gradient clipping is presented as a straightforward and effective solution for the exploding gradient problem.
The identity RNN architecture is considered a simple solution to mitigate vanishing gradients.
Long Short-Term Memory (LSTM) networks are highly regarded for their ability to handle long-term dependencies, which are crucial for tasks like text generation and language modeling.
The article implies that addressing the gradient issues is essential for the successful training of RNNs and improving their performance on sequence-based tasks.

Understanding why gradients explode or vanish and methods for dealing with the problem.

In the last post, we introduced a step by step walkthrough of RNN training and how to derive the gradients of the network weights using back propagation and the chain rule. But it turns out that during this training the RNN can suffer greatly from two problems: 1. Vanishing gradients or 2. Exploding gradients.

Why Gradients Explode or Vanish

Recall the many-to-many architecture for text generation shown below and in the introduction to RNN post, lets assume the input sequence to the network is a 20 word sentence: “I grew up in France,…….. I speak French fluently.

We can see from the example above that for the RNN to predict the word “French” which comes at the end of the sequence, it would need information from the word “France”, which occurs further back at the beginning of the sentence. This kind of dependence between sequence data is called long-term dependencies because the distance between the relevant information “France” and the point where it is needed to make a prediction “French” is very wide. Unfortunately, in practice as this distance becomes wider, RNNs have a hard time learning these dependencies because it encounters either a vanishing or exploding gradient problem.

These problems arise during training of a deep network when the gradients are being propagated back in time all the way to the initial layer. The gradients coming from the deeper layers have to go through continuous matrix multiplications because of the the chain rule, and as they approach the earlier layers, if they have small values (<1), they shrink exponentially until they vanish and make it impossible for the model to learn , this is the vanishing gradient problem. While on the other hand if they have large values (>1) they get larger and eventually blow up and crash the model, this is the exploding gradient problem

Dealing with Exploding Gradients

When gradients explode, the gradients could become NaN because of the numerical overflow or we might see irregular oscillations in training cost when we plot the learning curve. A solution to fix this is to apply gradient clipping; which places a predefined threshold on the gradients to prevent it from getting too large, and by doing this it doesn’t change the direction of the gradients it only change its length.

Reference: Ian Goodfellow et. al, “Deep Learning”, MIT press, 2016

Dealing with Vanishing Gradients

One simple solution for dealing with vanishing gradient is the identity RNN architecture; where the network weights are initialized to the identity matrix and the activation functions are all set to ReLU and this ends up encouraging the network computations to stay close to the identity function. This works well because when the error derivatives are being propagated backwards through time, they remain constants of either 0 or 1, hence aren’t likely to suffer from vanishing gradients.

An even more popular and widely used solution is the Long Short-Term Memory architecture (LSTM); a variant of the regular recurrent network which was designed to make it easy to capture long-term dependencies in sequence data. The standard RNN operates in such a way that the hidden state activation are influenced by the other local activations closest to them, which corresponds to a “short-term memory”, while the network weights are influenced by the computations that take place over entire long sequences, which corresponds to a “long-term memory”. Hence the RNN was redesigned so that it has an activation state that can also act like weights and preserve information over long distances, hence the name “Long Short-Term Memory”.

LSTM Unit

The curious case of the vanishing & exploding gradient

Understanding why gradients explode or vanish and methods for dealing with the problem.

Why Gradients Explode or Vanish

Dealing with Exploding Gradients

Dealing with Vanishing Gradients

Conclusion