Summary

The provided web content discusses the architecture and functionality of Long Short Term Memory (LSTM) networks, detailing their structure, the purpose of gates, and the equations governing their operation, emphasizing their ability to handle long-term dependencies and mitigate issues like vanishing and exploding gradients present in traditional RNNs.

Abstract

LSTM networks are a specialized type of recurrent neural network (RNN) designed to overcome the limitations of standard RNNs, particularly in handling long-term dependencies and avoiding the vanishing and exploding gradient problems. The article explains that LSTMs achieve this through a sophisticated architecture involving a cell state that acts as a memory component, allowing the network to selectively remember information over long sequences. This cell state is regulated by three types of gates: the input gate, forget gate, and output gate. Each gate employs a sigmoid activation function to determine the flow of information, with outputs close to 0 or 1, effectively blocking or allowing the passage of data. The equations governing these gates and the cell state are provided to elucidate the LSTM's internal mechanisms. The article also illustrates how LSTMs can maintain context within sequences, such as subject-verb agreement over long sentences, and concludes with a visual representation of an LSTM block at a given time step, summarizing the flow of data and transformations occurring within the network.

Opinions

The author suggests that understanding LSTM requires familiarity with the concept of gates and cell state, indicating a complexity that may not be immediately accessible to those without prior knowledge.
The article implies that LSTMs are superior to traditional RNNs for tasks involving long-term dependencies due to their ability to remember and utilize past information effectively.
The use of sigmoid functions in gates is justified by the need for clear decisions on whether to keep or discard certain features, reflecting a design choice tailored to the network's memory functionality.
The author encourages reader engagement by inviting suggestions and acknowledging the potential helpfulness of the post, as indicated by the request for feedback and claps.

LSTM and its equations

LSTM stands for Long Short Term Memory. I found it difficult to understand LSTM directly without prior knowledge of the gates and the cell state used in Long Short Term Memory neural networks, so this post is an attempt to get familiar with an LSTM model through its gates and cell state.

Why do we need LSTM if we have RNN?

LSTM addresses problems faced by the plain RNN model. In particular, it can be used to mitigate:

Long term dependency problem in RNNs.
Vanishing Gradient & Exploding Gradient.

The heart of an LSTM network is its cell, or cell state, which gives the LSTM a bit of memory so that it can remember the past.

For example, the cell state may remember the gender or number of the subject in a given input sequence so that the proper pronoun or verb can be used.

Let us consider some examples:

The cat which already ate ………………… was full.
The cats which already ate …………………. were full.

The dots stand for a long stretch of text in which the subject does not change.

In the first sentence, “The cat” is singular, so the LSTM cell must remember that feature in order to use “was”.

Similarly, in the second example, “were” should be used for the subject “The cats”.

LSTM is made up of Gates:

In LSTM we will have 3 gates:

1) Input Gate.

2) Forget Gate.

3) Output Gate.

Gates in an LSTM are sigmoid activation functions, i.e. they output a value between 0 and 1, and in most cases the value is close to either 0 or 1.

We use the sigmoid function for gates because we want a gate to give only positive values and a clear-cut answer on whether to keep a particular feature or discard it.

“0” means the gate is blocking everything.

“1” means the gate is allowing everything to pass through.
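
The blocking and passing behaviour can be seen in a minimal sketch. It uses only the Python standard library, and the pre-activation values and feature values are hypothetical, picked so the gate lands near 0, at 0.5, and near 1:

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def apply_gate(gate, features):
    """Scale each feature by its gate activation (element-wise product)."""
    return [g * x for g, x in zip(gate, features)]


# Hypothetical pre-activations: strongly negative, zero, strongly positive.
pre_activations = [-10.0, 0.0, 10.0]
gate = [sigmoid(z) for z in pre_activations]

# The same feature value is almost blocked, halved, or almost fully passed.
features = [5.0, 5.0, 5.0]
gated = apply_gate(gate, features)

print([round(g, 3) for g in gate])
print([round(v, 3) for v in gated])
```

Note that the sigmoid never reaches exactly 0 or 1; it only gets very close to them when the pre-activation is large in magnitude.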

The equations for the gates in LSTM are:
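
In the standard LSTM formulation, the three gates are usually written as follows. The notation is assumed here: σ is the sigmoid function, w_{x} and b_{x} are the weights and bias of gate x, h_{t-1} is the output of the previous LSTM block, and x_{t} is the input at the current time step.

```latex
\begin{aligned}
i_t &= \sigma\left(w_i \, [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(w_f \, [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(w_o \, [h_{t-1}, x_t] + b_o\right)
\end{aligned}
```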

The first equation is for the input gate, which tells us what new information we are going to store in the cell state (which we will see below).

The second is for the forget gate, which tells us what information to throw away from the cell state.

The third is for the output gate, which provides the activation for the final output of the LSTM block at time step t.

The equations for the cell state, candidate cell state and the final output:
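
In the standard formulation these are usually written as follows. As an assumption of notation, the candidate cell state appears as \tilde{c}_t here (it is written c`_{t} in the text), w_{c} and b_{c} are its weights and bias, and * is element-wise multiplication.

```latex
\begin{aligned}
\tilde{c}_t &= \tanh\left(w_c \, [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t * c_{t-1} + i_t * \tilde{c}_t \\
h_t &= o_t * \tanh(c_t)
\end{aligned}
```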

To get the memory vector for the current time step (c_{t}), the candidate cell state (c`_{t}) is calculated first.

From the above equation we can see that at any time step the cell state knows what it needs to forget from the previous state (i.e. f_{t} * c_{t-1}) and what it needs to take in from the current time step (i.e. i_{t} * c`_{t}).

Note: * represents element-wise multiplication of the vectors.
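
To make the update concrete, here is a small sketch with made-up three-dimensional vectors. All values are hypothetical and chosen only so that each component shows a different behaviour: keep the old memory, overwrite it, or blend the two.

```python
def elementwise(a, b):
    """Element-wise product of two equal-length vectors (the * in the text)."""
    return [x * y for x, y in zip(a, b)]


# Hypothetical gate activations and states for a 3-dimensional cell.
f_t = [1.0, 0.0, 0.5]      # forget gate: keep, drop, keep half
c_prev = [2.0, -3.0, 4.0]  # previous cell state c_{t-1}
i_t = [0.0, 1.0, 0.5]      # input gate: ignore, accept, accept half
c_cand = [0.9, 0.7, -0.2]  # candidate cell state

kept = elementwise(f_t, c_prev)   # what survives from the past
added = elementwise(i_t, c_cand)  # what is taken in from the present
c_t = [k + a for k, a in zip(kept, added)]

print(c_t)  # approximately [2.0, 0.7, 1.9]
```

The first component keeps the old memory untouched, the second is replaced entirely by the candidate, and the third is a mix of both.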

Lastly, the cell state is passed through the tanh activation function and filtered by the output gate (o_{t}), which decides what portion appears as the output (h_{t}) of the current LSTM unit at time step t.

We can pass h_{t}, the output of the current LSTM block, through a softmax layer to get the predicted output (y_{t}) for the current block.
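
Putting the pieces together, one LSTM time step can be sketched in plain Python. This is an illustration under stated assumptions, not a reference implementation: the helper names (`affine`, `init_params`, `lstm_step`), the sizes, and the random weights are all hypothetical, and the softmax is applied directly to h_{t}, whereas a practical softmax layer would usually apply a learned linear projection first.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def softmax(v):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(hidden, inputs, seed=0):
    """Hypothetical random weights and zero biases for the i, f, o gates and candidate c."""
    rng = random.Random(seed)
    params = {}
    for name in ("i", "f", "o", "c"):
        params["w_" + name] = [
            [rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
            for _ in range(hidden)
        ]
        params["b_" + name] = [0.0] * hidden
    return params


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new output h_t and cell state c_t."""
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    i_t = [sigmoid(z) for z in affine(params["w_i"], concat, params["b_i"])]
    f_t = [sigmoid(z) for z in affine(params["w_f"], concat, params["b_f"])]
    o_t = [sigmoid(z) for z in affine(params["w_o"], concat, params["b_o"])]
    c_cand = [math.tanh(z) for z in affine(params["w_c"], concat, params["b_c"])]
    # Forget part of the old memory, then add the gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    # Squash the cell state with tanh and filter it with the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


hidden, inputs = 3, 2
params = init_params(hidden, inputs)
h, c = [0.0] * hidden, [0.0] * hidden

# Run a short hypothetical input sequence, carrying h and c forward.
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, c = lstm_step(x_t, h, c, params)

y = softmax(h)  # predicted distribution for the last time step
print(h)
print(y)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every component of h_{t} stays strictly between -1 and 1, and the softmax output sums to 1.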

Let’s look at an LSTM block at any time step t.

With the help of all the equations mentioned above, we can easily understand the above block, or even draw the block diagram ourselves.

If anyone has suggestions, please comment below, or if the post helped you, give it a clap!