Recurrent Neural Networks

Summary

Recurrent Neural Networks (RNNs) are a class of deep learning algorithms designed to handle sequential data by incorporating the output of previous steps into the current input, making them particularly suited for tasks like time series analysis and natural language processing.

Abstract

Recurrent Neural Networks (RNNs) are an extension of neural networks that are specifically tailored for sequential data, where the output at any given time is influenced by the previous outputs. This characteristic makes them analogous to Autoregressive Moving Average (ARMA) models in time series analysis, which use past values and errors to predict future outcomes. However, unlike ARMA models, RNNs continuously update their parameters through a loop cycle, leveraging an optimization strategy based on Gradient Descent. This self-tuning capability is a significant advantage over traditional statistical models like ARMA, which require human intervention for parameter adjustment after the initial Maximum Likelihood Estimation (MLE) optimization.

Neural Networks are the typical algorithms employed in Deep Learning tasks. In my previous post, I discussed the basic structure of NNs and the elements (parameters, hyperparameters, and strategies) that one should know before building a Deep Learning model.

Here, I’m going to dive deeper into RNNs.

The idea behind RNNs is that, given a variable and its corresponding target (the value we want to predict), the output of today might affect the output of tomorrow. For those of you who are familiar with time series analysis, this might sound very similar to Autoregressive Moving Average (ARMA) models. Indeed, ARMA models take into account p past values of the variable (together with q past error terms, if the Moving Average component is included) to predict future outcomes of that variable:
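In the usual textbook notation (assumed here: a constant c and a white-noise error term ε), an ARMA(p, q) model can be written as:

$$
X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
$$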

Where φ and θ are the parameters of the model.
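To make this concrete, here is a minimal sketch of a one-step-ahead ARMA forecast: a constant plus a weighted sum of the p most recent values (weights φ) and the q most recent errors (weights θ). The function name `arma_predict`, the constant `c`, and all the numbers are made up for illustration; they are not taken from a fitted model.

```python
def arma_predict(past_values, past_errors, phi, theta, c=0.0):
    """One-step-ahead ARMA(p, q) forecast.

    past_values[0] is the most recent observation, past_errors[0] the most
    recent error term; phi and theta hold the p and q coefficients.
    """
    p, q = len(phi), len(theta)
    ar_part = sum(phi[i] * past_values[i] for i in range(p))    # autoregressive term
    ma_part = sum(theta[j] * past_errors[j] for j in range(q))  # moving-average term
    return c + ar_part + ma_part


# ARMA(2, 1) with illustrative coefficients: 0.6*1.0 + 0.2*0.5 + 0.5*0.2
forecast = arma_predict(
    past_values=[1.0, 0.5],
    past_errors=[0.2],
    phi=[0.6, 0.2],
    theta=[0.5],
)
print(forecast)  # approximately 0.8
```

Note that the coefficients are passed in as fixed numbers: in this sketch nothing updates them once they have been estimated.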

RNNs work on the same basic idea. However, being a Deep Learning algorithm, an RNN proceeds through a loop, updating and optimizing its parameters on a rolling basis.

Let’s consider the following procedure:

So here we have a first input at time t-1, one hidden layer and one output. However, the output at each time step depends on the output of the previous period. How? Thanks to the dependency among hidden layers: the hidden layer’s state at time t keeps a trace of the hidden layer’s state at time t-1. Hence, the hidden layer at time t is calculated as follows:
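One common way to write this recurrence (the symbols are assumed notation: x_t is the input at time t, h_t the hidden state, W_x and W_h weight matrices, b a bias vector) is:

$$
h_t = f\left(W_x x_t + W_h h_{t-1} + b\right)
$$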

Where the function f() is one of the typical activation functions (Tanh, ReLU and so forth).
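The recurrence can be sketched in a few lines of NumPy. This is only an illustration under assumptions of my own: the names `rnn_forward`, `W_x`, `W_h` and `b`, the zero initial state, the layer sizes and the random weights are all hypothetical, and tanh is used as f().

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b, f=np.tanh):
    """Apply h_t = f(W_x x_t + W_h h_{t-1} + b) along a sequence."""
    h = np.zeros(W_h.shape[0])  # assumed initial hidden state: all zeros
    states = []
    for x_t in inputs:
        # The new state mixes the current input with the previous state.
        h = f(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)  # shape: (time steps, hidden size)


rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
inputs = rng.normal(size=(steps, input_size))

states = rnn_forward(inputs, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

The same weights are reused at every time step; only the hidden state `h` changes as the sequence is read.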

So, if we change the representation of the previous chain, we obtain something like this:

Again, the key difference between RNNs and ARMA models is that the former, like any other ML or DL algorithm, tune their parameters on their own, through an optimization strategy based on Gradient Descent. ARMA models, on the other hand, being statistical models, are fitted through an optimization procedure (Maximum Likelihood Estimation, MLE), but this optimization occurs once, at the very beginning. After that the model can’t update itself, and human intervention is needed.
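To give a feeling for what tuning through Gradient Descent means, here is a toy sketch with a scalar RNN that has only two weights, `w` (recurrent) and `u` (input). Everything in it is an assumption made for illustration: the tiny sequence, the target, the squared-error loss on the last hidden state, the learning rate and the number of steps. The gradients are computed by walking the recurrence backwards (backpropagation through time).

```python
import math


def forward(xs, w, u):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), starting from h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs


def loss_and_grads(xs, y, w, u):
    """Squared error on the last state, plus its gradients w.r.t. w and u."""
    hs = forward(xs, w, u)
    loss = 0.5 * (hs[-1] - y) ** 2
    grad_w = grad_u = 0.0
    dh = hs[-1] - y  # derivative of the loss w.r.t. the last hidden state
    for t in range(len(xs), 0, -1):
        da = dh * (1.0 - hs[t] ** 2)  # back through tanh
        grad_w += da * hs[t - 1]
        grad_u += da * xs[t - 1]
        dh = da * w  # pass the gradient on to the previous hidden state
    return loss, grad_w, grad_u


xs, y = [0.5, -0.1, 0.3], 0.4  # made-up input sequence and target
w, u, lr = 0.1, 0.1, 0.1       # made-up starting weights and learning rate

initial_loss, _, _ = loss_and_grads(xs, y, w, u)
for _ in range(500):
    _, grad_w, grad_u = loss_and_grads(xs, y, w, u)
    w -= lr * grad_w  # gradient descent update
    u -= lr * grad_u
final_loss, _, _ = loss_and_grads(xs, y, w, u)

print(initial_loss, final_loss)
```

Each pass through the loop nudges the weights in the direction that reduces the error, which is the automatic tuning described above.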

Anyway, time series analysis is neither the only application of RNNs nor the most popular one. Among the many others, you can think about language modeling and prediction, within the field of Natural Language Processing (NLP). NLP is the field of Artificial Intelligence concerned with interpreting data provided in the form of human language.

If you think about how your brain works while reading some pages of a book, you will easily see how earlier words, sentences, and pages affect the meaning of the ones that follow and, since your brain knows that, it keeps them in mind while trying to extrapolate a general meaning.

That’s why RNNs are powerful tools for a similar task: while processing words and sentences (once vectorized), they use the output of one processing step as an input to the next, since the desired output (let’s say, the comprehension of a whole page, so that the model can elaborate an automatic response) cannot be extrapolated from single words considered independently.
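A small sketch can show why carrying the state forward matters for language. Two sentences made of exactly the same words in a different order look identical if we just add up their word vectors, but an RNN reads them one word at a time, so the final hidden state depends on the order. The three-word vocabulary, the one-hot vectorization, the random untrained weights and the names `one_hot` and `encode` are all assumptions made for this illustration.

```python
import numpy as np

vocab = {"dog": 0, "bites": 1, "man": 2}  # toy vocabulary


def one_hot(word):
    """Vectorize a word as a one-hot vector over the toy vocabulary."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v


rng = np.random.default_rng(42)
hidden_size = 4
W_x = rng.normal(size=(hidden_size, len(vocab)))  # random, untrained weights
W_h = rng.normal(size=(hidden_size, hidden_size))


def encode(sentence):
    """Read the sentence word by word and return the final hidden state."""
    h = np.zeros(hidden_size)
    for word in sentence.split():
        h = np.tanh(W_x @ one_hot(word) + W_h @ h)
    return h


s1, s2 = "dog bites man", "man bites dog"

# Adding up word vectors ignores word order, so both sentences look the same.
bag1 = sum(one_hot(w) for w in s1.split())
bag2 = sum(one_hot(w) for w in s2.split())
print(np.array_equal(bag1, bag2))  # True

# The recurrent encoding keeps track of the order in which words arrived.
h1, h2 = encode(s1), encode(s2)
print(np.allclose(h1, h2))
```

With trained weights rather than random ones, this order-aware state is what lets the network build up a representation of a whole sentence or page.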