# Multivariate Time Series using Gated Recurrent Unit (GRU)

*In this post, we will look at a variation of RNN called the Gated Recurrent Unit (GRU): why we need GRU, how it works, the differences between LSTM and GRU, and finally a wrap-up example that uses both LSTM and GRU.*

*Prerequisites*

*Optional read*

Multivariate-time-series-using-RNN-with-keras

*What is Gated Recurrent Unit- GRU?*

- GRU is an improved variant of the Recurrent Neural Network (RNN)
- It addresses the vanishing gradient problem of RNNs
- GRU is capable of learning long-term dependencies

RNNs are neural networks with loops that help persist information. RNNs suffer from either the exploding gradient or the vanishing gradient problem.

*What is Exploding and Vanishing gradients?*

The gradients of a neural network are calculated during backpropagation.

With deeper layers in an RNN and weights shared across RNN cells, we sum up the gradients at each time step. As gradients go through **continuous matrix multiplication due to the chain rule**, they either **shrink exponentially to very small values, called vanishing gradients**, or they **blow up to very large values, referred to as exploding gradients**.
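The repeated multiplication can be illustrated with a toy sketch (not actual RNN backpropagation; a scaled identity matrix stands in for the recurrent Jacobian so the effect is easy to see):

```python
import numpy as np

def backprop_norm(w, steps=50):
    # Each backward step through time multiplies the gradient by the
    # same recurrent weight matrix (chain rule). A scaled identity
    # makes the exponential growth/decay obvious.
    W = w * np.eye(4)
    grad = np.ones(4)
    for _ in range(steps):
        grad = W.T @ grad
    return float(np.linalg.norm(grad))

vanishing = backprop_norm(0.5)   # |w| < 1: gradient shrinks exponentially
exploding = backprop_norm(1.5)   # |w| > 1: gradient grows exponentially
print(vanishing, exploding)
```

After 50 steps the gradient norm has either collapsed toward zero or grown by many orders of magnitude, which is exactly why long sequences are hard for a plain RNN.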

*How can we resolve the problem of Vanishing or Exploding gradients?*

Exploding gradients can be resolved using **gradient clipping**. In gradient clipping, we set a pre-determined threshold, and when the gradient norm exceeds this threshold we scale the gradient back down to the threshold.
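A minimal sketch of norm-based clipping (the threshold value here is arbitrary, chosen just for illustration):

```python
import numpy as np

def clip_by_norm(grads, threshold):
    # If the overall gradient norm exceeds the threshold, rescale the
    # gradient vector so its norm equals the threshold; the direction
    # is preserved, only the magnitude changes.
    norm = np.linalg.norm(grads)
    if norm > threshold:
        return grads * (threshold / norm)
    return grads

g = np.array([30.0, 40.0])        # norm = 50
clipped = clip_by_norm(g, 5.0)    # rescaled to norm 5, same direction
print(clipped)
```

In Keras you do not need to write this yourself: optimizers accept a `clipnorm` argument (e.g. `Adam(clipnorm=1.0)`) that applies the same idea per gradient.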

Vanishing gradients are addressed by either Long Short Term Memory (LSTM) or the Gated Recurrent Unit (GRU).

We will discuss GRU here.

*How does GRU address the Vanishing Gradient problem?*

For this, we need to first understand how GRU works.

**GRU, like LSTM, is capable of learning long-term dependencies.**

**GRU and LSTM both have a gating mechanism to regulate the flow of information**, such as remembering the context over multiple time steps. They keep track of which information from the past can be kept and which can be forgotten. To achieve this, **GRU uses an Update gate and a Reset gate**.

*What is the functionality of the Update and Reset gates in remembering long-term dependencies?*

**Update Gate**

- Decides how much of the previous memory to keep around: what to keep and what to throw away
- Decides how much of the state to update
- Has a value between 0 and 1
- If the value of the update gate is close to 0, we remember the previous state
- If the value of the update gate is 1 or close to 1, we forget the previous state and store the new value
- The update gate acts similar to the input and forget gates of LSTM

**Reset Gate (also known as the Relevance Gate)**

- Reset gate decides how much information to forget
- Allows the model to drop information that is irrelevant for the future
- Determines how much of previous memory to keep around

Let’s go step by step and understand how GRU works.

**Step 1: Drop information that is irrelevant for the future**

**The Reset gate takes the input xt and the previous hidden state ht-1 and applies a sigmoid activation function.**

The Reset gate determines whether the current state will take in the new information or still carry the previous information.

If the Reset gate has a value close to 0, the previous hidden state is ignored. This means the previous information is irrelevant, so we drop it and store the new information.

**Step 2: How much of the previous memory to store**

**The Update gate takes the input xt and the previous hidden state ht-1 and applies a sigmoid activation function.**

The Update gate determines how much of the previous memory to keep around; it decides what to keep and what to throw away.

If the value of the update gate is close to 0, we keep the previous state.

**Step 3: Final memory to be stored**

When the Reset gate **rt** is close to 0, the previous hidden state is ignored and reset with the current input **xt** only.

The hidden state drops any information that is found to be irrelevant for the future, giving a more compact representation.

Update gate controls how much information from the previous hidden state will carry over to the current hidden state.

If the value of the update gate is close to 0, we keep the previous hidden state. If it is 1 or close to 1, we forget the previous hidden state and store the new value.
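The three steps above can be sketched as one GRU time step in NumPy (the standard GRU formulation; the weights here are hypothetical random values, not trained ones):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    """One GRU time step with reset gate r and update gate z."""
    Wr, Ur, br, Wz, Uz, bz, Wh, Uh, bh = params
    r = sigmoid(x @ Wr + h_prev @ Ur + br)             # Step 1: reset gate
    z = sigmoid(x @ Wz + h_prev @ Uz + bz)             # Step 2: update gate
    h_cand = np.tanh(x @ Wh + (r * h_prev) @ Uh + bh)  # candidate state
    return (1 - z) * h_prev + z * h_cand               # Step 3: final memory

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
# nine parameter tensors: (Wr, Ur, br), (Wz, Uz, bz), (Wh, Uh, bh)
params = [rng.standard_normal(s) * 0.1 for s in
          [(n_in, n_hid), (n_hid, n_hid), (n_hid,)] * 3]
h = np.zeros(n_hid)
h = gru_step(rng.standard_normal(n_in), h, params)
print(h.shape)
```

Note the last line: when z is close to 0, `h_prev` passes through almost unchanged, which is the near-identity path that lets gradients survive over many time steps.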

Because GRU has separate reset and update gates, each unit can learn to capture dependencies over different time scales. Units that learn to capture **short-term dependencies tend to have reset gates that are frequently active**, while units that capture **longer-term dependencies have update gates that are most active**.

*Now that we have understood how GRU works, let’s revisit our question of how GRU solves the vanishing gradient issue.*

In the vanishing gradient problem, gradients become very small or zero, so early time steps stop contributing to learning.

The gating mechanism in GRU and LSTM helps resolve the vanishing gradient issue. Shutting the update gate essentially skips layers when calculating the gradient. The gates hold information in memory as long as required and update it with new information only when necessary.

Because the combination of gates either passes or blocks information, the network can preserve gradients no matter how deep the network or how long the input sequence is.

Intuitively, the error becomes additive instead of multiplicative, and hence it is easier to keep in a reasonable range. The forget gate in LSTM and the update gate in GRU help with long-term dependencies.

*Let’s understand the commonalities and differences between LSTM and GRU.*

## The commonality between LSTM and GRU

- Both LSTM and GRU have update units with an additive component from t to t + 1, which is lacking in the traditional RNN.
- Both **keep the existing content and add the new content on top of it**.
- The update gate in GRU and the forget gate in LSTM take a **linear sum between the existing state and the newly computed state**.
- Both **address the vanishing and exploding gradient issues** present in RNNs.

## Differences between LSTM and GRU

- **GRU has two gates, reset and update. LSTM has three gates: input, forget, and output.** GRU does not have an output gate like LSTM; the update gate in GRU does the work of the input and forget gates of LSTM.
- **GRU has fewer parameters**, so it is **computationally more efficient** and needs less data to generalize than LSTM.
- **LSTM maintains an internal memory state c**, while GRU does not have a separate memory cell.
- **GRU has no mechanism to control the degree to which its state or memory content is exposed**; it exposes the whole state each time. LSTM can control how much of its memory content it wants to expose.

*Finally, we wrap up with an example that will use LSTM as well as GRU*

Here I have used the Electric power consumption data set.

Importing required libraries

```
import pandas as pd
import numpy as np
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, GRU
import tensorflow as tf
from datetime import datetime
```

Reading the data set, parsing the dates and inferring the date format. We also fill the NaN values with 0.

```
dataset = read_csv(r"c:\data\power_consumption.csv",
                   parse_dates={'dt': ['Date', 'Time']},
                   infer_datetime_format=True,
                   index_col=0,
                   na_values=['nan', '?'])
dataset.fillna(0, inplace=True)
values = dataset.values
```

```
# ensure all data is float
values = values.astype('float32')
```

Looking at sample data from the dataset

`dataset.head(4)`

As the input features are on different scales, we need to normalize them. We use MinMaxScaler.

```
# normalizing input features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
```

`scaled =pd.DataFrame(scaled)`

Looking at the data after it is normalized

`scaled.head(4)`

We define a function to create the time series data set. We can specify the look back interval and the predicted column

```
def create_ts_data(dataset, lookback=1, predicted_col=1):
    # lagged input features: everything except the last `lookback` rows
    temp = dataset.copy()
    temp["id"] = range(1, len(temp) + 1)
    temp = temp.iloc[:-lookback, :]
    temp.set_index('id', inplace=True)
    # target column shifted forward by `lookback` steps
    predicted_value = dataset.copy()
    predicted_value = predicted_value.iloc[lookback:, predicted_col]
    predicted_value = pd.DataFrame(predicted_value)
    predicted_value.columns = ["Predicted"]
    predicted_value["id"] = range(1, len(predicted_value) + 1)
    predicted_value.set_index('id', inplace=True)
    final_df = pd.concat([temp, predicted_value], axis=1)
    return final_df
```

We now create the time series dataset, looking back one time step.

```
reframed_df= create_ts_data(scaled, 1,0)
reframed_df.fillna(0, inplace=True)
reframed_df.columns = ['var1(t-1)', 'var2(t-1)', 'var3(t-1)', 'var4(t-1)', 'var5(t-1)', 'var6(t-1)', 'var7(t-1)','var1(t)']
```

`print(reframed_df.head(4))`

Splitting the data into train and test sets

```
# split into train and test sets
values = reframed_df.values
training_sample = int(len(dataset) * 0.7)
```

```
train = values[:training_sample, :]
test = values[training_sample:, :]
```

```
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
```

Reshaping the data set to 3D with sample size, lookback time steps, and the input features.

```
# reshape input to be 3D [samples, time steps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
```

`print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)`

We now create the LSTM model with three LSTM layers and one Dense layer. We compile the model using the Adam optimizer. Loss is calculated using mean absolute error (MAE).

```
model_lstm = Sequential()
model_lstm.add(LSTM(75, return_sequences=True, input_shape=(train_X.shape[1], train_X.shape[2])))
model_lstm.add(LSTM(units=30, return_sequences=True))
model_lstm.add(LSTM(units=30))
model_lstm.add(Dense(units=1))
model_lstm.compile(loss='mae', optimizer='adam')
```

Let’s look at the LSTM model summary

`model_lstm.summary()`

Fitting the LSTM model

```
# fit network
history_lstm = model_lstm.fit(train_X, train_y, epochs=10, batch_size=64, validation_data=(test_X, test_y), shuffle=False)
```

We now create the GRU model with layers similar to the LSTM model.

```
model_gru = Sequential()
model_gru.add(GRU(75, return_sequences=True, input_shape=(train_X.shape[1], train_X.shape[2])))
model_gru.add(GRU(units=30, return_sequences=True))
model_gru.add(GRU(units=30))
model_gru.add(Dense(units=1))
```

`model_gru.compile(loss='mae', optimizer='adam')`

Let’s look at the GRU model summary

`model_gru.summary()`

We can see that LSTM and GRU have the same architecture, but the number of parameters in LSTM is 44,971 whereas in GRU it is 33,736. GRU is a simpler model with two gates compared to LSTM’s three. As GRU has fewer parameters, it is computationally more efficient than LSTM.
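The counts above can be reproduced by hand: per layer, LSTM has four weight blocks (input, forget, and output gates plus the candidate) while GRU has three, and each block holds a kernel, a recurrent kernel, and a bias. This sketch assumes the older Keras GRU formulation with a single bias per block (`reset_after=False`); the TF2 default `reset_after=True` adds a second recurrent bias and gives a slightly larger count.

```python
def lstm_params(units, input_dim):
    # 4 blocks x (kernel + recurrent kernel + bias)
    return 4 * (units * (input_dim + units) + units)

def gru_params(units, input_dim):
    # 3 blocks; single bias per block (Keras GRU with reset_after=False)
    return 3 * (units * (input_dim + units) + units)

# stack used in this post: 7 input features -> 75 -> 30 -> 30 -> Dense(1)
dense = 30 * 1 + 1
lstm_total = lstm_params(75, 7) + lstm_params(30, 75) + lstm_params(30, 30) + dense
gru_total = gru_params(75, 7) + gru_params(30, 75) + gru_params(30, 30) + dense
print(lstm_total, gru_total)  # 44971 33736
```

The 3:4 ratio of gate blocks is where GRU’s parameter savings come from.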

Fitting the GRU model

```
# fit network
gru_history = model_gru.fit(train_X, train_y, epochs=10, batch_size=64, validation_data=(test_X, test_y), shuffle=False)
```

To understand how the loss varied across LSTM and GRU, we plot the losses.

```
pyplot.plot(history_lstm.history['loss'], label='LSTM train', color='red')
pyplot.plot(history_lstm.history['val_loss'], label='LSTM test', color= 'green')
```

```
pyplot.plot(gru_history.history['loss'], label='GRU train', color='brown')
pyplot.plot(gru_history.history['val_loss'], label='GRU test', color='blue')
```

```
pyplot.legend()
pyplot.show()
```

**What did I learn while creating the model?**

Bad data with null values caused the accuracy and loss to become NaN. To resolve this, ensure you do not have any nulls in the data before training.
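A quick pandas check catches this before training (the tiny frame here is a hypothetical stand-in for the power-consumption data):

```python
import numpy as np
import pandas as pd

# stand-in frame with deliberate gaps
df = pd.DataFrame({"power": [1.2, np.nan, 3.4],
                   "voltage": [230.0, 231.0, np.nan]})

print(df.isna().sum())   # per-column null counts: inspect before training
df = df.fillna(0)        # or df.interpolate() / df.dropna(), as appropriate
print(df.isna().sum().sum())  # 0 -> safe to feed to the model
```

Filling with 0 is the simplest option (and what this post does); interpolation usually distorts a power-consumption series less.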

## References:

- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
- Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation