

Abstract

Implementation: The code is implemented using TensorFlow as shown below:<div id="e17b"><pre>import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn
import random
import collections
import time

start_time = time.time()

def elapsed(sec):
    if sec < 60:
        return str(sec) + " sec"
    elif sec < (60*60):
        return str(sec/60) + " min"
    else:
        return str(sec/(60*60)) + " hr"


# Target log path
logs_path = '/tmp/tensorflow/rnn_words'
writer = tf.summary.FileWriter(logs_path)

# Text file containing words for training
training_file = 'Story.txt'

def read_data(fname):
    with open(fname) as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    content = [content[i].split() for i in range(len(content))]
    content = np.array(content)
    content = np.reshape(content, [-1, ])
    return content

training_data = read_data(training_file)
print("Loaded training data...")

def build_dataset(words):
    count = collections.Counter(words).most_common()
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

dictionary, reverse_dictionary = build_dataset(training_data)
vocab_size = len(dictionary)

# Parameters
learning_rate = 0.001
training_iters = 50000
display_step = 1000
n_input = 3

# number of units in RNN cell
n_hidden = 512

# tf Graph input
x = tf.placeholder("float", [None, n_input, 1])
y = tf.placeholder("float", [None, vocab_size])

# RNN output node weights and biases
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
}
biases = {
    'out': tf.Variable(tf.random_normal([vocab_size]))
}

def RNN(x, weights, biases):

    # reshape to [1, n_input]
    x = tf.reshape(x, [-1, n_input])

    # Generate a n_input-element sequence of inputs
    # (eg. [had] [a] [general] -> [20] [6] [33])
    x = tf.split(x,n_input,1)

    <span class="hljs-comment"># 2-layer LSTM, each layer has n_hidden units.</span>
    <span class="hljs-comment"># Average Accuracy= 95.20% at 50k iter</span>
    rnn_cell = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden),rnn.BasicLSTMCell(n_hidden)])

    <span class="hljs-comment"># 1-layer LSTM with n_hidden units but with lower accuracy.</span>
    <span class="hljs-comment"># Average Accuracy= 90.60% 50k iter</span>
    <span class="hljs-comment"># Uncomment line below to test but comment out the 2-layer rnn.MultiRNNCell above</span>
    <span class="hljs-comment"># rnn_cell = rnn.BasicLSTMCell(n_hidden)</span>

    <span class="hljs-comment"># generate prediction</span>
    outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)

    <span class="hljs-comment"># there are n_input outputs but</span>
    <span class="hljs-comment"># we only want the last output</span>
    <span class="hljs-keyword">return</span> tf.matmul(outputs[-<span class="hljs-number">1</span>], weights[<span class="hljs-string">'out'</span>]) + biases[<span class="hljs-string">'out'</span>]

pred = RNN(x, weights, biases)

# Loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)

# Model evaluation
correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()

# Launch the graph.
# The listing as extracted opened the session with `with tf.Session() as session:`,
# but the indentation of that block was lost. An InteractiveSession is used here
# instead: it stays open for the top-level loops below and registers itself as the
# default session that the later .eval() calls rely on.
session = tf.InteractiveSession()
session.run(init)
step = 0
offset = random.randint(0,n_input+1)
end_offset = n_input + 1
acc_total = 0
loss_total = 0

writer.add_graph(session.graph)

<span class="hljs-keyword">while</span> step &lt; training_iters:
    <span class="hljs-comment"># Generate a minibatch. Add some randomness on selection process.</span>
    <span class="hljs-keyword">if</span> offset &gt; (<span class="hljs-built_in">len</span>(training_data)-end_offset):
        offset = random.randint(<span class="hljs-number">0</span>, n_input+<span class="hljs-number">1</span>)

    symbols_in_keys = [ [dictionary[ <span class="hljs-built_in">str</span>(training_data[i])]] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(offset, offset+n_input) ]
    symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-<span class="hljs-number">1</span>, n_input, <span class="hljs-number">1</span>])

    symbols_out_onehot = np.zeros([vocab_size], dtype=<span class="hljs-built_in">float</span>)
    symbols_out_onehot[dictionary[<span class="hljs-built_in">str</span>(training_data[offset+n_input])]] = <span class="hljs-number">1.0</span>
    symbols_out_onehot = np.reshape(symbols_out_onehot,[<span class="hljs-number">1</span>,-<span class="hljs-number">1</span>])

    _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                            feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
    loss_total += loss
    acc_total += acc
    <span class="hljs-keyword">if</span> (step+<span class="hljs-number">1</span>) % display_step == <span class="hljs-number">0</span>:
        <span class="hljs-built_in">print</span>(<span class="hljs-string">"Iter= "</span> + <span class="hljs-built_in">str</span>(step+<span class="hljs-number">1</span>) + <span class="hljs-string">", Average Loss= "</span> + \
              <span class="hljs-string">"{:.6f}"</span>.<span class="hljs-built_in">format</span>(loss_total/display_step) + <span class="hljs-string">", Average Accuracy= "</span> + \
              <span class="hljs-string">"{:.2f}%"</span>.<span class="hljs-built_in">format</span>(<span class="hljs-number">100</span>*acc_total/display_step))
        acc_total = <span class="hljs-number">0</span>
        loss_total = <span class="hljs-number">0</span>
        symbols_in = [training_data[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(offset, offset + n_input)]
        symbols_out = training_data[offset + n_input]
        symbols_out_pred = reverse_dictionary[<span class="hljs-built_in">int</span>(tf.argmax(onehot_pred, <span class="hljs-number">1</span>).<span class="hljs-built_in">eval</span>())]
        <span class="hljs-built_in">print</span>(<span class="hljs-string">"%s - [%s] vs [%s]"</span> % (symbols_in,symbols_out,symbols_out_pred))
    step += <span class="hljs-number">1</span>
    offset += (n_input+<span class="hljs-number">1</span>)
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Optimization Finished!"</span>)
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Elapsed time: "</span>, elapsed(time.time() - start_time))
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Run on command line."</span>)
<span class="hljs-built_in">print</span>(<span class="hljs-string">"\ttensorboard --logdir=%s"</span> % (logs_path))
<span class="hljs-built_in">print</span>(<span class="hljs-string">"Point your web browser to: http://localhost:6006/"</span>)
<span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:
    prompt = <span class="hljs-string">"%s words: "</span> % n_input
    sentence = <span class="hljs-built_in">input</span>(prompt)
    sentence = sentence.strip()
    words = sentence.split(<span class="hljs-string">' '</span>)
    <span class="hljs-keyword">if</span> <span class="hljs-built_in">len</span>(words) != n_input:
        <span class="hljs-keyword">continue</span>
    <span class="hljs-keyword">try</span>:
        symbols_in_keys = [dictionary[<span class="hljs-built_in">str</span>(words[i])] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-built_in">len</span>(words))]
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> <span class="hljs-built_in">range</span>(<span class="hljs-number">32</span>):
            keys = np.reshape(np.array(symbols_in_keys), [-<span class="hljs-number">1</span>, n_input, <span class="hljs-number">1</span>])
            onehot_pred = session.run(pred, feed_dict={x: keys})
            onehot_pred_index = <span class="hljs-built_in">int</span>(tf.argmax(onehot_pred, <span class="hljs-number">1</span>).<span class="hljs-built_in">eval</span>())
            sentence = <span class="hljs-string">"%s %s"</span> % (sentence,reverse_dictionary[onehot_pred_index])
            symbols_in_keys = symbols_in_keys[<span class="hljs-number">1</span>:]
            symbols_in_keys.append(onehot_pred_index)
        <span class="hljs-built_in">print</span>(sentence)
    <span class="hljs-keyword">except</span>:
        <span class="hljs-built_in">print</span>(<span class="hljs-string">"Word not in dictionary"</span>)</pre></div><p id="6ff0"><i>This brings us to the end of our article on “Recurrent Neural Networks”. I hope you found this article informative and added value to your knowledge.</i></p><p id="6864">If you wish to check out more articles on the market’s most trending technologies like Artificial Intelligence, DevOps, Ethical Hacking, then you can refer to <a href="https://www.edureka.co/blog/?utm_source=medium&amp;utm_medium=content-link&amp;utm_campaign=recurrent-neural-networks">Edureka’s official site.</a></p><p id="27e8">Do look out for other articles in this series which will explain the various other aspects of Deep Learning.</p><blockquote id="1554"><p>1.<a href="https://readmedium.com/tensorflow-tutorial-ba142ae96bca"> TensorFlow Tutorial</a></p></blockquote><blockquote id="ff73"><p>2.<a href="https://readmedium.com/pytorch-tutorial-9971d66f6893"> PyTorch Tutorial</a></p></blockquote><blockquote id="309d"><p>3. <a href="https://readmedium.com/perceptron-learning-algorithm-d30e8b99b156">Perceptron learning Algorithm</a></p></blockquote><blockquote id="1fd3"><p>4. <a href="https://readmedium.com/neural-network-tutorial-2a46b22394c9">Neural Network Tutorial</a></p></blockquote><blockquote id="a136"><p>5.<a href="https://readmedium.com/backpropagation-bd2cf8fdde81"> What is Backpropagation?</a></p></blockquote><blockquote id="7374"><p>6. <a href="https://readmedium.com/convolutional-neural-network-3f2c5b9c4778">Convolutional Neural Networks</a></p></blockquote><blockquote id="fcc8"><p>7.<a href="https://readmedium.com/capsule-networks-d7acd437c9e"> Capsule Neural Networks</a></p></blockquote><blockquote id="7120"><p>8.<a href="https://readmedium.com/recurrent-neural-networks-df945afd7441"> </a><a href="https://readmedium.com/tensorflow-object-detection-tutorial-8d6942e73adc">Object Detection in TensorFlow</a></p></blockquote><blockquote id="6efc"><p>9. 
<a href="https://readmedium.com/autoencoders-tutorial-cfdcebdefe37">Autoencoders Tutorial</a></p></blockquote><blockquote id="756a"><p>10. <a href="https://readmedium.com/restricted-boltzmann-machine-tutorial-991ae688c154">Restricted Boltzmann Machine Tutorial</a></p></blockquote><blockquote id="5035"><p>11. <a href="https://readmedium.com/pytorch-vs-tensorflow-252fc6675dd7">PyTorch vs TensorFlow</a></p></blockquote><blockquote id="0375"><p>12. <a href="https://readmedium.com/deep-learning-with-python-2adbf6e9437d">Deep Learning With Python</a></p></blockquote><blockquote id="2fb0"><p>13. <a href="https://readmedium.com/artificial-intelligence-tutorial-4257c66f5bb1">Artificial Intelligence Tutorial</a></p></blockquote><blockquote id="e24f"><p>14. <a href="https://readmedium.com/tensorflow-image-classification-19b63b7bfd95">TensorFlow Image Classification</a></p></blockquote><blockquote id="2a2c"><p>15. <a href="https://readmedium.com/artificial-intelligence-applications-7b93b91150e3">Artificial Intelligence Applications</a></p></blockquote><blockquote id="6e6f"><p>16. <a href="https://readmedium.com/become-artificial-intelligence-engineer-5ac2ede99907">How to Become an Artificial Intelligence Engineer?</a></p></blockquote><blockquote id="05c4"><p>17. <a href="https://readmedium.com/q-learning-592524c3ecfc">Q Learning</a></p></blockquote><blockquote id="e21a"><p>18. <a href="https://readmedium.com/apriori-algorithm-d7cc648d4f1e">Apriori Algorithm</a></p></blockquote><blockquote id="5969"><p>19. <a href="https://readmedium.com/introduction-to-markov-chains-c6cb4bcd5723">Markov Chains With Python</a></p></blockquote><blockquote id="bd13"><p>20. <a href="https://readmedium.com/artificial-intelligence-algorithms-fad283a0d8e2">Artificial Intelligence Algorithms</a></p></blockquote><blockquote id="a686"><p>21. <a href="https://readmedium.com/best-laptop-for-machine-learning-a4a5f8ba5b">Best Laptops for Machine Learning</a></p></blockquote><blockquote id="a504"><p>22. 
<a href="https://readmedium.com/top-artificial-intelligence-tools-36418e47bf2a">Top 12 Artificial Intelligence Tools</a></p></blockquote><blockquote id="f5ca"><p>23. <a href="https://readmedium.com/artificial-intelligence-interview-questions-872d85387b19">Artificial Intelligence (AI) Interview Questions</a></p></blockquote><blockquote id="2e6e"><p>24. <a href="https://readmedium.com/theano-vs-tensorflow-15f30216b3bc">Theano vs TensorFlow</a></p></blockquote><blockquote id="3153"><p>25. <a href="https://readmedium.com/what-is-a-neural-network-56ae7338b92d">What Is A Neural Network?</a></p></blockquote><blockquote id="a537"><p>26. <a href="https://readmedium.com/pattern-recognition-5e2d30ab68b9">Pattern Recognition</a></p></blockquote><blockquote id="d34d"><p>27. <a href="https://readmedium.com/alpha-beta-pruning-in-ai-b47ee5500f9a">Alpha Beta Pruning in Artificial Intelligence</a></p></blockquote><p id="f766"><i>Originally published at <a href="https://www.edureka.co/blog/recurrent-neural-networks/">www.edureka.co</a> on November 28, 2018.</i></p></article></body>

Recurrent Neural Networks (RNN) Tutorial — Analyzing Sequential Data Using TensorFlow In Python

In this article, let us discuss the concepts behind the working of Recurrent Neural Networks. Recurrent Neural Networks have wide applications in image and video recognition, music composition and machine translation.

We will be checking out the following concepts:

Why Not Feed-forward Networks?
What Are Recurrent Neural Networks?
How To Train Recurrent Neural Networks?
Vanishing And Exploding Gradients
Long Short Term Memory (LSTM) Networks
LSTM Use-Case

Why Not Feedforward Networks?

Consider an image classification use-case where you have trained the neural network to classify images of various animals.

So, if you feed in an image of a cat or a dog, the network outputs the corresponding label, cat or dog respectively.

Consider the following diagram:

Here, the output for the elephant image is not influenced by the previous output, which was a dog. This means that the output at time ‘t’ is independent of the output at time ‘t-1’.

Consider this scenario where you will require the use of the previously obtained output:

The concept is similar to reading a book. With every page you turn, you need an understanding of the previous pages to make complete sense of what comes next, in most cases.

With a feed-forward network, the new output at time ‘t+1’ has no relation to the outputs at time t, t-1 or t-2.

So, feed-forward networks cannot be used to predict a word in a sentence, because the output would have no relation to the previous set of words.

But, with Recurrent Neural Networks, this challenge can be overcome.

Consider the following diagram:

In the above diagram, we have certain inputs at ‘t-1’ which are fed into the network. These inputs lead to corresponding outputs at time ‘t-1’ as well.

At the next timestamp, information from the previous input ‘t-1’ is provided along with the input at ‘t’ to eventually provide the output at ‘t’ as well.

This process repeats, ensuring that each new input can make use of the information obtained at the previous timestamp.

Next up in this Recurrent Neural Networks article, we need to check out what Recurrent Neural Networks (RNNs) actually are.

What Are Recurrent Neural Networks?

Recurrent Networks are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, and numerical time series data emanating from sensors, stock markets and government agencies.

For better clarity, consider the following analogy:

You go to the gym regularly and the trainer has given you the following schedule for your workout:

Note that all these exercises are repeated in a proper order every week. First, let us use a feed-forward network to try and predict the type of exercise.

The inputs are day, month, and health status. A neural network has to be trained using these inputs to provide us with the prediction of the exercises.

However, this will not be very accurate, because these inputs say nothing about the order in which the exercises follow each other. To fix this, we can make use of the concept of Recurrent Neural Networks as shown below:

In this case, consider the inputs to be the workout done on the previous day.

So if you did a shoulder workout yesterday, you can do a bicep exercise today and this goes on for the rest of the week as well.

However, if you happen to miss a day at the gym, the data from the previously attended timestamp can be considered as shown below:

If a model is trained on the data from the previous exercises, its predictions can be far more accurate.

To sum it up, let us convert the data we have into vectors. Well, what are vectors?

Vectors are lists of numbers that are fed to the model. Here, each number denotes whether you have done a particular exercise or not.

So, if you did a shoulder exercise, the corresponding node will be ‘1’ and the rest of the exercise nodes will be mapped to ‘0’.
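To make the idea of these vectors concrete, here is a minimal sketch of such an encoding (often called one-hot encoding). The exercise list is an assumption made for illustration; only the shoulder and bicep workouts are named in this article.

```python
# Hypothetical exercise list; "cardio" is an illustrative assumption.
EXERCISES = ["shoulder", "bicep", "cardio"]

def one_hot(exercise, exercises=EXERCISES):
    """Return a vector with 1 at the position of `exercise` and 0 elsewhere."""
    return [1 if name == exercise else 0 for name in exercises]

shoulder_vector = one_hot("shoulder")
bicep_vector = one_hot("bicep")
print(shoulder_vector)
print(bicep_vector)
```

Each exercise gets its own position in the vector, so the model never confuses one workout with another.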

Let us check out the math behind the working of the neural network. Consider the following diagram:

Consider ‘w’ to be the weight matrix and ‘b’ to be the bias:

At time t=0, the input is ‘x0’ and the task is to figure out ‘h0’. Substituting t=0 into the equation gives the value of h(t). Next, the value of ‘y0’ is found by applying the output formula to the previously calculated ‘h0’.

This process is repeated for every timestamp in the sequence while training the model.
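The equations themselves are not reproduced in the text, so the sketch below assumes a commonly used form of the recurrence: h(t) = tanh(w_h * h(t-1) + w_x * x(t) + b_h) and y(t) = w_y * h(t) + b_y. It uses single numbers instead of matrices, and all weight values are made up for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b_h, w_y, b_y):
    """One timestamp of a scalar RNN (assumed form, not taken from the article)."""
    h_t = math.tanh(w_h * h_prev + w_x * x_t + b_h)  # new hidden state
    y_t = w_y * h_t + b_y                            # output at this timestamp
    return h_t, y_t

def run_rnn(inputs, h_init=0.0, **params):
    """Apply rnn_step to every input in order, carrying the hidden state along."""
    h = h_init
    outputs = []
    for x_t in inputs:
        h, y = rnn_step(x_t, h, **params)
        outputs.append(y)
    return outputs, h

# Illustrative parameter values (assumptions).
params = dict(w_x=0.5, w_h=0.8, b_h=0.0, w_y=1.0, b_y=0.0)
outputs, last_h = run_rnn([1.0, 0.0, 0.0], **params)
print(outputs)
```

Even though the second and third inputs are zero, the outputs are not: the hidden state carries information forward from the first timestamp.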

So, how are Recurrent Neural Networks trained?

Training Recurrent Neural Networks

Recurrent Neural Networks use a backpropagation algorithm for training, but it is applied at every timestamp. It is commonly known as Back-propagation Through Time (BPTT).

There are some issues with Back-propagation such as:

Vanishing Gradient
Exploding Gradient

Let us consider each of these to understand what is going on.

Vanishing Gradient

When making use of back-propagation, the goal is to calculate the error, which is found by taking the difference between the actual output and the model output and squaring it.

Consider the following diagram:

With the error calculated, the change in the error with respect to the change in the weight (the gradient) is calculated. This gradient is then multiplied by the learning rate.

So, the product of the learning rate and the gradient gives the actual change in the weight.

This change in weight is added to the old set of weights for every training iteration as shown in the figure. The issue here is that when the gradient is very small, the change in weight after multiplication is also very small.
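As a rough illustration of this update rule, the sketch below trains a one-weight model, output = w * x, on a single example. The model, the numbers and the function names are assumptions made for illustration. The change in weight carries a negative sign so that adding it reduces the error.

```python
def squared_error(target, output):
    """Error as described above: the difference, raised to the power of 2."""
    return (target - output) ** 2

def weight_update(w, x, target, learning_rate):
    """Return the new weight after one gradient step for output = w * x."""
    output = w * x
    # Derivative of (target - w*x)^2 with respect to w.
    gradient = -2.0 * x * (target - output)
    delta_w = -learning_rate * gradient  # the actual change in the weight
    return w + delta_w                   # added to the old weight

w = 0.0
initial_error = squared_error(2.0, w * 1.0)
for _ in range(50):
    w = weight_update(w, x=1.0, target=2.0, learning_rate=0.1)
final_error = squared_error(2.0, w * 1.0)
print(w, final_error)
```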

Suppose the network has read the sentence “I am going to France” and you want it to complete “I am going to France, the language spoken there is _____”.

Over a large number of steps, the change in the weights becomes negligible, and this leads to the weights effectively not being updated, so the network cannot learn to connect the answer to the earlier word “France”.
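The effect can be imitated with a toy calculation. The sketch below assumes the gradient is scaled by the same factor at every step it is propagated back, which is a deliberate simplification. A factor below 1 shrinks it towards zero, and a factor above 1 makes it blow up, which is the exploding case discussed next.

```python
def backpropagated_gradient(factor, steps, initial=1.0):
    """Multiply a gradient by the same factor once per step (toy model)."""
    gradient = initial
    for _ in range(steps):
        gradient *= factor
    return gradient

vanishing = backpropagated_gradient(0.5, steps=50)  # factor below 1
exploding = backpropagated_gradient(1.5, steps=50)  # factor above 1
print(vanishing)
print(exploding)
```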

Exploding Gradient

The exploding gradient works in a similar way, but here the weights change drastically instead of negligibly. Notice the small change in the diagram below:

We need to overcome both of these and it is a bit of a challenge at first. Consider the following chart:

Continuing this blog on Recurrent Neural Networks, we will now discuss LSTM networks.

Long Short-Term Memory Networks

Long Short-Term Memory networks are usually just called “LSTMs”.

They are a special kind of Recurrent Neural Network, capable of learning long-term dependencies.

What are long-term dependencies?

Many times, only recent data is needed in a model to perform operations. But there might also be a need for data which was obtained further in the past.

Let’s look at the following example:

Consider a language model trying to predict the next word based on the previous ones. Say we are trying to predict the last word in the sentence “The clouds are in the sky”.

The context here is pretty simple and the last word ends up being sky all the time. In such cases, the gap between the past information and the current requirement can be bridged really easily by using Recurrent Neural Networks.

When that gap grows, however, plain Recurrent Neural Networks struggle because of vanishing and exploding gradients. LSTM networks are designed to mitigate this, the vanishing gradient in particular, which lets them handle long-term dependencies far more easily.

LSTM has a chain-like neural network layer. In a standard recurrent neural network, the repeating module consists of one single function as shown in the below figure:

As shown above, there is a tanh function present in the layer. This function is a squashing function. So, what is a squashing function?

It is a function which squashes any input value into a fixed range, here between -1 and +1, so that the values stay bounded as they pass through the network.
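A quick way to see the squashing behaviour is to feed tanh a few very different numbers. This is only a small sketch; the input values are arbitrary.

```python
import math

inputs = [-100.0, -1.0, 0.0, 1.0, 100.0]
squashed = [math.tanh(value) for value in inputs]

for value, result in zip(inputs, squashed):
    print(value, "->", result)
```

However large or small the input, the result never leaves the range -1 to +1.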

Now, let us consider the structure of an LSTM network:

As the image shows, each of the functions in the layers of an LSTM network has its own structure. The cell state is the horizontal line in the figure, and it acts like a conveyor belt carrying data linearly along the chain.

Let us consider a step-by-step approach to understand LSTM networks better.

Step 1:

The first step in the LSTM is to identify the information which is not required and will be thrown away from the cell state. This decision is made by a sigmoid layer called the forget gate layer.

The highlighted layer in the figure above is the sigmoid layer mentioned previously.

The calculation considers the new input and the output of the previous timestamp, and produces a number between 0 and 1 for each number in the cell state.

A value of 1 means keep that part of the cell state completely, while 0 means discard it completely. Values in between keep only part of it.

Consider a language model that tracks the gender of the current subject: it is really important to use the latest and correct gender, so the old one should be forgotten when a new subject appears.
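The forget gate can be sketched as follows, assuming the commonly used form f(t) = sigmoid(w_h * h(t-1) + w_x * x(t) + b). The figure with the exact equation is not reproduced here, so treat this form, the scalar weights and the values as illustrative assumptions.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x_t, w_h, w_x, b):
    """Return f(t): how much of the old cell state to keep (0 = none, 1 = all)."""
    return sigmoid(w_h * h_prev + w_x * x_t + b)

# Illustrative values (assumptions): a strongly negative input weight
# pushes the gate towards 0, i.e. towards forgetting.
f_t = forget_gate(h_prev=0.2, x_t=1.0, w_h=0.5, w_x=-3.0, b=0.0)
old_cell_state = 0.9
kept = f_t * old_cell_state
print(f_t, kept)
```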

Step 2:

The next step is to decide what new information we’re going to store in the cell state. This whole process comprises the following steps:

A sigmoid layer called the “input gate layer” decides which values will be updated.
A tanh layer creates a vector of new candidate values, c̃(t), that could be added to the state.

The input from the previous timestamp and the new input are passed through a sigmoid function which gives the value i(t). This value is later multiplied by the candidate values c̃(t) before being added to the cell state.

In the next step, these two are combined to update the state.
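A minimal sketch of these two layers is given below, assuming the usual forms i(t) = sigmoid(...) and c̃(t) = tanh(...) applied to the previous output and the new input. The scalar weights and the parameter tuples are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gate_and_candidate(h_prev, x_t, gate_params, cand_params):
    """Return (i_t, c_tilde) for one timestamp; each params tuple is (w_h, w_x, b)."""
    w_h, w_x, b = gate_params
    i_t = sigmoid(w_h * h_prev + w_x * x_t + b)        # which values to update
    w_h, w_x, b = cand_params
    c_tilde = math.tanh(w_h * h_prev + w_x * x_t + b)  # candidate values
    return i_t, c_tilde

# Illustrative parameters (assumptions).
i_t, c_tilde = input_gate_and_candidate(
    h_prev=0.2, x_t=1.0,
    gate_params=(0.5, 2.0, 0.0),
    cand_params=(0.5, -1.0, 0.0),
)
print(i_t, c_tilde)
```

The gate value lies between 0 and 1, while the candidate lies between -1 and +1, so the candidate can push the cell state in either direction.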

Step 3:

Now, we will update the old cell state C(t-1) into the new cell state C(t).

First, we multiply the old state C(t-1) by f(t), forgetting the things we decided to leave behind earlier.

Then, we add i(t) * c̃(t). These are the new candidate values, scaled by how much we decided to update each state value.

In the second step, we decided which new data is required at this stage.

In the third step, we actually apply that decision.

In the language example discussed previously, this is where the old gender would be dropped and the new gender would be considered.
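Written out, the update is C(t) = f(t) * C(t-1) + i(t) * c̃(t), applied to each element of the cell state. The sketch below uses hand-picked gate values (assumptions) so that the effect is easy to follow: the first element is fully forgotten and replaced, and the second is kept unchanged.

```python
def update_cell_state(c_prev, f_t, i_t, c_tilde):
    """Element-wise C(t) = f(t) * C(t-1) + i(t) * c_tilde(t)."""
    return [f * c + i * g for c, f, i, g in zip(c_prev, f_t, i_t, c_tilde)]

c_prev = [0.9, -0.4]   # old cell state
f_t = [0.0, 1.0]       # forget the first value, keep the second
i_t = [1.0, 0.0]       # write a new first value, leave the second alone
c_tilde = [0.7, 0.3]   # candidate values

c_t = update_cell_state(c_prev, f_t, i_t, c_tilde)
print(c_t)
```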

Step 4:

We will run a sigmoid layer which decides what parts of the cell state we’re going to output.

Then, we put the cell state through tanh (to push the values to be between −1 and 1).

Finally, we multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

The calculation in this step is fairly straightforward and leads directly to the output.

However, the output consists only of the parts that the previous steps decided to carry forward, and not the entire cell state at once.
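The output step can be sketched in the same style, assuming the usual form h(t) = o(t) * tanh(C(t)), where o(t) is the output of the sigmoid gate. The scalar weights and values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_output(c_t, h_prev, x_t, w_h, w_x, b):
    """Return (o_t, h_t): the output gate value and the filtered output."""
    o_t = sigmoid(w_h * h_prev + w_x * x_t + b)  # which parts to output
    h_t = o_t * math.tanh(c_t)                   # squashed cell state, gated
    return o_t, h_t

# Illustrative values (assumptions).
o_t, h_t = lstm_output(c_t=0.7, h_prev=0.2, x_t=1.0, w_h=0.5, w_x=1.0, b=0.0)
print(o_t, h_t)
```

Because the gate value is below 1, the output is always a scaled-down version of the squashed cell state.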

Summing up all the 4 steps:

In the first step, we found out what needed to be dropped.

In the second step, we decided what new information is added to the cell state.

In the third step, we combined the results of the first two steps to generate the new cell state.

Lastly, we arrived at the output as per requirement.

Next up, let us consider an interesting use-case.

Use Case: Long Short-Term Memory Networks

The use case we will be considering is to predict the next word in a sample short story.

We can start by feeding an LSTM network with sequences of 3 symbols from the text as inputs, each labeled with the 1 symbol that follows it.

Eventually, the neural network will learn to predict the next symbol correctly!
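The following sketch shows how such training pairs can be cut out of a text. It lists every window of 3 symbols in order, which is a simplification: the TensorFlow code in this article picks its windows starting from a random offset. The function name and the short sample, with the comma separated by spaces so that it counts as its own symbol, are assumptions.

```python
def make_training_pairs(symbols, n_input=3):
    """Pair every run of n_input symbols with the symbol that follows it."""
    pairs = []
    for offset in range(len(symbols) - n_input):
        inputs = symbols[offset:offset + n_input]
        label = symbols[offset + n_input]
        pairs.append((inputs, label))
    return pairs

story = "long ago , the mice had a general council".split()
pairs = make_training_pairs(story)
for inputs, label in pairs:
    print(inputs, "->", label)
```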

Dataset:

The LSTM is trained using a sample short story which consists of 112 unique symbols. Comma and period are also considered as unique symbols in this case.

“long ago, the mice had a general council to consider what measures they could take to outwit their common enemy, the cat . some said this, and some said that but at last a young mouse got up and said he had a proposal to make, which he thought would meet the case . you will all agree , said he , that our chief danger consists in the sly and treacherous manner in which the enemy approaches us . now, if we could receive some signal of her approach, we could easily escape from her . i venture, therefore, to propose that a small bell be procured, and attached by a ribbon round the neck of the cat. by this means we should always know when she was about, and could easily retire while she was in the neighborhood. this proposal met with general applause until an old mouse got up and said that is all very well, but who is to bell the cat? the mice looked at one another and nobody spoke. then the old mouse said it is easy to propose impossible remedies .”

Training:

LSTMs operate on numbers, not on text. So, the first requirement is to map each unique symbol to a unique integer, assigned in order of frequency of occurrence.

Doing this will create a customized dictionary that we can make use of later on to map the values.

In the above figure, certain symbols are mapped to be integers as shown.
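
The idea can be tried on a tiny word list. The sketch below follows the same approach as the `build_dataset` function in the implementation further down; the function name `build_dictionary` and the sample sentence are made up for this example.

```python
import collections

def build_dictionary(words):
    """Map each unique word to an integer, most frequent word first."""
    counts = collections.Counter(words).most_common()
    dictionary = {word: idx for idx, (word, _) in enumerate(counts)}
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return dictionary, reverse_dictionary

words = "the cat , the mice , the bell".split()
dictionary, reverse_dictionary = build_dictionary(words)

print(dictionary["the"])        # 0, because "the" is the most frequent symbol
print(reverse_dictionary[0])    # the
```

The reverse dictionary is what lets us turn the integer predicted by the network back into a readable word.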

The network outputs a 112-element vector with one entry per unique symbol, holding the predicted probability that this symbol is the next one.
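
Here is a small sketch of what such a vector looks like. The random scores stand in for the raw network output and are an assumption of this example, as are the helper names `softmax` and `one_hot`; in the implementation below, the softmax is applied inside the loss function.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def one_hot(index, vocab_size):
    """Label vector: 1.0 at the index of the correct symbol, 0.0 elsewhere."""
    vec = np.zeros(vocab_size)
    vec[index] = 1.0
    return vec

vocab_size = 112
rng = np.random.default_rng(0)
scores = rng.normal(size=vocab_size)    # stand-in for the raw network output

probs = softmax(scores)
predicted = int(np.argmax(probs))       # integer id of the most likely symbol
label = one_hot(predicted, vocab_size)
print(probs.shape, predicted)
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same predicted symbol as taking the argmax of the probabilities.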

Implementation:

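The code below is implemented using TensorFlow. It targets the TensorFlow 1.x API (tf.contrib, tf.placeholder and tf.Session are not available under those names in TensorFlow 2.x):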

import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn
import random
import collections
import time
 
start_time = time.time()
 
def elapsed(sec):
    if sec<60:
        return str(sec) + " sec"
    elif sec<(60*60):
        return str(sec/60) + " min"
    else:
        return str(sec/(60*60)) + " hr"

# Target log path
logs_path = '/tmp/tensorflow/rnn_words'
writer = tf.summary.FileWriter(logs_path)

# Text file containing words for training
training_file = 'Story.txt'

def read_data(fname):
    with open(fname) as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    content = [content[i].split() for i in range(len(content))]
    content = np.array(content)
    content = np.reshape(content, [-1, ])
    return content

training_data = read_data(training_file)
print("Loaded training data...")

def build_dataset(words):
    count = collections.Counter(words).most_common()
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

dictionary, reverse_dictionary = build_dataset(training_data)
vocab_size = len(dictionary)

# Parameters
learning_rate = 0.001
training_iters = 50000
display_step = 1000
n_input = 3

# number of units in RNN cell
n_hidden = 512

# tf Graph input
x = tf.placeholder("float", [None, n_input, 1])
y = tf.placeholder("float", [None, vocab_size])

# RNN output node weights and biases
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, vocab_size]))
}
biases = {
    'out': tf.Variable(tf.random_normal([vocab_size]))
}

def RNN(x, weights, biases):

    # reshape to [1, n_input]
    x = tf.reshape(x, [-1, n_input])

    # Generate a n_input-element sequence of inputs
    # (eg. [had] [a] [general] -> [20] [6] [33])
    x = tf.split(x,n_input,1)
 
    # 2-layer LSTM, each layer has n_hidden units.
    # Average Accuracy= 95.20% at 50k iter
    rnn_cell = rnn.MultiRNNCell([rnn.BasicLSTMCell(n_hidden),rnn.BasicLSTMCell(n_hidden)])
 
    # 1-layer LSTM with n_hidden units but with lower accuracy.
    # Average Accuracy= 90.60% 50k iter
    # Uncomment line below to test but comment out the 2-layer rnn.MultiRNNCell above
    # rnn_cell = rnn.BasicLSTMCell(n_hidden)
 
    # generate prediction
    outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)
 
    # there are n_input outputs but
    # we only want the last output
    return tf.matmul(outputs[-1], weights['out']) + biases['out']
 
pred = RNN(x, weights, biases)
 
# Loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate).minimize(cost)
 
# Model evaluation
correct_pred = tf.equal(tf.argmax(pred,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
 
# Initializing the variables
init = tf.global_variables_initializer()
 
# Launch the graph
with tf.Session() as session:
    session.run(init)
    step = 0
    offset = random.randint(0,n_input+1)
    end_offset = n_input + 1
    acc_total = 0
    loss_total = 0
 
    writer.add_graph(session.graph)
 
    while step < training_iters:
        # Generate a minibatch. Add some randomness on selection process.
        if offset > (len(training_data)-end_offset):
            offset = random.randint(0, n_input+1)
 
        symbols_in_keys = [ [dictionary[ str(training_data[i])]] for i in range(offset, offset+n_input) ]
        symbols_in_keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])
 
        symbols_out_onehot = np.zeros([vocab_size], dtype=float)
        symbols_out_onehot[dictionary[str(training_data[offset+n_input])]] = 1.0
        symbols_out_onehot = np.reshape(symbols_out_onehot,[1,-1])
 
        _, acc, loss, onehot_pred = session.run([optimizer, accuracy, cost, pred], \
                                                feed_dict={x: symbols_in_keys, y: symbols_out_onehot})
        loss_total += loss
        acc_total += acc
        if (step+1) % display_step == 0:
            print("Iter= " + str(step+1) + ", Average Loss= " + \
                  "{:.6f}".format(loss_total/display_step) + ", Average Accuracy= " + \
                  "{:.2f}%".format(100*acc_total/display_step))
            acc_total = 0
            loss_total = 0
            symbols_in = [training_data[i] for i in range(offset, offset + n_input)]
            symbols_out = training_data[offset + n_input]
            # onehot_pred is already a NumPy array, so use np.argmax instead of adding a new TF op on every call
            symbols_out_pred = reverse_dictionary[int(np.argmax(onehot_pred, 1)[0])]
            print("%s - [%s] vs [%s]" % (symbols_in,symbols_out,symbols_out_pred))
        step += 1
        offset += (n_input+1)
    print("Optimization Finished!")
    print("Elapsed time: ", elapsed(time.time() - start_time))
    print("Run on command line.")
    print("\ttensorboard --logdir=%s" % (logs_path))
    print("Point your web browser to: http://localhost:6006/")
    while True:
        prompt = "%s words: " % n_input
        sentence = input(prompt)
        sentence = sentence.strip()
        words = sentence.split(' ')
        if len(words) != n_input:
            continue
        try:
            symbols_in_keys = [dictionary[str(words[i])] for i in range(len(words))]
            for i in range(32):
                keys = np.reshape(np.array(symbols_in_keys), [-1, n_input, 1])
                onehot_pred = session.run(pred, feed_dict={x: keys})
                onehot_pred_index = int(np.argmax(onehot_pred, 1)[0])
                sentence = "%s %s" % (sentence,reverse_dictionary[onehot_pred_index])
                symbols_in_keys = symbols_in_keys[1:]
                symbols_in_keys.append(onehot_pred_index)
            print(sentence)
        except KeyError:
            print("Word not in dictionary")
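
The generation loop at the end of the listing slides a 3-symbol window forward: it predicts the next symbol, drops the oldest one and appends the prediction. The sketch below isolates that idea without TensorFlow. The function names and the arithmetic "predictor" are hypothetical stand-ins for the trained network, used only so that the loop can run on its own.

```python
def generate(seed_ids, predict_next, n_words=32):
    """Repeatedly predict the next symbol id and slide the input window forward."""
    window = list(seed_ids)
    generated = []
    for _ in range(n_words):
        next_id = predict_next(window)
        generated.append(next_id)
        window = window[1:] + [next_id]   # drop the oldest id, append the prediction
    return generated

def toy_predict(window, vocab_size=112):
    """Hypothetical stand-in for the trained network; not a real language model."""
    return (sum(window) + 1) % vocab_size

print(generate([20, 6, 33], toy_predict, n_words=5))
```

Since each prediction is fed back in as input, one wrong prediction can influence all the words that follow it.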

This brings us to the end of our article on “Recurrent Neural Networks”. I hope you found this article informative and that it added value to your knowledge.

Originally published at www.edureka.co on November 28, 2018.