NLP-Day 14: Get Loopy With Recurrent Neural Networks (Part 2)
Text classification with Recurrent Neural Networks
Recurrent Neural Networks #30DaysOfNLP [Image by Author]
Yesterday, we covered the underlying concepts and mechanisms of a Recurrent Neural Network.
We did this, however, just in theory.
Now it’s time to start coding, get our hands dirty, and implement a Recurrent Neural Network for ourselves.
In the following sections, we’re going to utilize the Keras API once again and create our own Recurrent Neural Network.
We will use the IMDB Movie Review Dataset from the previous episodes and classify the reviews as either being positive or negative. Using the same dataset allows us to compare the results between a Convolutional and a Recurrent Neural Network.
So take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP.
In need of a dataset
Same ol’ same ol’.
No data. No model. No prediction.
Thus, we need to get our hands on a dataset. Luckily, we can utilize the IMDB Movie Review Dataset provided by Stanford University. Once it is downloaded, we can start our preprocessing journey.
Nothing fancy so far. We import the tarfile module, open the archive, and call extractall() to unpack its contents, allowing us to make use of the training data.
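As a minimal sketch of this step, the extraction could look roughly like the following. The helper name extract_archive and the archive file name are assumptions made for illustration; adjust the path to wherever the download was saved.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir="."):
    """Extract every member of a tar archive into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect the compression (e.g. gzip) on its own.
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(path=target)
    return target


# Assumed file name for this sketch; nothing happens if the file is missing.
ARCHIVE = Path("aclImdb_v1.tar.gz")
if ARCHIVE.exists():
    extract_archive(ARCHIVE)
```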
Creating the first dataset
Now that we have the files extracted, we need to create a dataset we can process further.
Thus, we define a helper function that returns a basic list.
After specifying the “positive” and “negative” subfolders and a simple regex pattern to strip the text of unnecessary HTML tags, we’re ready to proceed.
Inside our second helper function append_data(), we simply iterate over each file inside the subfolder and append the text as well as the associated label to our dataset.
Once we have finished both iterations, we return the dataset.
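The two helpers described above could be sketched as follows. The subfolder names pos and neg, the .txt file extension, the (text, label) tuple layout, and the label encoding (1 for positive, 0 for negative) are assumptions of this sketch rather than details taken from the original code.

```python
import re
from pathlib import Path

# Simple pattern that matches HTML tags such as <br />.
HTML_TAG = re.compile(r"<.*?>")


def append_data(dataset, folder, label):
    """Append one (text, label) tuple per .txt file found in folder."""
    for file_path in sorted(Path(folder).glob("*.txt")):
        text = file_path.read_text(encoding="utf-8")
        # Replace every tag with a space so neighbouring words stay separated.
        dataset.append((HTML_TAG.sub(" ", text), label))


def create_dataset(base_dir):
    """Return a list of (text, label) tuples; 1 = positive, 0 = negative."""
    dataset = []
    append_data(dataset, Path(base_dir) / "pos", 1)  # assumed subfolder name
    append_data(dataset, Path(base_dir) / "neg", 0)  # assumed subfolder name
    return dataset
```

Note that a list built this way holds all positive reviews first and all negative reviews second, so it has to be shuffled before it is split into training and test data.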
Vectorizing the data
Although we have already created a decent dataset, we still need to work on the text representation: we need to vectorize the data.
We define yet another helper function that returns the vectorized data as well as the corresponding target label.
First of all, we instantiate the TreebankWordTokenizer and load the pre-trained “Google-News” word vectors.
Next, we can iterate over the dataset and extract the tokens from each sample. Iterating over the list of tokens, we represent each token with the associated word vector. We successfully vectorized our data.
In the end, we simply return the vectorized data as well as the target label.
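Here is a rough sketch of such a helper. To keep it runnable without the large pre-trained file, the tokenizer and the word vectors are passed in as arguments; the function name, the (text, label) input layout, and the choice to skip tokens that have no pre-trained vector are assumptions of this sketch.

```python
def tokenize_and_vectorize(dataset, tokenize, word_vectors):
    """Turn (text, label) pairs into lists of word vectors plus labels.

    tokenize is any callable mapping a string to a list of tokens;
    word_vectors is any mapping from token to vector that supports `in`.
    """
    vectorized_data, labels = [], []
    for text, label in dataset:
        tokens = tokenize(text)
        # Assumption: tokens without a pre-trained vector are simply skipped.
        vectors = [word_vectors[token] for token in tokens if token in word_vectors]
        vectorized_data.append(vectors)
        labels.append(label)
    return vectorized_data, labels


# Possible wiring with the libraries named in the text (not executed here;
# the file path is a placeholder for wherever the vectors are stored):
#   from nltk.tokenize import TreebankWordTokenizer
#   from gensim.models import KeyedVectors
#   tokenizer = TreebankWordTokenizer()
#   word_vectors = KeyedVectors.load_word2vec_format(
#       "GoogleNews-vectors-negative300.bin.gz", binary=True)
#   x, y = tokenize_and_vectorize(dataset, tokenizer.tokenize, word_vectors)
```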
Some more preprocessing
The reviews come in different lengths.
We, however, need a homogeneous representation of the input data. This means we have to either truncate or pad the input data appropriately.
Business as usual.
We define a new function that returns the preprocessed data as a list.
Inside the function, we simply iterate over the dataset and check for each sample if it’s longer or shorter than the specified maximum length of 400 tokens.
If our sample contains more tokens than the maximum length, we simply drop the last tokens. If it’s shorter, we append zero vectors in the same shape as the word vector, until we reach the required length.
Next, we return the padded or truncated data.
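The padding and truncating logic can be sketched in plain Python. The function name and the embedding_dim parameter (needed to build zero vectors even for an empty sample) are assumptions of this sketch.

```python
def pad_or_truncate(data, max_len=400, embedding_dim=300):
    """Return a new list in which every sample holds exactly max_len vectors."""
    processed = []
    for sample in data:
        # Slicing drops the trailing tokens of samples that are too long.
        fixed = list(sample[:max_len])
        # Shorter samples are filled up with zero vectors of the same size.
        missing = max_len - len(fixed)
        fixed.extend([0.0] * embedding_dim for _ in range(missing))
        processed.append(fixed)
    return processed
```

Called with a toy max_len of 3, a sample with five vectors loses its last two, while a sample with one vector gains two zero vectors.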
Before we can finally start building and training our model, we need to do one last step. We have to split our dataset into training and test data and convert it into NumPy arrays of the right shape in order to make Keras happy.
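One way to sketch this last step is shown below. The function name, the 80/20 split, the seed, and the shuffling before the split are assumptions; the shuffle is there so that both classes end up in the training and the test portion.

```python
import numpy as np


def split_and_reshape(data, labels, train_fraction=0.8,
                      max_len=400, embedding_dim=300, seed=0):
    """Shuffle, split, and convert the data into arrays that Keras accepts."""
    x = np.asarray(data, dtype="float32").reshape(len(data), max_len, embedding_dim)
    y = np.asarray(labels, dtype="float32")
    # Shuffle samples and labels with the same permutation.
    order = np.random.default_rng(seed).permutation(len(x))
    x, y = x[order], y[order]
    split_point = int(len(x) * train_fraction)
    return x[:split_point], y[:split_point], x[split_point:], y[split_point:]
```

Keep the memory footprint in mind: if all 25,000 training reviews were used, an array of shape (25000, 400, 300) in float32 would take roughly 12 GB, so working with a subset may be necessary.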
Finally, it’s training time
With all the preprocessing steps out of our way, we can finally build our Recurrent Neural Network.
First of all, we define an input layer with the shape of 400 rows and 300 columns. The rows are the maximum number of tokens per review and the columns represent the word vector dimensions.
Next, we add a SimpleRNN layer with 50 neurons. We set return_sequences=True, which tells our model to return the layer's output at every time step instead of only the last one, resulting in 400 vectors with a length of 50 each.
The SimpleRNN layer is followed by a dropout and a flatten layer. The flatten layer flattens, as the name suggests, the previous output (400x50) into a single vector of 20,000 values so that it can be processed by a dense layer.
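To make these shapes tangible, here is a small NumPy sketch of what a plain tanh recurrent layer computes when every time step is returned. It is not the Keras implementation, and the random weights only stand in for trained ones; all names are chosen for this illustration.

```python
import numpy as np


def simple_rnn_forward(x, w_in, w_rec, bias):
    """Run a plain tanh RNN over x (time_steps, features); return all states."""
    time_steps = x.shape[0]
    units = w_rec.shape[0]
    h = np.zeros(units)
    outputs = np.empty((time_steps, units))
    for t in range(time_steps):
        # The new state depends on the current input and the previous state.
        h = np.tanh(x[t] @ w_in + h @ w_rec + bias)
        outputs[t] = h
    return outputs


rng = np.random.default_rng(0)
x = rng.normal(size=(400, 300))                 # one padded review
w_in = rng.normal(scale=0.05, size=(300, 50))   # input weights
w_rec = rng.normal(scale=0.05, size=(50, 50))   # recurrent weights
bias = np.zeros(50)

states = simple_rnn_forward(x, w_in, w_rec, bias)
flat = states.reshape(-1)                       # what a flatten step produces
print(states.shape, flat.shape)                 # (400, 50) (20000,)
```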
The final layer is a dense layer with 1 unit and a sigmoid activation function. This layer outputs a single probability that tells us whether a review is classified as positive or negative.
Before we can run our training process, we need to compile the model.
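Put together, the architecture and the compile step could look roughly like this. The layer sizes follow the description above, while the dropout rate, the loss, and the optimizer are assumptions of this sketch and not necessarily the values used for the reported results.

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 400        # maximum number of tokens per review
EMBEDDING_DIM = 300  # word vector dimensions


def build_model(max_len=MAX_LEN, embedding_dim=EMBEDDING_DIM,
                units=50, dropout_rate=0.2):
    """Build and compile the recurrent classifier described in the text."""
    model = keras.Sequential([
        keras.Input(shape=(max_len, embedding_dim)),
        layers.SimpleRNN(units, return_sequences=True),
        layers.Dropout(dropout_rate),   # assumed rate
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),
    ])
    # Assumed settings for a binary classification task.
    model.compile(loss="binary_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model


model = build_model()
model.summary()
```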
Now we have a model that is just waiting to be trained.
After setting the random seed, we start the training process by calling the fit() function. We run this process for 2 epochs and receive the following result.
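A minimal sketch of the training call is shown below. To keep it fast and self-contained it uses a tiny model and random stand-in data instead of the real (samples, 400, 300) arrays; the seed value, the batch size, the dropout rate, and the optimizer are assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

SEED = 1337  # arbitrary value chosen for this sketch
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Random stand-in data; the real inputs would have shape (samples, 400, 300).
time_steps, dims = 10, 4
x_train = np.random.rand(32, time_steps, dims).astype("float32")
y_train = np.random.randint(0, 2, size=32).astype("float32")
x_test = np.random.rand(8, time_steps, dims).astype("float32")
y_test = np.random.randint(0, 2, size=8).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(time_steps, dims)),
    layers.SimpleRNN(5, return_sequences=True),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="rmsprop",
              metrics=["accuracy"])

# Two epochs, with the held-out data used to report the validation accuracy.
history = model.fit(x_train, y_train, batch_size=8, epochs=2,
                    validation_data=(x_test, y_test), verbose=0)
print(history.history["val_accuracy"])
```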
We end up with a validation accuracy of ~0.81, which unfortunately is not an improvement over the score of ~0.87 we obtained after training our Convolutional Neural Network.
Our score is even worse. We could try to improve our network by tuning the hyperparameters and specifying a different maximum number of tokens. We chose the value of 400 arbitrarily, which can lead to either excessive padding or too much information loss when a lot of tokens are dropped from a review.
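One possible way to pick the maximum length less arbitrarily is to look at how long the reviews actually are. The helper below is a hypothetical sketch, not part of the original code: it returns the smallest length that lets a chosen share of the samples fit without truncation.

```python
import math


def suggest_max_len(token_counts, coverage=0.9):
    """Smallest length such that at least `coverage` of the samples fit."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    ordered = sorted(token_counts)
    # Index of the sample that marks the requested share of the data.
    index = max(math.ceil(coverage * len(ordered)) - 1, 0)
    return ordered[index]
```

It could be fed with the number of tokens per review, for example [len(sample) for sample in vectorized_data], before the padding step.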
Conclusion
In this article, we implemented our own Recurrent Neural Network. We utilized the same dataset as before and classified movie reviews as either being positive or negative.
However, we did not improve. In fact, we performed even worse.
So why should we bother with Recurrent Neural Networks at all?
Although we couldn't improve the accuracy score, we encountered a fundamental and absolutely crucial concept in Natural Language Processing: Memory.
In the next article, we refine our understanding of memory by taking a look at the concept of Long Short-Term Memory (LSTM).
So take a seat, don’t go anywhere, make sure to follow, and never miss a single day of the ongoing series #30DaysOfNLP.