Data Science with Python — NLP Use Case
This article is part of the “Data Science with Python” series.
In a previous article, I talked about implementing NLP in Python. Today, we’ll see a concrete application of NLP using IMDb reviews.
The objective will be to predict a rating from a review.
Data Collection and Preparation
The first step in any data analysis project is to collect and prepare the data.
The IMDb dataset is a collection of movie reviews, ratings, and other metadata compiled by the Internet Movie Database (IMDb). The commonly used version contains 50,000 labeled movie reviews, divided into a training set of 25,000 reviews and a testing set of 25,000 reviews, each labeled as positive or negative based on the sentiment expressed in the text.
To download IMDb reviews, we can use the imdb package (installed with pip install cinemagoer), which lets us fetch reviews and ratings programmatically. Here is an example of how you can generate a dataset of reviews and their associated ratings and save it for further use:
from imdb import Cinemagoer
import pandas as pd
from tqdm import tqdm

def generate_dataset():
    # create an instance of the Cinemagoer class
    ia = Cinemagoer()
    # search for the "top-rated" movies and retrieve their IDs
    top250 = ia.get_top250_movies()
    movie_ids = [m.getID() for m in top250]
    # retrieve the reviews (and the reviewer's rating) for each movie
    reviews = []
    for mid in tqdm(movie_ids):
        movie = ia.get_movie_reviews(mid)
        movie_data = movie['data']
        for review in movie_data['reviews']:
            reviews.append({'review': review['content'],
                            'rating': review.get('rating')})
    # create a pandas DataFrame from the reviews and their ratings
    df = pd.DataFrame(reviews)
    # save the DataFrame as a CSV file
    df.to_csv('reviews.csv', index=False)
Note: I use tqdm to see the progress of the dataset generation.
Then, we can just write a function to load our dataset:
def load_dataset():
    df = pd.read_csv('reviews.csv')
    return df
Before we can start the analysis, we need to preprocess the text data to remove noise and irrelevant information. This involves converting the text to lowercase, removing stopwords (common words that do not add meaning to the text, such as “the”, “and”, “a”), removing punctuation marks and special characters, and removing HTML tags if any. We can use the Natural Language Toolkit (NLTK) library in Python to perform these preprocessing steps.
import re
import string

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    text = text.lower()
    # remove HTML tags, if any
    text = re.sub(r'<[^>]+>', ' ', text)
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    words = [w for w in words if w not in stop_words]
    # remove punctuation marks and special characters
    table = str.maketrans('', '', string.punctuation + '\r\n\t')
    words = [w.translate(table) for w in words]
    # drop tokens that are now empty (pure punctuation)
    words = [w for w in words if w]
    # join the words back into a single string
    text = ' '.join(words)
    return text

df = load_dataset()
df['review'] = df['review'].apply(preprocess_text)
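As a quick sanity check, here is what the function does to a short, made-up review (the exact result may vary slightly depending on your NLTK version and stopword list):

sample = "This movie was one of the BEST films ever made... <br /> A true masterpiece!"
print(preprocess_text(sample))
# expected output (approximately): movie one best films ever made true masterpiece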
After the text data has been cleaned and preprocessed, we can tokenize the text into individual words or tokens. Tokenization involves splitting the text into words or subwords, which can be used as features for machine learning models. The vectorizers we will use later handle tokenization on their own, but an explicit token column will be handy for exploring the data:
def tokenize_text(text):
    # use the word_tokenize function to split the text into words
    words = word_tokenize(text)
    return words

df['tokens'] = df['review'].apply(tokenize_text)
Exploratory Data Analysis
After loading and preprocessing the IMDb review dataset, the next step is to perform exploratory data analysis (EDA) to gain insights and understanding of the data. EDA involves using statistical and visual methods to summarize and analyze the data.
We’ll need matplotlib and seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
First, let’s look at the distribution of movie ratings in the dataset:
sns.histplot(df['rating'], bins=10)
plt.title('Distribution of Movie Ratings')
plt.show()

From this distribution, we can see that the dataset is heavily imbalanced: since the reviews come from top-rated movies, high ratings dominate. An ideal dataset would have a more homogeneous distribution of ratings; with so little variation in the training data, the model will struggle to predict the less frequent ratings correctly. In theory, we should collect reviews from a wider range of movies to balance the distribution, but let's not bother with this for the example.
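To put numbers on the imbalance, we can simply count how many reviews fall under each rating:

print(df['rating'].value_counts().sort_index())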
Next, we can analyze the distribution of review lengths:
# calculate the length of each review in words
df['review_length'] = df['tokens'].apply(len)
# plot the distribution of review lengths
sns.histplot(df['review_length'], bins=50)
plt.title('Distribution of Review Lengths')
plt.show()
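If you prefer numbers to a histogram, pandas' describe method gives the basic statistics of the review lengths:

print(df['review_length'].describe())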

We can also analyze the most common words in the reviews:
from collections import Counter

# create a Counter object to count the occurrences of each word
word_count = Counter()
for tokens in df['tokens']:
    word_count.update(tokens)
most_common = word_count.most_common(20)
# plot the most common words
sns.barplot(x=[w[0] for w in most_common], y=[w[1] for w in most_common])
plt.title('Most Common Words')
plt.xticks(rotation=90)
plt.show()

Finally, we can analyze the correlation between review length and movie rating:
sns.scatterplot(x='review_length', y='rating', data=df)
plt.title('Correlation between Review Length and Movie Rating')
plt.show()
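The scatter plot is hard to read with this many points, so a correlation coefficient gives a more compact summary (values close to 0 indicate little linear relationship):

print(df['review_length'].corr(df['rating']))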

Feature Extraction
After performing exploratory data analysis on the IMDb review dataset, the next step is to extract features from the preprocessed text data. Feature extraction is the process of converting the text data into numerical features that can be used as input for machine learning models.
There are several methods for feature extraction in natural language processing (NLP), including bag-of-words, TF-IDF, and word embeddings. I will show you how to use the bag-of-words approach.
This approach represents each document as a vector of word frequencies. We can use the CountVectorizer class from the scikit-learn library to convert the preprocessed text data into a bag-of-words representation:
from sklearn.feature_extraction.text import CountVectorizer

# drop rows with missing values (e.g. reviews without a rating)
df = df.dropna()

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(df['review'])
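It is worth checking the shape of the resulting matrix: one row per review and one column per vocabulary word (the exact numbers depend on the reviews you collected; get_feature_names_out requires a reasonably recent scikit-learn):

print(bow.shape)                                # (number of reviews, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  # first few words of the vocabulary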
Model Building
Before building our model, let's create our training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(bow, df['rating'], test_size=0.2, random_state=42)
Now, we can build our model, fit it, and make predictions:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
Finally, we can evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1 score:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))
Then, we can try to improve our model.
Fine-Tuning
Fine-tuning is the process of adjusting the hyperparameters of the machine learning model to optimize its performance. In the context of NLP, this can involve tuning the parameters of the feature extraction method or the machine learning algorithm itself.
One way to fine-tune our model is to experiment with different hyperparameters such as the regularization strength and solver algorithm. For example, we can try different values of the C parameter, which controls the inverse of the regularization strength (smaller values mean stronger regularization):
best_score = 0
best_params = None
for c in [0.01, 0.1, 1, 10, 100]:
    lr = LogisticRegression(max_iter=1000, C=c, solver='lbfgs', multi_class='auto')
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    score = f1_score(y_test, y_pred, average='macro')
    if score > best_score:
        best_score = score
        best_params = {'C': c}
print("Best score:", best_score)
print("Best params:", best_params)
# train the model with the best parameters
lr = LogisticRegression(max_iter=1000, C=best_params['C'], solver='lbfgs', multi_class='auto')
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))
Another option is to change the feature extraction method: instead of the CountVectorizer, we can use the TfidfVectorizer, which weights word counts by how informative each word is across the corpus:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_bow = tfidf.fit_transform(df['review'])
X_train, X_test, y_train, y_test = train_test_split(tfidf_bow, df['rating'], test_size=0.2, random_state=42)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average='macro'))
print("Recall:", recall_score(y_test, y_pred, average='macro'))
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))
Final Note
I provided just one example of an NLP task. There are many others, such as sentiment analysis or topic modeling; maybe I'll cover them later.
To be sure to don’t miss the next articles of this series, feel free to follow me!