An Intro to Collaborative Filtering for Movie Recommendation

Step-by-Step Guide to Recommender System

Recommender System Cheatsheet (image from author’s website)

Recommender systems have become a popular topic as we demand more customized content pushed to our daily feeds. We are all familiar with the recommended videos on YouTube, and most of us have, more than once, been victims of late-night Netflix binge-watching.

There are two popular approaches to recommender systems: collaborative filtering and content-based filtering. Content-based filtering predicts what the audience is likely to prefer based on content properties, e.g. genre, language, video length. Collaborative filtering, by contrast, predicts based on what other similar users prefer. As a result, collaborative filtering leans towards instance-based learning and is usually applied by large companies with huge amounts of data at hand.

In this article, I will focus on collaborative filtering and briefly introduce how to make movie recommendations using two algorithms that fall into this category: K Nearest Neighbour (KNN) and Singular Value Decomposition (SVD).

I used the movie dataset from Kaggle to predict recommended movies at the individual level.

EDA for Recommender System

Each machine learning algorithm calls for a different way of exploring the dataset to get valuable insights. I used the following three techniques to explore the data at hand. For a more comprehensive guide to EDA, please check out my blog.

Semi-Automated Exploratory Data Analysis (EDA) in Python

Comprehensive Data Exploration Process with One-Click

towardsdatascience.com

1. Unique Counts and Data Shape

First, get an overview of how many distinct users and movies are included in the dataset. This can be easily achieved using df.nunique(axis = 0) and then plotting the result in a bar chart.
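As a minimal sketch of this step, assuming the ratings sit in a dataframe with userId, movieId and rating columns (the toy values below are made up for illustration), the distinct counts also give a quick read on how sparse the user-movie matrix is:

```python
import pandas as pd

# Toy ratings table; the real df is assumed to have the same three columns.
toy_df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Number of distinct values in each column.
unique_counts = toy_df.nunique(axis=0)
print(unique_counts)

# Share of all possible user-movie pairs that actually carry a rating.
n_users = unique_counts["userId"]
n_movies = unique_counts["movieId"]
density = len(toy_df) / (n_users * n_movies)
print(f"density: {density:.2%}")
```

A low density is typical for rating data and is one reason the aggregated views later in this section are worth checking.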

2. Univariate Analysis

Univariate analysis, the analysis of one feature at a time, helps us answer three questions:

which movies have the most reviews?
which users provide the most reviews?
what does the distribution of ratings look like?

import matplotlib.pyplot as plt

# univariate analysis
plt.figure(1, figsize = (16,4))
df['movieId'].value_counts()[:50].plot(kind = 'bar') #take top 50 movies
plt.figure(2, figsize = (16,4))
df['userId'].value_counts()[:50].plot(kind = 'bar') #take top 50 users
plt.figure(3, figsize = (8,4))
df['rating'].plot(kind = 'hist')

univariate analysis result (image by author)

Some of the insights we can draw from the univariate analysis are:

1. ratings are not evenly distributed among movies, and the most-rated movie is “356”, which has no more than 350 ratings;

2. ratings are not evenly distributed across users, and the most active user provided around 2,400 ratings;

3. most people are likely to give a rating around 4.

3. Aggregated Analysis

Univariate analysis gives us a view at the level of individual movies or users, whereas aggregated analysis helps us understand the data at the meta level.

1. How many ratings does each user provide?

ratings_per_user = df.groupby('userId')['movieId'].count()
ratings_per_user.hist()

The histogram shows that most users (roughly 560 out of 671, about 80%) have fewer than 250 ratings.

2. How many ratings does each movie receive?

ratings_per_movie = df.groupby('movieId')['userId'].count()
ratings_per_movie.hist()

The histogram shows that most movies (roughly 8,200 out of 9,066, about 90%) have fewer than 25 ratings.
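The shares quoted from the histograms can also be computed directly rather than read off the chart. Here is a small sketch on made-up data; share_below is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

# Toy ratings; the real df is assumed to have userId and movieId columns.
toy_df = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.0, 3.5],
})

def share_below(counts, threshold):
    """Fraction of entries in `counts` that are strictly below `threshold`."""
    return float((counts < threshold).mean())

# Same groupby logic as in the aggregated analysis.
ratings_per_user = toy_df.groupby("userId")["movieId"].count()
ratings_per_movie = toy_df.groupby("movieId")["userId"].count()

users_share = share_below(ratings_per_user, 3)
movies_share = share_below(ratings_per_movie, 3)
print(f"users with fewer than 3 ratings: {users_share:.0%}")
print(f"movies with fewer than 3 ratings: {movies_share:.0%}")
```

On the real data, the thresholds would be 250 for users and 25 for movies.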

At this stage, we should have a fairly clear understanding of the data at hand.

Collaborative-Based Filtering Algorithms

I would like to introduce two collaborative filtering algorithms: K Nearest Neighbour and Singular Value Decomposition. The Surprise library allows us to implement both algorithms in just a few lines of code.

from surprise import KNNWithMeans
from surprise import SVD

# KNN
similarity = {
    "name": "cosine",
    "user_based": False,  # item-based similarity
}
algo_KNN = KNNWithMeans(sim_options = similarity)

# SVD
algo_SVD = SVD()

But it is always better to have a basic understanding of the theory behind each algorithm in order to apply it appropriately.

1. K Nearest Neighbour (KNN)

KNN illustration (image from author’s website)

The option user_based: False means that this KNN uses item-based similarity, so we predict the unknown ratings of item m1 based on similar items with known ratings. You can think of the k nearest neighbour algorithm as representing movie items in an n-dimensional space defined by n users. The distance between points is calculated using cosine similarity, which is determined by the angle between two vectors (shown as m1 and m2 in the diagram). Cosine similarity is preferred over Euclidean distance because it suffers less when the dataset is high in dimensionality.
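To make the item-based view concrete, here is a rough sketch of cosine similarity between two movies, each represented by the ratings it received from users. Restricting the calculation to users who rated both movies is my assumption about how to handle missing ratings, and the toy ratings are made up:

```python
from math import sqrt

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two movies over the users who rated both.

    Each argument maps userId -> rating. Returns 0.0 when no user rated both.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(ratings_a[u] ** 2 for u in common))
    norm_b = sqrt(sum(ratings_b[u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Each movie is a vector of ratings indexed by userId.
m1 = {1: 5.0, 2: 3.0, 3: 4.0}
m2 = {1: 4.0, 2: 2.0, 4: 5.0}  # rated similarly to m1 by users 1 and 2
m3 = {1: 1.0, 2: 5.0}          # rated very differently from m1

sim_12 = cosine_similarity(m1, m2)
sim_13 = cosine_similarity(m1, m3)
print(f"sim(m1, m2) = {sim_12:.3f}, sim(m1, m3) = {sim_13:.3f}")
```

The movies with the highest similarity to m1 are its nearest neighbours, and their known ratings drive the prediction for m1.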

2. Matrix Factorization: Singular Value Decomposition (SVD)

SVD illustration (image from author’s website)

Singular Value Decomposition is a matrix factorization technique that decomposes a matrix into the product of lower-dimensionality matrices and then extracts the latent features from highest importance to lowest. That is quite a long sentence, so let me break it down.

Instead of iterating through individual ratings like KNN, it views the rating matrix as a whole. As a result, it has a lower computation cost than KNN, but it is also less interpretable.

SVD extracts latent features (which are not actual features contained in the dataset, but hidden features the algorithm discovers to be valuable) to form the factorized matrices U and V transposed, and places them in descending order of feature importance, just like from dark blue to light blue in the diagram. It then fills in the blank ratings by taking the product of U and V transposed, weighted by feature importance. These latent feature parameters are learned iteratively by minimizing the error.
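The idea of learning latent factors iteratively can be sketched in a few lines of plain Python. This is a simplified illustration trained with stochastic gradient descent on made-up ratings, not the Surprise implementation; train_mf, predict_rating and all hyperparameter values are my own assumptions:

```python
import random
from math import sqrt

def train_mf(ratings, n_factors=2, n_epochs=1000, lr=0.02, reg=0.02, seed=0):
    """Learn user and movie latent factors with stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    # Small random starting values for every latent factor.
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {m: [rng.gauss(0, 0.1) for _ in range(n_factors)] for m in movies}
    for _ in range(n_epochs):
        for u, m, r in ratings:
            err = r - predict_rating(mu, P, Q, u, m)
            for f in range(n_factors):
                pf, qf = P[u][f], Q[m][f]
                # Step against the gradient of the regularized squared error.
                P[u][f] += lr * (err * qf - reg * pf)
                Q[m][f] += lr * (err * pf - reg * qf)
    return mu, P, Q

def predict_rating(mu, P, Q, u, m):
    """Global mean plus the dot product of user and movie factors."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[m]))

# (userId, movieId, rating) triples; user 2 has not rated movie 2.
toy_ratings = [
    (1, 1, 5.0), (1, 2, 4.0), (1, 3, 1.0),
    (2, 1, 4.0), (2, 3, 1.0),
    (3, 2, 2.0), (3, 3, 5.0),
    (4, 1, 1.0), (4, 3, 4.0),
]

mu, P, Q = train_mf(toy_ratings)
n = len(toy_ratings)
rmse_before = sqrt(sum((r - mu) ** 2 for _, _, r in toy_ratings) / n)
rmse_after = sqrt(
    sum((r - predict_rating(mu, P, Q, u, m)) ** 2 for u, m, r in toy_ratings) / n
)
filled = predict_rating(mu, P, Q, 2, 2)  # estimate for a blank cell
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
print(f"predicted rating for user 2, movie 2: {filled:.2f}")
```

The blank cell is filled from factors learned only from the known ratings, which is the "fill in the blank ratings" step described above.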

Speaking of error, now let’s talk about model evaluation.

Model Evaluation

Collaborative filtering frames the recommender system as a regression model whose output is a numeric rating value, so we can apply regression evaluation metrics to it. If you would like to dive deeper into common evaluation metrics for regression, e.g. linear regression, you may find the model evaluation section in “A Simple and Practical Guide to Linear Regression” helpful.
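For reference, the three metrics used later in this article are simple to compute by hand. Below is a minimal sketch on made-up actual and predicted ratings:

```python
from math import sqrt

actual =    [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

# Per-rating errors: actual minus predicted.
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error
rmse = sqrt(mse)                                 # root mean squared error

print(f"MAE={mae}, MSE={mse}, RMSE={rmse}")
```

RMSE penalizes large misses more heavily than MAE, so looking at both gives a fuller picture of the error.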

A Practical Guide to Linear Regression

From EDA to Feature Engineering to Model Evaluation

towardsdatascience.com

In this exercise, I evaluate both KNN and SVD using the following two methods.

1. Cross Validation

The Surprise library offers a cross_validate function that executes cross validation automatically. For Surprise to understand the dataset, we need to load it into a Surprise Dataset using load_from_df, together with a Reader that sets the rating scale to between 0 and 5.

from surprise import Dataset
from surprise import Reader

# load df into a Surprise Dataset, using a Reader to define the rating scale
reader = Reader(rating_scale = (0,5))
rating_df = Dataset.load_from_df(df[['userId','movieId','rating']], reader)

Then pass both algo_KNN and algo_SVD into the cross_validate function with 5 cross validation folds.

from surprise.model_selection import cross_validate

cross_validate_KNN = cross_validate(algo_KNN, rating_df, measures=['RMSE', 'MAE'], cv=5, verbose=True)

cross_validate_SVD = cross_validate(algo_SVD, rating_df, measures=['RMSE', 'MAE'], cv=5, verbose=True)

The result shows the comparison between KNN and SVD. As shown, SVD has smaller RMSE and MAE values, and hence performs better than KNN; it also takes significantly less time to compute.
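To see what the 5 folds mean mechanically, here is a plain-Python sketch of how rating indices could be divided so that every rating is used for testing exactly once. It only illustrates the idea and is not how Surprise implements it; k_fold_indices is a hypothetical helper:

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, like dealing cards.
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, n_splits=5))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"fold {fold_number}: train on {len(train_idx)}, test on {len(test_idx)}")
```

The model is fitted once per fold, and the reported RMSE and MAE are computed on each held-out fold in turn.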

2. Train-Test Split Evaluation

This approach splits the dataset into 80% for training and 20% for testing. Instead of building the model 5 times as in cross validation, it trains the model once and tests it once. I have defined the function train_test_algo to print out RMSE, MAE and MSE and return the test dataframe.

from surprise.model_selection import train_test_split
from surprise import accuracy

# define train test function
def train_test_algo(algo, label):
    training_set, testing_set = train_test_split(rating_df, test_size = 0.2)
    algo.fit(training_set)
    test_output = algo.test(testing_set)
    test_df = pd.DataFrame(test_output)
    
    print("RMSE -",label, accuracy.rmse(test_output, verbose = False))
    print("MAE -", label, accuracy.mae(test_output, verbose=False))
    print("MSE -", label, accuracy.mse(test_output, verbose=False))
    
    return test_df

Let’s compare the model accuracy and have a glimpse of the test output.

train_test_KNN = train_test_algo(algo_KNN, "algo_KNN")
print(train_test_KNN.head())

train_test_SVD = train_test_algo(algo_SVD, "algo_SVD")
print(train_test_SVD.head())

The result is very similar to that of cross validation, indicating that SVD has less error.
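One detail worth keeping in mind when comparing two models this way: because train_test_algo splits the data inside the function without a fixed seed, each call draws its own random 80/20 split. Surprise's train_test_split accepts a random_state argument for this purpose. The plain-Python sketch below shows the general idea of scoring two candidates on the same held-out rows; the two mean-based predictors are hypothetical stand-ins, not KNN or SVD, and the ratings are made up:

```python
import random
from collections import defaultdict
from math import sqrt

# (userId, movieId, rating) triples.
ratings = [
    (1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0), (2, 30, 2.0), (3, 20, 3.5),
    (3, 30, 1.5), (4, 10, 4.5), (4, 20, 2.5), (5, 30, 2.0), (5, 10, 4.0),
]

def split_once(rows, test_size=0.2, seed=42):
    """Shuffle with a fixed seed and hold out a test_size fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

train, test = split_once(ratings)

# Stand-in model 1: always predict the global mean of the training ratings.
global_mean = sum(r for _, _, r in train) / len(train)

# Stand-in model 2: predict each movie's mean training rating.
by_movie = defaultdict(list)
for _, movie, r in train:
    by_movie[movie].append(r)

def movie_mean(movie):
    seen = by_movie.get(movie)
    return sum(seen) / len(seen) if seen else global_mean

# Both models are scored on exactly the same test rows.
rmse_global = rmse([(r, global_mean) for _, _, r in test])
rmse_movie = rmse([(r, movie_mean(m)) for _, m, r in test])
print(f"global mean RMSE: {rmse_global:.3f}, movie mean RMSE: {rmse_movie:.3f}")
```

With a shared split, any difference in error comes from the models rather than from which ratings happened to land in the test set.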

Provide Top Recommendation

Building the model is not enough. As you can see above, the current test output only predicts ratings for the users and movies randomly allocated to the test set, and we also want to see the actual recommendations with movie names.

First, let’s load the movie metadata table and the links table, so that we can translate movieId into a movie name.

movie_df = pd.read_csv("../input/the-movies-dataset/movies_metadata.csv")
links_df = pd.read_csv("../input/the-movies-dataset/links.csv")
movie_df['imdb_id'] = movie_df['imdb_id'].apply(lambda x: str(x)[2:].lstrip("0"))
links_df['imdbId'] = links_df['imdbId'].astype(str)

Here is a diagram of how the three dataframes link together.

merging tables (image by author from website)

Then I defined a prediction(algo, users_K) function that creates a dataframe of predicted ratings for the users you are interested in. Note that range(1, users_K) covers userId values 1 to users_K - 1, and range(1, 9067) covers movieId values 1 to 9,066; the prediction algorithm is called for each user-movie pair.

def prediction(algo, users_K):
    pred_list = []
    for userId in range(1,users_K):
        for movieId in range(1,9067):
            rating = algo.predict(userId, movieId).est
            pred_list.append([userId, movieId, rating])
    pred_df = pd.DataFrame(pred_list, columns = ['userId', 'movieId', 'rating'])
    return pred_df
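The loop above assumes that user and movie IDs are contiguous integers starting at 1. If the IDs in your data have gaps, one alternative is to take the IDs from the ratings dataframe itself. Below is a sketch of that variant; StubAlgo is a hypothetical stand-in for a fitted model (anything exposing predict(uid, iid).est), and the toy rows are made up:

```python
from collections import namedtuple

import pandas as pd

Prediction = namedtuple("Prediction", ["uid", "iid", "est"])

class StubAlgo:
    """Hypothetical stand-in for a fitted model with a predict(uid, iid) method."""

    def predict(self, uid, iid):
        # Arbitrary deterministic estimate, only to make the sketch runnable.
        return Prediction(uid, iid, est=3.0 + 0.1 * (iid % 5))

def prediction_for_known_ids(algo, user_ids, movie_ids):
    """Predict a rating for every pair of the given user and movie IDs."""
    pred_list = [
        [user_id, movie_id, algo.predict(user_id, movie_id).est]
        for user_id in user_ids
        for movie_id in movie_ids
    ]
    return pd.DataFrame(pred_list, columns=["userId", "movieId", "rating"])

# Toy ratings with non-contiguous movie IDs.
toy_df = pd.DataFrame({
    "userId":  [1, 1, 2],
    "movieId": [1, 356, 2571],
    "rating":  [4.0, 5.0, 3.5],
})

pred_df = prediction_for_known_ids(
    StubAlgo(),
    user_ids=toy_df["userId"].unique()[:2],  # first 2 users in the data
    movie_ids=toy_df["movieId"].unique(),    # only IDs that actually exist
)
print(pred_df)
```

This variant predicts only for IDs that appear in the ratings, so no predictions are spent on IDs that do not exist in the dataset.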

Lastly, top_recommendations(pred_df, top_N) performs the following procedure:

1) merges the datasets together using pd.merge();

2) groups the ratings by userId and sorts them by rating value in descending order using sort_values();

3) gets the top values using head();

4) returns both the sorted recommendations and the top recommended movies.

def top_recommendations(pred_df, top_N):
    link_movie = pd.merge(pred_df, links_df, how='inner', left_on='movieId', right_on='movieId')
    recommended_movie = pd.merge(link_movie, movie_df, how='left', left_on='imdbId', right_on='imdb_id')[['userId', 'movieId', 'rating', 'imdb_id', 'title']]
    # order rows by user, then by predicted rating from highest to lowest
    sorted_df = recommended_movie.sort_values(['userId', 'rating'], ascending=[True, False]).reset_index(drop=True)
    top_recommended_movies = sorted_df.groupby('userId').head(top_N)
    return sorted_df, top_recommended_movies
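The sort-then-head pattern is easier to follow on a tiny example. Here is a sketch with made-up predictions and placeholder titles, showing one way to get the top 2 movies per user:

```python
import pandas as pd

# Made-up predicted ratings for two users and three placeholder titles.
toy_pred = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "title":  ["A", "B", "C", "A", "B", "C"],
    "rating": [3.2, 4.8, 4.1, 4.9, 2.7, 3.6],
})

# Order by user, then by predicted rating from highest to lowest.
sorted_toy = toy_pred.sort_values(
    ["userId", "rating"], ascending=[True, False]
).reset_index(drop=True)

# head(2) within each user group keeps that user's two best predictions.
top_2 = sorted_toy.groupby("userId").head(2)
print(top_2)
```

Because the rows are already sorted within each user, head() simply takes the first top_N rows of every group.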

As a side note, when merging dataframes, we need to be mindful of the datatypes of the keys being joined, or else we will unexpectedly get a lot of empty results.
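Here is a small sketch of that pitfall, using two toy rows shaped like the metadata and links tables. The IMDb ID is stored as "tt0114709" in one table and as the number 114709 in the other, so the keys only match after both sides are normalized to the same string format:

```python
import pandas as pd

# Toy rows shaped like movies_metadata.csv and links.csv (illustrative only).
movie_toy = pd.DataFrame({
    "imdb_id": ["tt0114709", "tt0113497"],
    "title":   ["Toy Story", "Jumanji"],
})
links_toy = pd.DataFrame({
    "movieId": [1, 2],
    "imdbId":  [114709, 113497],
})

# Both keys are strings now, but "114709" never equals "tt0114709".
links_toy["imdbId"] = links_toy["imdbId"].astype(str)
unmatched = pd.merge(links_toy, movie_toy, how="inner",
                     left_on="imdbId", right_on="imdb_id")
print(len(unmatched))  # no rows survive the inner join

# Strip the "tt" prefix and leading zeros, as done when loading the tables.
movie_toy["imdb_id"] = movie_toy["imdb_id"].apply(lambda x: str(x)[2:].lstrip("0"))
matched = pd.merge(links_toy, movie_toy, how="inner",
                   left_on="imdbId", right_on="imdb_id")
print(matched[["movieId", "title"]])
```

Checking the row count right after each merge is a cheap way to catch this kind of silent mismatch.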

Lastly, compare the top 3 predictions for each user given by KNN vs. SVD.

# KNN predictions
pred_KNN = prediction(algo_KNN, 10)
recommended_movies_KNN, top_recommended_movies_KNN = top_recommendations(pred_KNN, 3)

# SVD predictions
pred_SVD = prediction(algo_SVD, 10)
recommended_movies_SVD, top_recommended_movies_SVD = top_recommendations(pred_SVD, 3)

KNN vs. SVD recommendations (image from author’s website)

As you can see, the two algorithms give different recommendations, and KNN appears to be much more generous in terms of ratings.

Hope you enjoyed this article, and thanks for reading this far! If you would like to access the full code, please visit the Code Snippet on my website. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.

Take-Home Message

This article takes you through the procedure of building a recommender system and compares the recommendations provided by KNN vs. SVD.

A top-line summary:

EDA for Recommender System: univariate analysis, aggregated analysis
Two Collaborative-Based Filtering Algorithms: K Nearest Neighbour vs. Singular Value Decomposition
Model Evaluation: cross validation vs. train-test split
Provide Top Recommendations
