Recommendation System: User-Based Collaborative Filtering
Python user-user collaborative filtering to recommend items based on user similarities
User-based collaborative filtering is also called user-user collaborative filtering. It is a type of recommendation system algorithm that uses user similarity to make product recommendations.
In this tutorial, we will cover:
- What is user-based (user-user) collaborative filtering?
- How to create a user-product matrix?
- How to process data for user-based collaborative filtering?
- How to identify similar users?
- How to narrow down the items pool?
- How to rank items for the recommendation?
- How to predict the rating score?
Resources for this post:
- Video tutorial on YouTube
- Python code is at the end of the post. Click here for the notebook.
- More video tutorials on recommendation system
- More blog posts on recommendation system
Let’s get started!
Step 0: User-Based Collaborative Filtering Recommendation Algorithm
First, let’s understand how user-based collaborative filtering works.
User-based collaborative filtering makes recommendations based on user-product interactions in the past. The assumption behind the algorithm is that similar users like similar products.
A user-based collaborative filtering algorithm usually has the following steps:
- Find similar users based on interactions with common items.
- Identify the items that similar users rated highly but that the active user of interest has not been exposed to.
- Calculate the weighted average score for each item.
- Rank items based on the score and pick the top n items to recommend.
This graph illustrates how user-based collaborative filtering works using a simplified example.
- Ms. Blond likes apples. Ms. Black likes watermelon and pineapple. Ms. Purple likes watermelon and grapes.
- Because Ms. Black and Ms. Purple like the same fruit, watermelon, they are similar users.
- Since Ms. Black likes pineapple and Ms. Purple has not been exposed to pineapple yet, the recommendation system recommends pineapple to Ms. Purple.
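The fruit example can be sketched in a few lines of pandas. This is only an illustration under simplifying assumptions: the `likes` table is made up from the example above, and similarity is measured as the number of fruits liked in common, which is cruder than the Pearson correlation used later in this tutorial.

```python
import pandas as pd

# Hypothetical like table for the fruit example (1 = likes, 0 = no interaction)
likes = pd.DataFrame(
    [[1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 1, 0, 1]],
    index=["Blond", "Black", "Purple"],
    columns=["apple", "watermelon", "pineapple", "grapes"],
)

# Number of fruits liked in common for every pair of users
overlap = likes @ likes.T

target = "Purple"
# The most similar other user, measured by overlap
neighbor = overlap.loc[target].drop(target).idxmax()

# Fruits the neighbor likes that the target has not been exposed to
mask = (likes.loc[neighbor] == 1) & (likes.loc[target] == 0)
recommended = likes.columns[mask].tolist()

print(neighbor, recommended)
```

With this toy table, Ms. Black is the closest user to Ms. Purple, and pineapple is the only fruit left to recommend.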
Step 1: Import Python Libraries
In the first step, we will import the Python libraries pandas, numpy, and scipy.stats. These three libraries are for data processing and calculations.
We also import seaborn for visualization and cosine_similarity for calculating similarity scores.
# Data processing
import pandas as pd
import numpy as np
import scipy.stats
# Visualization
import seaborn as sns
# Similarity
from sklearn.metrics.pairwise import cosine_similarity
Step 2: Download And Read Data
This tutorial uses the MovieLens dataset. This dataset contains actual user ratings of movies.
In step 2, we will follow the steps below to get the datasets:
- Go to https://grouplens.org/datasets/movielens/
- Download the 100k dataset with the file name “ml-latest-small.zip”
- Unzip “ml-latest-small.zip”
- Copy the “ml-latest-small” folder to your project folder
Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.
Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
# Change directory
import os
os.chdir("drive/My Drive/contents/recommendation_system")
# Print out the current directory
!pwd
Output
Mounted at /content/drive
/content/drive/My Drive/contents/recommendation_system
There are multiple datasets in the 100k MovieLens folder. For this tutorial, we will use two of them: ratings and movies.
Now let’s read the rating data.
# Read in data
ratings=pd.read_csv('ml-latest-small/ratings.csv')
# Take a look at the data
ratings.head()
There are four columns in the ratings dataset: userId, movieId, rating, and timestamp.
The dataset has over 100k records, and there is no missing data.
# Get the dataset information
ratings.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 100836 non-null int64
1 movieId 100836 non-null int64
2 rating 100836 non-null float64
3 timestamp 100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB
The 100k ratings are from 610 users on 9724 movies. The rating has ten unique values from 0.5 to 5.
# Number of users
print('The ratings dataset has', ratings['userId'].nunique(), 'unique users')
# Number of movies
print('The ratings dataset has', ratings['movieId'].nunique(), 'unique movies')
# Number of ratings
print('The ratings dataset has', ratings['rating'].nunique(), 'unique ratings')
# List of unique ratings
print('The unique ratings are', sorted(ratings['rating'].unique()))
Output
The ratings dataset has 610 unique users
The ratings dataset has 9724 unique movies
The ratings dataset has 10 unique ratings
The unique ratings are [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Next, let’s read in the movies data to get the movie names.
# Read in data
movies = pd.read_csv('ml-latest-small/movies.csv')
# Take a look at the data
movies.head()
The movies dataset has movieId, title, and genres.
Using movieId as the matching key, we append the movie information to the ratings dataset and name the result df. So now we have the movie title and movie rating in the same dataset!
# Merge ratings and movies datasets
df = pd.merge(ratings, movies, on='movieId', how='inner')
# Take a look at the data
df.head()
Output
Step 3: Exploratory Data Analysis (EDA)
In step 3, we need to filter the movies and keep only those with over 100 ratings for the analysis. This keeps the calculation within the memory available in Google Colab.
To do that, we first group the movies by title, count the number of ratings, and keep only the movies with more than 100 ratings.
The average ratings for the movies are calculated as well.
# Aggregate by movie
agg_ratings = df.groupby('title').agg(mean_rating = ('rating', 'mean'),
number_of_ratings = ('rating', 'count')).reset_index()
# Keep the movies with over 100 ratings
agg_ratings_GT100 = agg_ratings[agg_ratings['number_of_ratings']>100]
agg_ratings_GT100.info()
From the .info() output, we can see that there are 134 movies left.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 74 to 9615
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 134 non-null object
1 mean_rating 134 non-null float64
2 number_of_ratings 134 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 4.2+ KB
Let’s check the most popular movies and their ratings.
# Check popular movies
agg_ratings_GT100.sort_values(by='number_of_ratings', ascending=False).head()
Next, let’s use a jointplot to check the correlation between the average rating and the number of ratings.
We can see an upward trend in the scatter plot, suggesting that popular movies tend to get higher ratings.
The average rating distribution shows that most movies in the filtered dataset have an average rating of around 4.
The number of ratings distribution shows that most of these movies have fewer than 150 ratings.
# Visualization
sns.jointplot(x='mean_rating', y='number_of_ratings', data=agg_ratings_GT100)
To keep only the 134 movies with more than 100 ratings, we need to join the filtered movie list with the user-rating level dataframe. how='inner' and on='title' ensure that only the movies with more than 100 ratings are included.
# Merge data
df_GT100 = pd.merge(df, agg_ratings_GT100[['title']], on='title', how='inner')
df_GT100.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 19788 entries, 0 to 19787
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userId 19788 non-null int64
1 movieId 19788 non-null int64
2 rating 19788 non-null float64
3 timestamp 19788 non-null int64
4 title 19788 non-null object
5 genres 19788 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 1.1+ MB
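As a side note, the same filter can be expressed without a second merge by using a grouped transform. The sketch below is an alternative under assumed toy data (`toy` and the `min_ratings` cutoff are made up for illustration), not the approach used in this tutorial.

```python
import pandas as pd

# Hypothetical user-rating level data: one row per rating
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0, 1.0],
})

# Illustrative cutoff; the tutorial uses 100
min_ratings = 2

# For every row, the number of ratings its movie received
counts = toy.groupby("title")["rating"].transform("count")

# Keep only rows whose movie has more than min_ratings ratings
toy_popular = toy[counts > min_ratings]
print(toy_popular)
```

Only movie A has more than two ratings here, so only its three rows remain.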
After keeping only the movies with over 100 ratings, we have 597 users who rated 134 movies.
# Number of users
print('The ratings dataset has', df_GT100['userId'].nunique(), 'unique users')
# Number of movies
print('The ratings dataset has', df_GT100['movieId'].nunique(), 'unique movies')
# Number of ratings
print('The ratings dataset has', df_GT100['rating'].nunique(), 'unique ratings')
# List of unique ratings
print('The unique ratings are', sorted(df_GT100['rating'].unique()))
The ratings dataset has 597 unique users
The ratings dataset has 134 unique movies
The ratings dataset has 10 unique ratings
The unique ratings are [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Step 4: Create User-Movie Matrix
In step 4, we will transform the dataset into a matrix format. The rows of the matrix are users, and the columns of the matrix are movies. The value of the matrix is the user rating of the movie if there is a rating. Otherwise, it shows ‘NaN’.
# Create user-item matrix
matrix = df_GT100.pivot_table(index='userId', columns='title', values='rating')
matrix.head()
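To see what the pivot does, here is a minimal sketch on a made-up long-format table (`toy` is hypothetical). Note that pivot_table aggregates with the mean by default, so if the same user rated the same title more than once, the cell would hold the average of those ratings.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "title": ["A", "B", "A", "B"],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Rows become users, columns become movies, cells hold the rating
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")
print(toy_matrix)
```

User 2 never rated movie B and user 3 never rated movie A, so those two cells are NaN.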
Step 5: Data Normalization
Since some people tend to give higher ratings than others, we normalize the ratings by subtracting each user's average rating.
After normalization, the movies with a rating less than the user’s average rating get a negative value, and the movies with a rating more than the user’s average rating get a positive value.
# Normalize user-item matrix
matrix_norm = matrix.subtract(matrix.mean(axis=1), axis = 'rows')
matrix_norm.head()
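The following sketch shows the mean-centering on a small made-up matrix (`toy_matrix` is hypothetical). It relies on the fact that mean(axis=1) skips missing values, so each user's average is taken over the movies they actually rated.

```python
import numpy as np
import pandas as pd

# Hypothetical user-movie matrix with missing ratings
toy_matrix = pd.DataFrame(
    {"A": [4.0, 3.0, np.nan], "B": [5.0, np.nan, 2.0], "C": [3.0, 1.0, 4.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Average rating per user; NaN values are ignored
user_means = toy_matrix.mean(axis=1)

# Subtract each user's average from that user's row
toy_norm = toy_matrix.subtract(user_means, axis="index")
print(toy_norm)
```

User 1 has an average of 4, so the ratings 4, 5, and 3 become 0, 1, and -1. Missing ratings stay missing.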
Step 6: Identify Similar Users
There are different ways to measure similarities. Pearson correlation and cosine similarity are two widely used methods.
In this tutorial, we will calculate the user similarity matrix using Pearson correlation.
# User similarity matrix using Pearson correlation
user_similarity = matrix_norm.T.corr()
user_similarity.head()
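One thing to keep in mind is that corr() computes each pairwise correlation only from the movies both users rated. When two users share just two rated movies, the Pearson correlation is always exactly 1 or -1 (when it is defined at all). The sketch below illustrates this on a made-up matrix (`toy_norm` is hypothetical) and shows the min_periods parameter of corr() as one possible guard; the value 3 is an arbitrary assumption, not a setting used in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized user-movie matrix
toy_norm = pd.DataFrame(
    [[1.0, -1.0, 0.0, np.nan],
     [0.5, -1.0, 0.5, np.nan],
     [1.0, np.nan, -1.0, 0.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=["A", "B", "C", "D"],
)

# Pairwise Pearson correlation between users (rows)
sim = toy_norm.T.corr()

# Same, but require at least 3 co-rated movies per pair
sim_min3 = toy_norm.T.corr(min_periods=3)

print(sim.round(3))
print(sim_min3.round(3))
```

Users 1 and 3 share only movies A and C, so their correlation is 1.0 in the first matrix and NaN in the second. Users 1 and 2 share three movies and keep their correlation of about 0.866 in both.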
Those who are interested in using cosine similarity can refer to the code below. Since cosine_similarity does not accept missing values, we need to impute the missing values with 0s before the calculation.
# User similarity matrix using cosine similarity
user_similarity_cosine = cosine_similarity(matrix_norm.fillna(0))
user_similarity_cosine
array([[ 1. , 0. , 0. , ..., 0.14893867,
-0.06003146, 0.04528224],
[ 0. , 1. , 0. , ..., -0.04485403,
-0.25197632, 0.18886414],
[ 0. , 0. , 0. , ..., 0. ,
0. , 0. ],
...,
[ 0.14893867, -0.04485403, 0. , ..., 1. ,
0.14734568, 0.07931015],
[-0.06003146, -0.25197632, 0. , ..., 0.14734568,
1. , -0.14276787],
[ 0.04528224, 0.18886414, 0. , ..., 0.07931015,
-0.14276787, 1. ]])
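Because cosine_similarity returns a plain NumPy array, the user IDs are lost. A minimal sketch of wrapping the result back into a labeled dataframe is shown below, using a made-up matrix (`toy_norm` is hypothetical). It also shows why a row of all zeros can appear in the output: a user whose normalized ratings are all zero or missing becomes a zero vector after fillna(0), and scikit-learn returns 0 for its similarity to every user, including itself.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical normalized user-movie matrix; user 3 has a single rating equal to their mean
toy_norm = pd.DataFrame(
    {"A": [1.0, 1.0, np.nan], "B": [-1.0, np.nan, 0.0], "C": [np.nan, -1.0, np.nan]},
    index=pd.Index([1, 2, 3], name="userId"),
)

# Impute missing values with 0 before computing cosine similarity
filled = toy_norm.fillna(0)

# Wrap the array so rows and columns are labeled by user ID
cos = pd.DataFrame(
    cosine_similarity(filled),
    index=toy_norm.index,
    columns=toy_norm.index,
)
print(cos)
```

Users 1 and 2 get a similarity of 0.5, while user 3 gets 0 everywhere.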
Now let’s use user ID 1 as an example to illustrate how to find similar users.
We first need to exclude user ID 1 from the similar user list and decide the number of similar users.
# Pick a user ID
picked_userid = 1
# Remove picked user ID from the candidate list
user_similarity.drop(index=picked_userid, inplace=True)
# Take a look at the data
user_similarity.head()
In the user similarity matrix, the values range from -1 to 1, where -1 means opposite movie preferences and 1 means the same movie preferences.
n = 10 means we would like to pick the top 10 most similar users for user ID 1.
User-based collaborative filtering makes recommendations based on users with similar tastes, so we need to set a positive threshold. Here we set user_similarity_threshold to 0.3, meaning that a user must have a Pearson correlation coefficient greater than 0.3 to be considered a similar user.
After setting the number of similar users and the similarity threshold, we sort the user similarity values from highest to lowest, then print the most similar users’ IDs and their Pearson correlation values.
# Number of similar users
n = 10
# User similarity threshold
user_similarity_threshold = 0.3
# Get top n similar users
similar_users = user_similarity[user_similarity[picked_userid]>user_similarity_threshold][picked_userid].sort_values(ascending=False)[:n]
# Print out top n similar users
print(f'The similar users for user {picked_userid} are', similar_users)
The similar users for user 1 are userId
502 1.000000
9 1.000000
598 1.000000
550 1.000000
108 1.000000
401 0.942809
511 0.925820
366 0.872872
595 0.866025
154 0.866025
Name: 1, dtype: float64
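The selection logic can also be wrapped in a small helper that leaves the similarity matrix untouched, which makes it easier to repeat the lookup for other users. This is a sketch only: the function name `top_similar_users` and the toy matrix `toy_sim` are made up for illustration.

```python
import pandas as pd

def top_similar_users(similarity, user_id, n=10, threshold=0.3):
    """Return up to n users whose similarity to user_id exceeds threshold."""
    # Take the user's column and exclude the user themself
    scores = similarity[user_id].drop(index=user_id)
    # Pairs with NaN similarity fail the comparison and are dropped
    return scores[scores > threshold].sort_values(ascending=False).head(n)

# Hypothetical user similarity matrix
toy_sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, -0.5],
     [0.9, 1.0, 0.4, 0.1],
     [0.2, 0.4, 1.0, 0.6],
     [-0.5, 0.1, 0.6, 1.0]],
    index=[1, 2, 3, 4],
    columns=[1, 2, 3, 4],
)

print(top_similar_users(toy_sim, user_id=1))
print(top_similar_users(toy_sim, user_id=3, n=2))
```

For user 1, only user 2 passes the 0.3 threshold. For user 3, users 4 and 2 are returned in that order.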
Step 7: Narrow Down Item Pool
In step 7, we will narrow down the item pool by doing the following:
- Remove the movies that have been watched by the target user (user ID 1 in this example).
- Keep only the movies that similar users have watched.
To remove the movies watched by the target user, we keep only the row for userId=1 in the user-item matrix and remove the items with missing values.
# Movies that the target user has watched
picked_userid_watched = matrix_norm[matrix_norm.index == picked_userid].dropna(axis=1, how='all')
picked_userid_watched
To keep only the similar users’ movies, we keep the rows for the top 10 similar users and remove the movies with all missing values. A movie with all missing values is one that none of the similar users have watched.
# Movies that similar users watched. Remove movies that none of the similar users have watched
similar_user_movies = matrix_norm[matrix_norm.index.isin(similar_users.index)].dropna(axis=1, how='all')
similar_user_movies
Next, we will drop the movies that user ID 1 watched from the similar user movie list. errors='ignore' drops the columns that exist and skips the ones that do not, without raising an error.
# Remove the watched movie from the movie list
similar_user_movies.drop(picked_userid_watched.columns,axis=1, inplace=True, errors='ignore')
# Take a look at the data
similar_user_movies
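The whole narrowing step can be traced on a small made-up matrix (`toy_norm`, `target`, and `neighbors` are hypothetical names chosen for this sketch):

```python
import numpy as np
import pandas as pd

# Hypothetical normalized user-movie matrix
toy_norm = pd.DataFrame(
    {"A": [1.0, 0.5, np.nan, np.nan],
     "B": [np.nan, -0.5, 1.0, np.nan],
     "C": [np.nan, np.nan, np.nan, 2.0],
     "D": [-1.0, np.nan, -1.0, np.nan]},
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

target = 1          # the user we recommend to
neighbors = [2, 3]  # assumed to be the similar users

# Movies the target user has already watched
watched = toy_norm.loc[[target]].dropna(axis=1, how="all").columns

# Movies at least one neighbor watched, minus the ones the target watched
candidates = (
    toy_norm.loc[neighbors]
    .dropna(axis=1, how="all")
    .drop(columns=watched, errors="ignore")
)
print(candidates)
```

Movie C is dropped because neither neighbor watched it, and movies A and D are dropped because the target user already watched them. Only movie B remains in the pool.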
Step 8: Recommend Items
In step 8, we will decide which movies to recommend to the target user. The score for each movie is the average, over the similar users who rated it, of the user similarity score multiplied by the normalized movie rating, so users with higher similarity contribute more.
This code loops through items and users to get the item scores, ranks the scores from high to low, and picks the top 10 movies to recommend to user ID 1.
# A dictionary to store item scores
item_score = {}
# Loop through items
for i in similar_user_movies.columns:
    # Get the ratings for movie i
    movie_rating = similar_user_movies[i]
    # Create a variable to store the score
    total = 0
    # Create a variable to store the number of scores
    count = 0
    # Loop through similar users
    for u in similar_users.index:
        # If the movie has a rating from this user
        if pd.notna(movie_rating[u]):
            # Score is the user similarity score multiplied by the movie rating
            score = similar_users[u] * movie_rating[u]
            # Add the score to the total score for the movie so far
            total += score
            # Add 1 to the count
            count += 1
    # Get the average score for the item
    item_score[i] = total / count
# Convert dictionary to pandas dataframe
item_score = pd.DataFrame(item_score.items(), columns=['movie', 'movie_score'])
# Sort the movies by score
ranked_item_score = item_score.sort_values(by='movie_score', ascending=False)
# Select top m movies
m = 10
ranked_item_score.head(m)
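The nested loop above can also be written with vectorized pandas operations. The sketch below uses made-up inputs with the same names as the tutorial's variables (`similar_users` and `similar_user_movies` here are toy data). It first reproduces the loop's score, then shows a common variant that divides by the sum of the similarity scores instead of the count. That second version is an alternative, not what the loop above computes.

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores of two similar users
similar_users = pd.Series({2: 1.0, 3: 0.5})

# Hypothetical normalized ratings of those users for candidate movies
similar_user_movies = pd.DataFrame(
    {"B": [0.5, 1.0], "C": [np.nan, -1.0]},
    index=[2, 3],
)

# Weight each user's rating by that user's similarity score
weighted = similar_user_movies.mul(similar_users, axis=0)

# Same as the loop: average of the weighted ratings over users who rated the movie
movie_score = weighted.mean(axis=0)

# Variant: divide by the sum of similarities of the users who rated the movie
rated = similar_user_movies.notna().astype(float)
sim_sum = rated.mul(similar_users, axis=0).sum(axis=0)
movie_score_norm = weighted.sum(axis=0) / sim_sum

print(movie_score)
print(movie_score_norm)
```

In this toy data, movie B scores 0.5 with the loop's formula and about 0.67 with the similarity-normalized variant, while movie C scores -0.5 and -1.0 respectively.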
Step 9: Predict Scores (Optional)
If the goal is to choose the recommended items, having the rank of the items is enough. However, if the goal is to predict the user’s rating, we need to add the user’s average movie rating score back to the movie score.
# Average rating for the picked user
avg_rating = matrix[matrix.index == picked_userid].T.mean()[picked_userid]
# Print the average movie rating for user 1
print(f'The average movie rating for user {picked_userid} is {avg_rating:.2f}')
The average movie rating for user 1 is 4.39
The average movie rating for user 1 is 4.39, so we add 4.39 back to the movie score.
# Calculate the predicted rating
ranked_item_score['predicted_rating'] = ranked_item_score['movie_score'] + avg_rating
# Take a look at the data
ranked_item_score.head(m)
We can see that the top 10 recommended movies all have predicted ratings greater than 4.5.
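For reference, the score and the predicted rating computed in steps 8 and 9 can be written as formulas. The notation is introduced here for illustration and does not appear in the code: $u$ is the target user, $N_i$ is the set of similar users who rated movie $i$, $\mathrm{sim}(u, v)$ is the Pearson correlation between users $u$ and $v$, $r_{v,i}$ is user $v$'s rating of movie $i$, and $\bar{r}_v$ is user $v$'s average rating.

```latex
s_i = \frac{1}{|N_i|} \sum_{v \in N_i} \mathrm{sim}(u, v)\,\bigl(r_{v,i} - \bar{r}_v\bigr),
\qquad
\hat{r}_{u,i} = \bar{r}_u + s_i
```

Here $s_i$ corresponds to movie_score and $\hat{r}_{u,i}$ to predicted_rating. Since nothing in the formula bounds $\hat{r}_{u,i}$, a predicted rating can in principle fall outside the 0.5 to 5 rating scale.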