Recommendation System: User-Based Collaborative Filtering

Python user-user collaborative filtering to recommend items based on user similarities

User-based collaborative filtering is also called user-user collaborative filtering. It is a type of recommendation system algorithm that uses user similarity to make product recommendations.

In this tutorial, we will talk about

What is user-based (user-user) collaborative filtering?
How to create a user-product matrix?
How to process data for user-based collaborative filtering?
How to identify similar users?
How to narrow down the items pool?
How to rank items for the recommendation?
How to predict the rating score?

Resources for this post:

Video tutorial on YouTube
Python code is at the end of the post. Click here for the notebook.
More video tutorials on recommendation system
More blog posts on recommendation system

Let’s get started!

Step 0: User-Based Collaborative Filtering Recommendation Algorithm

Firstly, let’s understand how User-based collaborative filtering works.

User-based collaborative filtering makes recommendations based on user-product interactions in the past. The assumption behind the algorithm is that similar users like similar products.

User-based collaborative filtering algorithm usually has the following steps:

Find similar users based on interactions with common items.
Identify the items rated high by similar users but have not been exposed to the active user of interest.
Calculate the weighted average score for each item.
Rank items based on the score and pick the top n items to recommend.

This graph illustrates how user-based collaborative filtering works using a simplified example.

Ms. Blond likes apples. Ms. Black likes watermelon and pineapple. Ms. Purple likes watermelon and grapes.
Because Ms. Black and Ms. Purple like the same fruit, watermelon, they are similar users.
Since Ms. Black likes pineapple and Ms. Purple has not been exposed to pineapple yet, the recommendation system recommends pineapple to Ms. purple.

Join Medium with my referral link - Amy @GrabNGoInfo

Read every story from Amy (and thousands of other writers on Medium). Your membership fee directly supports Amy and…

medium.com

Step 1: Import Python Libraries

In the first step, we will import Python libraries pandas, numpy, and scipy.stats. These three libraries are for data processing and calculations.

We also imported seaborn for visualization and cosine_similarity for calculating similarity scores.

# Data processing
import pandas as pd
import numpy as np
import scipy.stats

# Visualization
import seaborn as sns

# Similarity
from sklearn.metrics.pairwise import cosine_similarity

Step 2: Download And Read Data

This tutorial uses the movielens dataset. This dataset contains actual user ratings of movies.

In step 2, we will follow the steps below to get the datasets:

Go to https://grouplens.org/datasets/movielens/
Download the 100k dataset with the file name “ml-latest-small.zip”
Unzip “ml-latest-small.zip”
Copy the “ml-latest-small” folder to your project folder

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.

Please check out Google Colab Tutorial for Beginners for details about using Google Colab for data science projects.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Change directory
import os
os.chdir("drive/My Drive/contents/recommendation_system")

# Print out the current directory
!pwd

Output

Mounted at /content/drive
/content/drive/My Drive/contents/recommendation_system

There are multiple datasets in the 100k movielens folder. For this tutorial, we will use two ratings and movies.

Now let’s read the rating data.

# Read in data
ratings=pd.read_csv('ml-latest-small/ratings.csv')

# Take a look at the data
ratings.head()

There are four columns in the ratings dataset, userID, movieID, rating, and timestamp.

The dataset has over 100k records, and there is no missing data.

# Get the dataset information
ratings.info()

Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835

Data columns (total 4 columns):
#   Column     Non-Null Count   Dtype
---  ------     --------------   -----
0   userId     100836 non-null  int64
1   movieId    100836 non-null  int64
2   rating     100836 non-null  float64
3   timestamp  100836 non-null  int64

dtypes: float64(1), int64(3)
memory usage: 3.1 MB

The 100k ratings are from 610 users on 9724 movies. The rating has ten unique values from 0.5 to 5.

# Number of users
print('The ratings dataset has', ratings['userId'].nunique(), 'unique users')

# Number of movies
print('The ratings dataset has', ratings['movieId'].nunique(), 'unique movies')

# Number of ratings
print('The ratings dataset has', ratings['rating'].nunique(), 'unique ratings')

# List of unique ratings
print('The unique ratings are', sorted(ratings['rating'].unique()))

Output

The ratings dataset has 610 unique users
The ratings dataset has 9724 unique movies
The ratings dataset has 10 unique ratings
The unique ratings are [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

Next, let’s read in the movies data to get the movie names.

# Read in data
movies = pd.read_csv('ml-latest-small/movies.csv')

# Take a look at the data
movies.head()

The movies dataset has movieID, title, and genres.

Using movieID as the matching key, we appended movie information to the rating dataset and named it df. So now we have the movie title and movie rating in the same dataset!

# Merge ratings and movies datasets
df = pd.merge(ratings, movies, on='movieId', how='inner')

# Take a look at the data
df.head()

Output

Step 3: Exploratory Data Analysis (EDA)

In step 3, we need to filter the movies and keep only those with over 100 ratings for the analysis. This is to make the calculation manageable by the Google Colab memory.

To do that, we first group the movies by title, count the number of ratings, and keep only the movies with greater than 100 ratings.

The average ratings for the movies are calculated as well.

# Aggregate by movie
agg_ratings = df.groupby('title').agg(mean_rating = ('rating', 'mean'),
                                                number_of_ratings = ('rating', 'count')).reset_index()

# Keep the movies with over 100 ratings
agg_ratings_GT100 = agg_ratings[agg_ratings['number_of_ratings']>100]
agg_ratings_GT100.info()

From the .info() output, we can see that there are 134 movies left.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 74 to 9615
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   title              134 non-null    object 
 1   mean_rating        134 non-null    float64
 2   number_of_ratings  134 non-null    int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 4.2+ KB

Let’s check the most popular movies and their ratings.

# Check popular movies
agg_ratings_GT100.sort_values(by='number_of_ratings', ascending=False).head()

Next, let’s use a jointplot to check the correlation between the average rating and the number of ratings.

We can see an upward trend from the scatter plot, showing that popular movies get higher ratings.

The average rating distribution shows that most movies in the dataset have an average rating of around 4.

The number of rating distribution shows that most movies have less than 150 ratings.

# Visulization
sns.jointplot(x='mean_rating', y='number_of_ratings', data=agg_ratings_GT100)

To keep only the 134 movies with more than 100 ratings, we need to join the movie with the user-rating level dataframe.

how='inner' and on='title' ensure that only the movies with more than 100 ratings are included.

# Merge data
df_GT100 = pd.merge(df, agg_ratings_GT100[['title']], on='title', how='inner')
df_GT100.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19788 entries, 0 to 19787
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   userId     19788 non-null  int64  
 1   movieId    19788 non-null  int64  
 2   rating     19788 non-null  float64
 3   timestamp  19788 non-null  int64  
 4   title      19788 non-null  object 
 5   genres     19788 non-null  object 
dtypes: float64(1), int64(3), object(2)
memory usage: 1.1+ MB

After filtering the movies with over 100 ratings, we have 597 users that rated 134 movies.

# Number of users
print('The ratings dataset has', df_GT100['userId'].nunique(), 'unique users')

# Number of movies
print('The ratings dataset has', df_GT100['movieId'].nunique(), 'unique movies')

# Number of ratings
print('The ratings dataset has', df_GT100['rating'].nunique(), 'unique ratings')

# List of unique ratings
print('The unique ratings are', sorted(df_GT100['rating'].unique()))

The ratings dataset has 597 unique users
The ratings dataset has 134 unique movies
The ratings dataset has 10 unique ratings
The unique ratings are [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]

Step 4: Create User-Movie Matrix

In step 4, we will transform the dataset into a matrix format. The rows of the matrix are users, and the columns of the matrix are movies. The value of the matrix is the user rating of the movie if there is a rating. Otherwise, it shows ‘NaN’.

# Create user-item matrix
matrix = df_GT100.pivot_table(index='userId', columns='title', values='rating')
matrix.head()

Step 5: Data Normalization

Since some people tend to give a higher rating than others, we normalize the rating by extracting the average rating of each user.

After normalization, the movies with a rating less than the user’s average rating get a negative value, and the movies with a rating more than the user’s average rating get a positive value.

# Normalize user-item matrix
matrix_norm = matrix.subtract(matrix.mean(axis=1), axis = 'rows')
matrix_norm.head()

Step 6: Identify Similar Users

There are different ways to measure similarities. Pearson correlation and cosine similarity are two widely used methods.

In this tutorial, we will calculate the user similarity matrix using Pearson correlation.

# User similarity matrix using Pearson correlation
user_similarity = matrix_norm.T.corr()
user_similarity.head()

Those who are interested in using cosine similarity can refer to this code. Since cosine_similarity does not take missing values, we need to impute the missing values with 0s before the calculation.

# User similarity matrix using cosine similarity
user_similarity_cosine = cosine_similarity(matrix_norm.fillna(0))
user_similarity_cosine

array([[ 1.        ,  0.        ,  0.        , ...,  0.14893867,
        -0.06003146,  0.04528224],
       [ 0.        ,  1.        ,  0.        , ..., -0.04485403,
        -0.25197632,  0.18886414],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.14893867, -0.04485403,  0.        , ...,  1.        ,
         0.14734568,  0.07931015],
       [-0.06003146, -0.25197632,  0.        , ...,  0.14734568,
         1.        , -0.14276787],
       [ 0.04528224,  0.18886414,  0.        , ...,  0.07931015,
        -0.14276787,  1.        ]])

Now let’s use user ID 1 as an example to illustrate how to find similar users.

We first need to exclude user ID 1 from the similar user list and decide the number of similar users.

# Pick a user ID
picked_userid = 1

# Remove picked user ID from the candidate list
user_similarity.drop(index=picked_userid, inplace=True)

# Take a look at the data
user_similarity.head()

In the user similarity matrix, the values range from -1 to 1, where -1 means opposite movie preference and 1 means same movie preference.

n = 10 means we would like to pick the top 10 most similar users for user ID 1.

The user-based collaborative filtering makes recommendations based on users with similar tastes, so we need to set a positive threshold. Here we set the user_similarity_threshold to be 0.3, meaning that a user must have a Pearson correlation coefficient of at least 0.3 to be considered as a similar user.

After setting the number of similar users and similarity threshold, we sort the user similarity value from the highest and lowest, then printed out the most similar users’ ID and the Pearson correlation value.

# Number of similar users
n = 10

# User similarity threashold
user_similarity_threshold = 0.3

# Get top n similar users
similar_users = user_similarity[user_similarity[picked_userid]>user_similarity_threshold][picked_userid].sort_values(ascending=False)[:n]

# Print out top n similar users
print(f'The similar users for user {picked_userid} are', similar_users)

The similar users for user 1 are userId
502    1.000000
9      1.000000
598    1.000000
550    1.000000
108    1.000000
401    0.942809
511    0.925820
366    0.872872
595    0.866025
154    0.866025
Name: 1, dtype: float64

Step 7: Narrow Down Item Pool

In step 7, we will narrow down the item pool by doing the following:

Remove the movies that have been watched by the target user (user ID 1 in this example).
Keep only the movies that similar users have watched.

To remove the movies watched by the target user, we keep only the row for userId=1 in the user-item matrix and remove the items with missing values.

# Movies that the target user has watched
picked_userid_watched = matrix_norm[matrix_norm.index == picked_userid].dropna(axis=1, how='all')
picked_userid_watched

To keep only the similar users’ movies, we keep the user IDs in the top 10 similar user lists and remove the film with all missing values. All missing value for a movie means that none of the similar users have watched the movie.

# Movies that similar users watched. Remove movies that none of the similar users have watched
similar_user_movies = matrix_norm[matrix_norm.index.isin(similar_users.index)].dropna(axis=1, how='all')
similar_user_movies

Next, we will drop the movies that user ID 1 watched from the similar user movie list. errors='ignore' drops columns if they exist without giving an error message.

# Remove the watched movie from the movie list
similar_user_movies.drop(picked_userid_watched.columns,axis=1, inplace=True, errors='ignore')

# Take a look at the data
similar_user_movies

Step 8: Recommend Items

In step 8, we will decide which movie to recommend to the target user. The recommended items are determined by the weighted average of user similarity score and movie rating. The movie ratings are weighted by the similarity scores, so the users with higher similarity get higher weights.

This code loops through items and users to get the item score, rank the score from high to low and pick the top 10 movies to recommend to user ID 1.

# A dictionary to store item scores
item_score = {}

# Loop through items
for i in similar_user_movies.columns:
  # Get the ratings for movie i
  movie_rating = similar_user_movies[i]
  # Create a variable to store the score
  total = 0
  # Create a variable to store the number of scores
  count = 0
  # Loop through similar users
  for u in similar_users.index:
    # If the movie has rating
    if pd.isna(movie_rating[u]) == False:
      # Score is the sum of user similarity score multiply by the movie rating
      score = similar_users[u] * movie_rating[u]
      # Add the score to the total score for the movie so far
      total += score
      # Add 1 to the count
      count +=1
  # Get the average score for the item
  item_score[i] = total / count

# Convert dictionary to pandas dataframe
item_score = pd.DataFrame(item_score.items(), columns=['movie', 'movie_score'])
    
# Sort the movies by score
ranked_item_score = item_score.sort_values(by='movie_score', ascending=False)

# Select top m movies
m = 10
ranked_item_score.head(m)

Step 9: Predict Scores (Optional)

If the goal is to choose the recommended items, having the rank of the items is enough. However, if the goal is to predict the user’s rating, we need to add the user’s average movie rating score back to the movie score.

# Average rating for the picked user
avg_rating = matrix[matrix.index == picked_userid].T.mean()[picked_userid]

# Print the average movie rating for user 1
print(f'The average movie rating for user {picked_userid} is {avg_rating:.2f}')

The average movie rating for user 1 is 4.39

The average movie rating for user 1 is 4.39, so we add 4.39 back to the movie score.

# Calcuate the predicted rating
ranked_item_score['predicted_rating'] = ranked_item_score['movie_score'] + avg_rating

# Take a look at the data
ranked_item_score.head(m)

We can see that the top 10 recommended movies all have predicted ratings greater than 4.5.

Summary

In this tutorial, we went over how to build a user-based collaborative filtering recommendation system. You learned

What is user-based (user-user) collaborative filtering?
How to create a user-product matrix?
How to process data for user-based collaborative filtering?
How to identify similar users?
How to narrow down the items pool?
How to rank items for the recommendation?
How to predict the rating score?

More tutorials are available on GrabNGoInfo YouTube Channel and GrabNGoInfo.com

Recommendation System: User-Based Collaborative Filtering

Step 0: User-Based Collaborative Filtering Recommendation Algorithm

Join Medium with my referral link - Amy @GrabNGoInfo

Read every story from Amy (and thousands of other writers on Medium). Your membership fee directly supports Amy and…

Step 1: Import Python Libraries

Step 2: Download And Read Data

Step 3: Exploratory Data Analysis (EDA)

Step 4: Create User-Movie Matrix

Step 5: Data Normalization

Step 6: Identify Similar Users

Step 7: Narrow Down Item Pool

Step 8: Recommend Items

Step 9: Predict Scores (Optional)

Summary

Recommended Tutorials

Join Medium with my referral link - Amy GrabNGoInfo

As a Medium member, a portion of your membership fee goes to writers you read, and you get full access to every story…