Maggie @DataStoryteller

Summary

A runner living in Chicago has developed a personalized Recommender System to suggest the optimal running attire based on current weather conditions, utilizing data logging, Python programming, and machine learning algorithms.

Abstract

The author, an avid runner from Chicago, has created a Recommender System tailored to suggest appropriate running attire for varying weather conditions. This system was developed in response to the lack of specificity in existing online recommendations. By meticulously logging personal running data, including clothing layers, weather conditions, and subjective comfort levels, the author amassed a dataset for training the model. The process involved data collection, cleaning, exploration, and visualization, followed by implementing a Nearest Neighbors algorithm using Python in a Jupyter Notebook. The model takes into account factors such as temperature, wind, sky conditions, and more, to predict the ideal clothing ensemble for a run. While the current implementation requires manual data logging and running a Jupyter Notebook, the author plans to automate the process in the future.

Opinions

  • The author prefers running outside in all weather conditions and finds existing recommendations too generic.
  • They advocate for improving data collection methods over post-collection cleaning.
  • The author chose not to normalize their data, allowing certain factors like temperature to have a more significant impact on the model's predictions.
  • There is an acknowledgment that running a Jupyter Notebook for each run is inconvenient, indicating a need for further automation in the system.
  • The author values community feedback and invites questions, comments, or feedback on their project.

Data Science Project Tutorial: How to Build a Recommender System to Suggest What to Wear on the Run

I’m a runner. I live in Chicago, so we get all four seasons, sometimes within the span of one week. I hate the treadmill, and prefer to run outside as much as possible. Yes, that is me above, running a 5K race when it was 3°F.

Every winter, I seem to forget how to dress for the plunging temps, even though I've done it for the past 10+ years. There are websites out there like DressMyRun.com, but they offer generic recommendations on the number of layers without specifics on the type of layers, thickness, or material. I want to know which items from my own running wardrobe work for different types of weather conditions.

So I created my own Recommender System.

If you're not familiar with Recommender Systems, they use a machine learning model to rank or sort predictions. It's how Netflix recommends movies and TV shows, or how Amazon recommends products.

Read on for how I created my own Recommender System to suggest what I should wear based on the current weather conditions.

Note: This tutorial assumes that you are familiar with:

Step One: Log What I Wear

A Recommender System needs data to make recommendations. So, I needed to create my own data set of what I wore for each run, along with the weather conditions.

I kept my log in a Google Spreadsheet, but you could also use a Microsoft Excel file. I kept track of:

  • Temp = The air temperature during my run.
  • Feels Like = The “feels like” temperature during my run.
  • Wind = The wind speed during my run.
  • Sky = "Sunny" or "Cloudy or Night"
  • Precip = "None," "Light," or "Heavy"
  • Mi = The distance I ran in miles
  • Type = The type of run: "Easy," "Long," "Tempo," "Intervals," etc.
  • How I Felt = "Perfect," "Warm," or "Cold"
  • Top 1 and Top 2 = The layers I wore on top (shirts, jackets)
  • Bottom 1 and Bottom 2 = The layers I wore on bottom (tights, pants)
  • Head Warmth, Head Hat, Hands = Whether I wore a headband, hat, or gloves

How much data do you need? Generally, the more data you have to fit your model, the more accurate your output will be. However, I am just one person and can only run so much. My Recommender System will improve over time as I run more and log more data.

Step Two: Import and Clean the Data

I used Python in a Jupyter Notebook to build my Recommender System. You could use Python or R in another environment depending on what you are more comfortable with.

The first time I imported my data, I did a lot of cleaning, because it wasn’t organized as I have it listed above. I realized it was easier to change the way I logged the data than it was to do a lot of cleaning via Python code. This is an important lesson! Better to improve how you collect the data at the source than to try to clean it after it’s been collected and stored. (We are not always so lucky that we can change the way the data is collected though.)
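
As a rough sketch, the import can be as simple as the following (the filename run_log.csv is just a placeholder for however you export your spreadsheet):

```python
import pandas as pd

# Placeholder filename: export the Google Sheet (or Excel file) as a CSV first.
df = pd.read_csv("run_log.csv")

# Quick sanity checks on what was imported.
print(df.shape)    # number of runs logged x number of columns
print(df.dtypes)   # confirm Temp, Wind, Mi, etc. imported as numbers
print(df.head())   # eyeball the first few rows
```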

What I typically look at when cleaning data:

  • Is the data within each column in the same format?
  • Are categorical values spelled the same way, with consistent use of upper- and lowercase letters?
  • Are there missing values, and if so, how should they be handled?
  • Are there categorical variables that need to be numeric for the sake of your model? Distance-based models, like KNN, need all numeric values, whereas tree-based models can handle categorical variables. If you need to transform data, convert them to boolean values using Dummy variables (example below) or One-Hot Encoding.
  • Are there any outlier values? How should they be handled?
  • Does your data need to be normalized? Do you want all numeric values to be on the same scale? (Such as 0 to 1 or -1 to 1.)
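
For the dummy-variable example mentioned above, here's a minimal sketch using pandas (the column names follow my log; the exact code in my notebook may differ):

```python
import pandas as pd

# Convert categorical weather columns into 0/1 dummy columns so that a
# distance-based model like Nearest Neighbors can work with them.
df = pd.get_dummies(df, columns=["Sky", "Precip"])

# "Sky" becomes columns like "Sky_Sunny", and "Precip" becomes
# "Precip_None", "Precip_Light", and "Precip_Heavy".
print(df.columns.tolist())
```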

I opted not to normalize my data, even though for distance-based models, like Nearest Neighbors, you typically do want to normalize your data. Keeping my data as-is means that temperature will have a bigger impact on calculating predictions than wind or any values that were converted to booleans, since temperature ranges from 20 to 50, whereas wind ranges from 5 to 15 and boolean values are 0 or 1. That was acceptable to me for the sake of how I want my recommender to weigh certain values, but you might want to try a model with normalized data and see how that changes the predictions.
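
If you do want to experiment with normalization, a quick sketch using scikit-learn's MinMaxScaler (not what I used, since I kept my raw values) looks like this:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale the numeric weather columns to a 0-1 range so temperature, wind,
# and the dummy columns all contribute on roughly the same scale.
weather_cols = ["Temp", "Feels Like", "Wind"]

scaler = MinMaxScaler()
df_scaled = df.copy()
df_scaled[weather_cols] = scaler.fit_transform(df_scaled[weather_cols])
```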

Step Three: Explore Your Data

It’s important to familiarize yourself with your dataset before you start building your model.

[Figure: Temperature]

These aren’t all of the visualizations that I looked at, but just a snapshot of what to consider for each numeric variable:

  • Distribution via Histograms and Boxplots
  • Correlations via Heatmap and scatterplots
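
As a rough sketch of what those plots look like in code (using pandas, matplotlib, and seaborn; your exact plots will vary):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a single numeric column.
df["Temp"].plot(kind="hist", title="Temperature")
plt.show()

df["Temp"].plot(kind="box")
plt.show()

# Correlations between the numeric columns.
numeric_cols = ["Temp", "Feels Like", "Wind", "Mi"]
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```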

Additionally, look at descriptive statistics:
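
In pandas, that can be as simple as:

```python
# Count, mean, standard deviation, min, quartiles, and max for each numeric column.
print(df.describe())
```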

After doing data exploration and visualization, you may find more things to clean or transform. You might realize some columns have outliers and, depending on the model you use, you might want to transform those variables. Again, consider whether you're using a distance-based or tree-based model and how an outlier might impact predictions.

You might also decide to drop some variables from your dataset; if two columns are very strongly correlated, it may not make sense to keep both. Additionally, you might want to create new calculated fields, such as a rate or percentage.

Step Four: Fit your model and generate recommendations

Finally, it’s time to actually do the modeling. I used the Nearest Neighbors algorithm from the scikit-learn package.

Implementing a model using a package like scikit-learn may seem deceptively easy, but there is a lot of math happening when you run the code. This is why it is important to be comfortable with various mathematical concepts. At a very high level, a nearest neighbors algorithm is calculating distances between the points in your dataset to find which observations are closest — or the “nearest neighbor.” The way the recommender system works is by taking a set of values that you input, finding the mathematically “nearest neighbor” within your dataset, and then using the index (row number) of that neighbor to return the values you want to recommend.
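
Here's a simplified sketch of that flow (the column names follow my log plus the dummy columns from the cleaning step, and the single-neighbor setting is illustrative; the full version is in my notebook):

```python
from sklearn.neighbors import NearestNeighbors

# 1. Isolate the weather-related columns used to find the nearest neighbor.
weather_cols = ["Temp", "Feels Like", "Wind",
                "Sky_Sunny", "Precip_Light", "Precip_Heavy"]
X = df[weather_cols]

# 2. Fit the NearestNeighbors algorithm to this isolated dataset.
nn = NearestNeighbors(n_neighbors=1)
nn.fit(X)

# 3. The current weather conditions, in the same order as weather_cols.
current_weather = [[30, 25, 10, 1, 0, 0]]
distances, indices = nn.kneighbors(current_weather)
nearest_idx = indices[0][0]

# 4. Look up what I wore on the most similar past run and print it.
outfit_cols = ["Top 1", "Top 2", "Bottom 1", "Bottom 2",
               "Head Warmth", "Head Hat", "Hands"]
print("You should wear:")
print(df.iloc[nearest_idx][outfit_cols].to_string())
```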

In this case, what we have done:

  1. Isolate the data necessary to find the nearest neighbor — in this case, the weather-related values.
  2. Fit the NearestNeighbors algorithm to this isolated dataset.
  3. Run the algorithm on our inputted data (the current weather conditions) to find the nearest neighbor and return the index (row number).
  4. Find the values I want to recommend from that index (row), and print them, which gives us the "You should wear" output.

You can view the entire notebook via GitHub.

Obviously, running a Jupyter Notebook each time I want to go for a run isn’t very convenient, so a future iteration will be to automate this process a bit more, but that’s a future tutorial.

What do you think? Any questions, comments, or feedback?

Check out the rest of my data analytics & career resources and sign up for my weekly newsletter with tips for a career in analytics.
