Sanjeev Yadav

Summary

The web content discusses a project using Apache Spark to analyze Spotify's user data in India to predict churn and optimize user retention through machine learning, with a focus on scalable data manipulation and feature engineering.

Abstract

The article outlines a data science project aimed at predicting user churn for Spotify in India, leveraging Apache Spark for handling big data. It details the creation of user-level features from interaction data to feed into machine learning models. The project is part of Udacity's Data Scientist Nanodegree program and involves processing a subset of a large dataset (12 GB on Amazon S3) to identify at-risk users for targeted incentives. The author has shared their GitHub repository for reproducibility and has emphasized the importance of the F1 score over accuracy due to data imbalance. Despite challenges such as imbalanced data and the learning curve of PySpark, the project demonstrates the potential of Spark in scaling data science tasks for large datasets.

Opinions

  • The author expresses enthusiasm for the data pre-processing phase of the project.
  • The author finds data visualization easier after converting Spark data frames to Pandas data frames using the toPandas() method.
  • There is a note of dissatisfaction with the official PySpark documentation compared to Pandas.
  • The author indicates a compromise in using common machine learning models over advanced ones for the sake of mastering Spark.
  • The author acknowledges the high cost of running complete analyses on large datasets using Amazon EMR, which influenced their decision to work with a smaller dataset for the project.
  • The author appreciates Udacity and Udacity India for providing a challenging project and praises the instructors' clarity in teaching functional programming in Python.

Spark for Big Data


Overview

Spotify is coming to India. It will face plenty of competition from local rivals, but one company-specific factor that decides its long-term growth is whether it can bring in new customers faster than it loses existing ones, that is, whether its growth rate exceeds its churn rate.

So what is this churn rate anyway?

The churn rate is the percentage of subscribers to a service who discontinue their subscriptions within a given time period. For a company to expand its clientele, its growth rate, as measured by the number of new customers, must exceed its churn rate.
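To make the definition concrete, here is a minimal sketch of the arithmetic. The function name and the subscriber numbers are made up for illustration; they do not come from the Spotify data.

```python
def churn_rate(subscribers_at_start: int, subscribers_lost: int) -> float:
    """Percentage of the subscribers at the start of a period who left during it."""
    if subscribers_at_start <= 0:
        raise ValueError("subscribers_at_start must be positive")
    return 100.0 * subscribers_lost / subscribers_at_start


# Hypothetical month: 1000 subscribers at the start, 50 leave, 80 new ones join.
start, lost, gained = 1000, 50, 80

monthly_churn = churn_rate(start, lost)    # share of existing subscribers lost
monthly_growth = 100.0 * gained / start    # new customers relative to the same base
is_expanding = monthly_growth > monthly_churn

print(f"churn {monthly_churn:.1f}%, growth {monthly_growth:.1f}%, expanding: {is_expanding}")
```

In this made-up month the customer base expands because the growth rate is higher than the churn rate.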

In this article, we will be using Spark to perform scalable data manipulation and machine learning tasks for a subset (128 MB on my local machine) of a large data set (12 GB on Amazon S3).

To follow along in the coming sections, see my GitHub repository for directions and instructions.

Motivation


All this data contains the key insights for keeping the users happy and helping the business thrive. It’s the responsibility of the data team to predict which users are at risk of churning: either downgrading from the premium to the free tier or cancelling their service altogether. If we can accurately identify these users before they leave, then the business can offer them discounts and incentives, potentially saving millions in revenue.

This is the Capstone project that I chose for Udacity’s Data Scientist Nanodegree program.

I have been using scikit-learn for more than a year now, but with Spark and its functional programming paradigm, we can scale our model for Big Data.

Data Source

Every time a user interacts with the service, whether by playing songs, logging out, liking a song with a thumbs up, hearing an ad, or downgrading their service, it generates data.

You can download the data set from GitHub (128 MB) and learn more about the features from the metadata file. If you want to extend this analysis and deploy your own Spark cluster on Amazon EMR (Elastic MapReduce), comment below and I will share the S3 bucket for the full data set (12 GB).

Feature Engineering

The following user-level features were created from the existing ones:

  1. Dummy columns for gender (male listeners outnumber female listeners)
  2. Songs per user
  3. Sessions per user
  4. Songs per session for each user
  5. Average time per session for each user
  6. Most recent status of the user: Free or Paid
  7. The proportion of visits to each page for each user
  8. Number of artists the user has listened to
  9. Region-level location information

To know more about the features mentioned above, see the metadata file.

Modeling

This is a binary classification task. I have used three classification algorithms from PySpark’s ml.classification library:

  1. Logistic Regression
  2. Random Forest Classifier (an ensemble method)
  3. Gradient-Boosted Tree Classifier (also an ensemble method)

Other models are still in development.

Improvement

Since the data is highly imbalanced, accuracy is not the right metric for evaluation. Instead, I used the F1 score and performed undersampling to further optimize it.

GBTClassifier provided a fairly good F1 score of 0.64 after undersampling. This post explains why you should use F1 score for evaluating your model on an imbalanced data set.
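A small worked example shows why accuracy misleads on imbalanced data. The numbers are invented: 100 users of whom 10 churn, a "lazy" model that predicts that nobody churns, and a second model that catches 6 of the 10 churners at the cost of 4 false alarms.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# 10 churned users (label 1) followed by 90 retained users (label 0).
y_true = [1] * 10 + [0] * 90

# Model A never predicts churn.
lazy_pred = [0] * 100

# Model B finds 6 churners, misses 4, and raises 4 false alarms.
better_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

print("lazy:  ", accuracy(y_true, lazy_pred), f1_score(y_true, lazy_pred))
print("better:", accuracy(y_true, better_pred), f1_score(y_true, better_pred))
```

The lazy model reaches 90% accuracy while identifying no churners at all, so its F1 score for the churn class is 0. The second model is only slightly more accurate (92%) but has an F1 score of 0.6, which reflects how useful it actually is.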

Analyzing Important Features

The featureImportances attribute of the fitted GBT model (a GBTClassificationModel from pyspark.ml.classification) shows the relevance of each feature used in training.

pSubmitDowngrade is at the top because it is somewhat correlated with the target label.

Features that start with “p” are different types of pages the user visited.

Conclusion

I enjoyed the data pre-processing part of the project. For the data visualisation part, instead of using “for” loops to build the arrays for our bar charts, I first converted the Spark data frame to a Pandas data frame using the toPandas() method. Data visualisation is easier from then on.

The shape of our final data is just 225 x 32. This is too small to generalise our model. Just 225 users for a streaming company? That’s nothing. The full 12 GB data set might provide more useful results. If you want more statistically significant results, then I suggest you run this notebook on Amazon EMR for the 12 GB data set. I have skipped that part for now because it costs $30 for one week.

Some of the challenges which I faced in this project are:

  • The official PySpark documentation is harder to work with than that of Pandas
  • For the sake of mastering Spark, I used only the most common machine learning classification models instead of more advanced ones
  • Highly imbalanced data led to a poor F1 score
  • If you run this notebook without any changes, then it will take around an hour to run completely

Acknowledgements

Thanks to Udacity and Udacity India for providing such a challenging project to work on. At first, I was vexed with functional programming in Python, but the instructors were very clear in their approach. Now I am looking for more projects that are built atop Spark.
