Patrick Brus

Summary

This article presents a full data science portfolio project that performs Retail Data Analytics using Amazon Web Services (AWS) and various machine learning algorithms, with the goal of predicting department-wide weekly sales for a store to optimize manufacturing processes and increase income while lowering costs.

Abstract

The article introduces a data science project aimed at predicting weekly sales for retail stores using machine learning techniques. The project uses the RDA dataset from Kaggle, which contains historical sales data from 45 stores. The dataset includes features such as store type, size, and sales data collected from 2010 to 2012. The project explores various machine learning algorithms, including Linear Regression, Decision Tree Regressor, Random Forest Regressor, and XGBoost Regressor, to find the best-performing model for predicting weekly sales. The article also discusses data preprocessing, training, and evaluation of the models, with the XGBoost Regressor being the best-performing model.

Opinions

  • The article presents a comprehensive data science project that uses machine learning techniques to predict weekly sales for retail stores.
  • The project uses the RDA dataset from Kaggle, which contains historical sales data from 45 stores, and explores various machine learning algorithms to find the best-performing model.
  • The article highlights the importance of data preprocessing and training in building accurate machine learning models.
  • The XGBoost Regressor is identified as the best-performing model for predicting weekly sales, with an R-Square score more than ten times larger than that of the benchmark model.
  • The article emphasizes the need to consider potential outliers in the prediction, as the range of weekly sales is very large.
  • The project aims to optimize manufacturing processes and increase income while lowering costs for retail stores.
  • The article provides a detailed explanation of the project, including data exploration, algorithms and techniques, implementation, and results, making it a valuable resource for data science enthusiasts and professionals.

Retail Data Analytics

A Data Science Portfolio Project using AWS

Photo by Lukas Blazek on Unsplash

Introduction

In this article I want to present a full data science portfolio project. In this project I perform Retail Data Analytics using Amazon Web Services (AWS) and different machine learning algorithms. The full code, including a project proposal and a final project report, can be found in my GitHub repository.

Definition

Retail Data Analytics (RDA) is used by shops nowadays in order to better predict the number of articles that might get sold and therefore to better estimate how many articles should be produced. This is very important, because the number of sold articles can vary largely during the year. For example, people tend to buy more things before Christmas than during a normal, non-holiday week. This can easily be seen in Amazon's quarterly revenue on Statista. The quarterly revenue of Amazon is always the largest in the fourth quarter, which indicates that people consume more during the fourth quarter than during the others. This is clear due to the fact that Christmas falls within the fourth quarter, as does Black Friday, which also leads to large worldwide consumption. If a shop has too few products before Christmas, potential income would get lost. But if a shop has too many products, too much storage would be required, and storage costs money, so the company would again lose money. RDA can therefore be used to try to optimize the production of products, such that there is always an optimal amount available.

Problem Statement

The goal of this project is to predict the department-wide weekly sales for a store. This should then help to optimize the manufacturing process and therefore to increase income while lowering costs. It should be possible to feed in past sales data from a department and get the predicted weekly sales.

Metrics

The target data is numerical. Therefore, the root mean squared error (RMSE) can be used to compare the performance of the machine learning techniques. Equation 1 shows the formula for computing the RMSE. In order to get a better understanding of the scale of the RMSE, the normalized RMSE can be used. The normalized RMSE is computed by dividing the RMSE by the standard deviation of the target variable. Additionally, the R-Squared score can be used. The R-Squared score represents the proportion of variance in the target that the model can explain. It compares the squared error of the model's predictions to the error of a "dummy" estimator that always predicts the mean of all target values. Equation 2 shows the formula for computing R-Squared.

Equation 1: Root Mean Squared Error Equation
Equation 2: R-Squared Equation
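Since the equation images are not reproduced here, the two metrics can be written out as follows (standard definitions for n samples with targets y_i, predictions ŷ_i and target mean ȳ, matching the description above):

```latex
% Equation 1: root mean squared error
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}

% Equation 2: R-Squared, i.e. the model's squared error relative to the
% "dummy" estimator that always predicts the mean of the target values
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```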

Data Exploration

The RDA dataset from Kaggle is used for this project. The dataset contains historical sales data from 45 stores located in different regions. Each store is further divided into departments. The data itself is stored in an Excel sheet with three tabs. The first tab contains the data of the stores, the second contains the features, and the third contains the sales data.

Stores

There is data for 45 stores in total. Every store has its own type and size, which are also included in the Excel sheet. The information contained in the Excel sheet is anonymized. Figure 1 shows the statistics of the store sizes. In total, there are three different types of stores (A, B, and C).

Figure 1: Statistics of Store Size

Features

The features are related to a store. Figure 2 contains all available features and a short description of each one, while Figure 3 contains statistics for some of the features. The data for MarkDown1 to MarkDown5 is very incomplete and either has to be dropped or different methods for handling missing data have to be applied.

Figure 2: Features of the Dataset
Figure 3: Statistics of Features

Sales

Each store also has historical sales data stored in the dataset. The sales data was collected from February 5, 2010 until November 1, 2012. Figure 4 contains all features related to the sales data and Figure 5 contains some statistics of the weekly sales.

Figure 4: Features of Sales Data
Figure 5: Statistics of Sales Data

Exploratory Visualization

Figure 6: Historical Data of Features

This chapter takes a more detailed look at the data and all features. Figure 6 shows a time series analysis of some features. The first row contains the plot of the temperature. The temperature oscillates year by year, reaching its maximum somewhere around July and its minimum somewhere in January. The temperature itself doesn't seem to have any influence on the weekly sales. The second row of the figure contains the fuel price. The fuel price had a large increase from January 2011 to July 2011. Afterwards, the fuel price oscillates up and down. The third row of the figure contains the Consumer Price Index (CPI), which has been steadily increasing since the beginning of the time series. In the next row one can see the unemployment rate, which has been steadily decreasing since the beginning of the time series. The trends of CPI and unemployment rate make sense: when more people have jobs, they have more money to buy things, and the higher demand pushes prices, and therefore the CPI, up. The fifth row of the figure contains the boolean data of whether it is a holiday week or not. The last row contains the weekly sales data. As one can see, the weekly sales are not automatically larger if the week is a holiday week. The peaks of weekly sales are in November and December. The peak in November is possibly due to Black Friday, while the peak in December is possibly due to Christmas. In January the weekly sales are the lowest.

Figure 7 shows the store analysis. The plot on the left shows how many stores per type are available. Stores of type "A" are the most common, while stores of type "C" are the least common. The plot in the middle shows a box plot of the weekly sales per store type. Store type "A" has the highest median weekly sales, while type "C" has the lowest. This makes sense if one takes a look at the right plot, which shows a box plot of store sizes per store type. There, one can see that stores of type "A" are the largest, stores of type "B" the second largest, and stores of type "C" the smallest. So the store size directly correlates with the amount of weekly sales.

Figure 7: Store Analysis

Algorithms and Techniques

In order to solve the problem, machine learning shall be applied. Different state-of-the-art machine learning techniques should be applied, and the best-performing one should be used for the final application. Within the final application, a user should be able to enter a week and a store of interest and get the predicted weekly sales as output. The following machine learning algorithms shall be applied and evaluated:

1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor
4. XGBoost Regressor

The final model should be able to follow the pattern of weekly sales. It should be able to detect the peaks around Black Friday and Christmas and the low values in January.

Linear Regression

Linear Regression is a machine learning technique that tries to find a linear function combining all input features in order to predict the target value. It therefore assumes that the relationship between the input features and the target value is linear.
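As a quick, self-contained illustration (not the project's actual SageMaker training script), a linear regression can be fitted with scikit-learn on dummy data like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Dummy data standing in for the retail features and the weekly sales target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -1.0, 0.5, 2.0, 0.0]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
rmse = np.sqrt(mean_squared_error(y, model.predict(X)))  # RMSE as defined in Equation 1
print(f"Training RMSE: {rmse:.3f}")
```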

Decision Tree Regressor

A decision tree can represent linear and non-linear functions and tries to build a tree of rules. Each node in the tree contains a condition and each branch an outcome. It can be thought of as a simple "if condition then branch1 else branch2" structure. The decision tree algorithm tries to build these rules in a way that optimizes the prediction performance. When a prediction should be created, a "walk" through the nodes and branches of the decision tree is made and the final leaf node contains the prediction of the tree. Decision trees are very popular because of their easy interpretability. The structure of the tree can be easily visualized, and the user can directly see the node conditions and the branches and can therefore easily follow the strategy with which the decision tree model creates its prediction.
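To make the interpretability concrete, a shallow tree fitted on dummy data can be printed as plain if/else rules with scikit-learn's export_text (illustrative only; the feature name and data are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Dummy data: sales jump in the last weeks of the year
rng = np.random.default_rng(0)
X = rng.uniform(1, 52, size=(300, 1))                      # single feature: week of the year
y = np.where(X[:, 0] > 46, 50_000, 15_000) + rng.normal(scale=1_000, size=300)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["week_of_year"]))   # readable if/else rule structure
```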

Random Forest Regressor

A random forest creates multiple decision trees at training time in order to prevent overfitting. At inference time, a random forest then outputs the mean of all decision tree predictions as the final prediction.
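A small sketch of this averaging behavior on dummy data: the forest's prediction is (up to floating point precision) the mean over the predictions of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree predictions and compare with the forest output
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
print(np.allclose(per_tree.mean(axis=0), forest.predict(X[:5])))  # True
```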

XGBoost Regressor

The XGBoost Regressor is another tree-based ensemble method. In contrast to the random forest, it uses gradient boosting techniques, building trees sequentially so that each new tree corrects the errors of the previous ones, in order to increase training speed and prediction performance.
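A minimal sketch with the open-source xgboost package (the project itself uses the built-in SageMaker XGBoost container instead, so this only shows the estimator interface; the hyperparameters are arbitrary):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300)

# Trees are added sequentially; each new tree fits the residual errors of the previous ones
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:3]))
```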

Benchmark

The linear regression model is used as the benchmark model. It is trained using the scikit-learn library. It has an RMSE of 21,469.1 weekly sales, which is very large and tells us that the weekly sales are not linearly dependent on the input features. Figure 8 shows 100 predictions and their true labels. As one can see, the benchmark model is not able to follow the structure of the weekly sales. Figure 11 shows the code for training the benchmark model, deploying it and then getting the evaluation results.

Figure 8: Results of Linear Regression Model

Implementation

In this chapter I want to explain the main procedures for pre-processing the data and training all estimators. I've attached code snippets for more clarity.

Data Preprocessing

The notebook “2_Create_Train_and_Test_Data.ipynb” contains the code for data preprocessing and for creating the training, testing and validation data. As a first pre-processing step, the column “IsHoliday” is converted from boolean to integer. Secondly, the store type is converted to a categorical column: store type “A” is converted to 0, “B” to 1 and “C” to 2. As a next step, the date column is converted into two separate features: the first one is the year, the second one is the week of the year. The week of the year shall be used because the weekly sales are to be predicted. Afterwards, the large number of missing values in the markdown columns shall be tackled. In order to fill the missing values, the iterative imputer from the sklearn library is used. The iterative imputer models each missing value as a function of the other features and uses that estimate for imputation. The features and the available values of the markdown columns are used to train a regression model, and the missing values are then predicted using that trained regression model. The iterative imputer is only trained on the training data in order to avoid data leakage into the hold-out test and validation data. Figure 9 shows the code for training an iterative imputer and then imputing the missing values.
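Since Figure 9 is an image, here is a minimal, self-contained sketch of these pre-processing steps on a tiny dummy DataFrame (the column names follow the dataset, but the values and the boundary between training and hold-out rows are purely illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables IterativeImputer
from sklearn.impute import IterativeImputer

# Tiny dummy frame standing in for the merged stores/features/sales data
df = pd.DataFrame({
    "IsHoliday": [False, True, False, False, False, True],
    "Type": ["A", "B", "C", "A", "B", "A"],
    "Date": ["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26", "2010-03-05", "2010-03-12"],
    "Temperature": [42.3, 38.5, 39.9, 46.6, 47.2, 50.1],
    "MarkDown1": [np.nan, 150.0, np.nan, 300.0, 220.0, np.nan],
})

df["IsHoliday"] = df["IsHoliday"].astype(int)              # boolean -> 0/1
df["Type"] = df["Type"].map({"A": 0, "B": 1, "C": 2})      # store type -> category code
dates = pd.to_datetime(df.pop("Date"))
df["Year"] = dates.dt.year
df["Week"] = dates.dt.isocalendar().week.astype(int)       # week of the year

# Fit the imputer on the training rows only to avoid leakage into test/validation data
imputer = IterativeImputer(random_state=0)
train_rows = df.iloc[:4]
df[df.columns] = imputer.fit(train_rows).transform(df)
```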

In the end, the train, test and validation datasets are created and stored in separate csv files. The test set makes up 10% of all data, while the validation set makes up 30% of the remaining data.
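These split sizes can be reproduced with two calls to scikit-learn's train_test_split (a sketch on placeholder arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the pre-processed features and the weekly sales target
X, y = np.random.rand(1000, 6), np.random.rand(1000)

# 10% of all data becomes the test set ...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
# ... and 30% of the remaining data becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.30, random_state=0)
```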

Training

The notebook “3_Training_and_Deployment.ipynb” contains all the training code. I started by uploading the training, validation and test data to S3 (Figure 10).
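Figure 10 is not reproduced here; uploading the csv files with the SageMaker Python SDK could look roughly like this (bucket and key prefix are hypothetical):

```python
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()        # or a dedicated bucket name
prefix = "retail-data-analytics"         # hypothetical key prefix

# Upload the csv files created by the preprocessing notebook to S3
train_s3 = session.upload_data("train.csv", bucket=bucket, key_prefix=f"{prefix}/train")
val_s3 = session.upload_data("validation.csv", bucket=bucket, key_prefix=f"{prefix}/validation")
test_s3 = session.upload_data("test.csv", bucket=bucket, key_prefix=f"{prefix}/test")
```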

Then I trained the benchmark model. The linear regression benchmark model is created and trained using the sklearn library. The training code is located in the “train_linear_regression.py” script within the source folder, which is also used as the entry point for the SageMaker sklearn estimator object in the notebook. Afterwards, the benchmark model was deployed and evaluated. In the evaluation, I computed the normalized RMSE and the R-Square score and plotted the first 100 test predictions against the ground-truth values using the matplotlib library (Figure 11).
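Since Figure 11 is an image, here is a rough sketch of how such a SageMaker SKLearn estimator can be trained, deployed and evaluated. The entry-point script name comes from the repository; the source directory, instance types, framework version and the evaluation variables (X_test, y_test) are assumptions:

```python
import numpy as np
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn
from sklearn.metrics import r2_score

sklearn_estimator = SKLearn(
    entry_point="train_linear_regression.py",  # training script from the source folder
    source_dir="source",                       # assumed directory name
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.large",               # assumed instance type
    framework_version="0.23-1",                # assumed sklearn version
)
sklearn_estimator.fit({"train": train_s3})     # S3 path from the upload sketch above

predictor = sklearn_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# X_test / y_test: hold-out test features and labels loaded from test.csv (assumed)
y_pred = predictor.predict(X_test)
nrmse = np.sqrt(np.mean((y_test - y_pred) ** 2)) / np.std(y_test)  # normalized RMSE
r2 = r2_score(y_test, y_pred)
```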

After that, I trained a decision tree regressor and a random forest regressor. For both, I used the sklearn library, and both have separate training scripts in the source folder as entry points. A decision tree and a random forest have a lot of different hyperparameters. In order to find the best combination, I used the skopt library to create a Bayesian optimizer. The Bayesian optimizer takes as input a sklearn estimator object, a dictionary with hyperparameter names as keys and their value ranges as values, an integer indicating how many iterations should be performed, and the number of folds used for cross-validation. Figure 12 contains the code for finding the optimal decision tree regressor. I used a cross-validation fold size of 5 and the Bayesian optimizer ran for 7 iterations.
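Figure 12 is also an image; with the skopt library, the Bayesian search over decision tree hyperparameters can be sketched as follows. The hyperparameter ranges and the placeholder training data are illustrative; only n_iter=7 and cv=5 come from the description above.

```python
import numpy as np
from skopt import BayesSearchCV
from skopt.space import Integer
from sklearn.tree import DecisionTreeRegressor

# Placeholders for the pre-processed training data
X_train, y_train = np.random.rand(500, 6), np.random.rand(500)

search = BayesSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    search_spaces={                       # hyperparameter name -> value range
        "max_depth": Integer(2, 20),
        "min_samples_split": Integer(2, 20),
        "min_samples_leaf": Integer(1, 10),
    },
    n_iter=7,                             # Bayesian optimization iterations
    cv=5,                                 # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_
```

The random forest search in Figure 13 follows the same pattern, with a RandomForestRegressor (configured via n_jobs to use all four CPU cores) as estimator and n_iter set to 10.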

Figure 13 contains the code for finding the optimal random forest regressor. The cross-validation fold size was again 5 and the Bayesian optimizer ran for 10 iterations. The random forest regressor additionally uses all four CPU cores in order to speed up the training.

For both trained models, I computed the normalized RMSE and the R-Square score. As the last estimator, I trained the XGBoost regressor. The XGBoost regressor is trained using the pre-built XGBoost container from SageMaker. Again, the XGBoost estimator has a lot of hyperparameters. In order to find the best combination, I used the “HyperparameterTuner” class from SageMaker. Figure 14 contains the code for finding the optimal XGBoost regressor and evaluating it. I trained 20 different XGBoost estimators and fetched and deployed the best-performing one. After deployment, I again computed the normalized RMSE and the R-Square score, plus the visualization of the first 100 test predictions.
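Figure 14 is again an image; a rough sketch of tuning the built-in SageMaker XGBoost container with the HyperparameterTuner could look like this. The objective metric, hyperparameter ranges, region, container version, instance types and output path are all assumptions; only the 20 training jobs come from the description above.

```python
from sagemaker import get_execution_role, image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Built-in XGBoost container (region and version are assumptions)
xgb_image = image_uris.retrieve("xgboost", region="us-east-1", version="1.2-1")

xgb = Estimator(
    image_uri=xgb_image,
    role=get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://<bucket>/retail-data-analytics/output",  # placeholder
)
xgb.set_hyperparameters(objective="reg:squarederror", num_round=200)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
        "subsample": ContinuousParameter(0.5, 1.0),
    },
    max_jobs=20,            # 20 different XGBoost estimators, as described above
    max_parallel_jobs=2,
)

tuner.fit({
    "train": TrainingInput(train_s3, content_type="csv"),
    "validation": TrainingInput(val_s3, content_type="csv"),
})
best_predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```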

Results

Figure 15: Final Results

Figure 16 shows the first 100 predictions of the trained decision tree regressor compared to the ground truth labels. As one can see, the decision tree regressor is already able to follow the pattern of the weekly sales. Figure 15 contains the RMSE and the R-Square score.

Figure 16: Results of Decision Tree Model

Figure 17 shows the first 100 predictions of the trained random forest model. The random forest is again able to follow the structure of the weekly sales, but it has a slightly larger normalized RMSE while having the same R-Square score as the decision tree model.

Figure 17: Results of Random Forest Model

Figure 18 shows the first 100 predictions of the trained XGBoost model with the best hyperparameter combination according to the Bayesian optimization. The XGBoost model also has the lowest normalized RMSE and the highest R-Square score and is therefore the best-performing estimator of all trained models.

Figure 18: Results of XGBoost Model

The XGBoost model has an R-Square score more than ten times larger than that of the benchmark model, and its normalized RMSE is five times smaller than the benchmark's. The R-Square score is 0.96, which means that the final model can represent 96% of the variance in the test data set. The RMSE is still at 19% of the standard deviation of the weekly sales. But the range of weekly sales is also very large, with the minimum weekly sales at -4989 and the maximum weekly sales at 693099.4 (cf. Figure 5). When taking a look at Figure 18, one can see that the final model has learned the underlying structure and can therefore be used as a predictor, but one still has to take potential outliers in the prediction into consideration.

Data Science
Machine Learning
AWS
Scikit Learn
Python