Retail Data Analytics
A Data Science Portfolio Project using AWS
Introduction
In this article I want to present a complete data science portfolio project, in which I perform Retail Data Analytics using Amazon Web Services (AWS) and different machine learning algorithms. The full code, including a project proposal and a final project report, can be found in my GitHub repository.
Definition
Retail Data Analytics (RDA) is used by shops nowadays in order to better predict the number of articles that might be sold and therefore to better estimate how many articles should be produced. This is very important, because the number of sold articles can vary largely over the year. For example, people tend to buy more before Christmas than during a normal, non-holiday week. This can easily be seen in the Amazon quarterly revenue on Statista. Amazon's quarterly revenue is always largest in the fourth quarter, which indicates that people consume more during the fourth quarter than during the others. This makes sense, because Christmas falls in the fourth quarter, as does Black Friday, which also leads to large worldwide consumption. If a shop has too few products before Christmas, potential income is lost. But if a shop has too many products, too much storage is required, and storage costs money, so the company would again lose money. RDA can therefore be used to optimize the production of products, such that an optimal amount is always available.
Problem Statement
The goal of this project is to predict the department-wide weekly sales for a store. This should help to optimize the manufacturing process and therefore to increase income while lowering costs. It should be possible to feed in past sales data from a department and to get the predicted weekly sales.
Metrics
The target data is numerical. Therefore, the root mean squared error (RMSE) can be used to compare the machine learning techniques and to pick the best-performing one. Equation 1 shows the formula for computing the RMSE. In order to get a better understanding of the scale of the RMSE, the normalized RMSE can be used, which is computed by dividing the RMSE by the standard deviation of the target variable. Additionally, the R-Squared score can be used. The R-Squared score represents the proportion of variance in the target that a model can explain. It is computed as one minus the ratio of the model's squared error to the squared error of a “dummy” estimator that always predicts the mean of all target values. Equation 2 shows the formula for computing R-Squared.
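The rendered equations are part of the report in the repository; assuming the standard definitions, with y_i the true weekly sales, ŷ_i the prediction, ȳ the mean of the targets and n the number of samples, they read:

```latex
% Equation 1: root mean squared error
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}

% Equation 2: R-Squared score
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```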


Data Exploration
The RDA dataset from Kaggle is used for this project. The dataset contains historical sales data from 45 stores located in different regions. Each store is further divided into departments. The data itself is stored in an Excel file containing three tabs: the first tab contains the data about the stores, the second contains the features, and the third contains the sales data.
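A minimal sketch for loading the three tabs with pandas; the file name and tab names are assumptions and may differ in the actual Kaggle download:

```python
import pandas as pd

# File and sheet names are placeholders; adjust them to the actual dataset.
xlsx_path = "retail_data_analytics.xlsx"
stores = pd.read_excel(xlsx_path, sheet_name="Stores")      # store type and size
features = pd.read_excel(xlsx_path, sheet_name="Features")  # temperature, fuel price, CPI, ...
sales = pd.read_excel(xlsx_path, sheet_name="Sales")        # department-wide weekly sales

print(stores.shape, features.shape, sales.shape)
```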
Stores
There is data for 45 stores in total. Every store has its own type and size, which are also included in the Excel file. The information contained in the file is anonymized. Figure 1 shows the statistics of the store sizes. In total there are three different types of stores (A, B, and C).
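The statistics shown in figure 1 can be reproduced roughly like this, assuming the stores tab loaded above and the Kaggle column names:

```python
# Number of stores per type and size statistics per type.
print(stores["Type"].value_counts())
print(stores.groupby("Type")["Size"].describe())
```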

Features
The features are related to a store. Figure 2 contains all available features and a short description of each one, while figure 3 contains the statistics for some of the features. The data for MarkDown1 to MarkDown5 is very incomplete and either has to be dropped or methods for handling missing data have to be applied.
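How incomplete the markdown columns are can be checked quickly; the column names are assumed to follow the Kaggle dataset:

```python
# Fraction of missing values per markdown column.
markdown_cols = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]
print(features[markdown_cols].isna().mean())
```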


Sales
Each store also has historical sales data stored in the dataset. The sales data was collected from 5 February 2010 until 1 November 2012. Figure 4 contains all features related to the sales data and figure 5 contains some statistics of the weekly sales.
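The collection period and the weekly sales statistics of figure 5 can be inspected like this, again assuming the Kaggle column names:

```python
# Date range of the sales data and summary statistics of the weekly sales.
sales["Date"] = pd.to_datetime(sales["Date"])
print(sales["Date"].min(), sales["Date"].max())
print(sales["Weekly_Sales"].describe())
```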


Exploratory Visualization

This chapter takes a more detailed look at the data and all features. Figure 6 shows a time series analysis of some features. The first row contains the plot of the temperature. The temperature alternates year by year, having its maximum around July and its minimum around January. The temperature itself does not seem to have any influence on the weekly sales. The second row of the figure contains the fuel price. The fuel price increased strongly from January 2011 to July 2011; afterwards it oscillates up and down. The third row of the figure contains the Consumer Price Index (CPI), which has been steadily increasing since the beginning of the time series. In the next row one can see the unemployment rate, which has been steadily decreasing since the beginning of the time series. The trends of CPI and unemployment rate make sense: when more people have jobs, they have more money to buy things, demand rises, and with larger demand come larger prices, so the CPI goes up. The fifth row of the figure contains the boolean data of whether a week is a holiday week or not. The last row contains the weekly sales data. As one can see, the weekly sales are not automatically larger if the week is a holiday week. The peaks of weekly sales are in November and December; the peak in November is possibly due to Black Friday, while the peak in December is possibly due to Christmas. In January the weekly sales are the lowest.

Figure 7 shows the stores analysis. The plot on the left shows how many stores of each type are available. Stores of type “A” are the most common, while stores of type “C” are the least common. The plot in the middle shows a box plot of the weekly sales per store type. Store type “A” has the highest median of weekly sales, while type “C” has the lowest. This makes sense if one takes a look at the right plot, which shows a box plot of store sizes per store type: stores of type “A” are the largest, type “B” the second largest and type “C” the smallest. So the store size directly correlates with the weekly sales.
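A small sketch of how such a time series view of the weekly sales can be produced; it reuses the sales tab loaded above, and the column names are assumptions:

```python
import matplotlib.pyplot as plt

# Total weekly sales over time, aggregated over all stores and departments.
weekly_totals = sales.groupby("Date")["Weekly_Sales"].sum()

fig, ax = plt.subplots(figsize=(10, 4))
weekly_totals.plot(ax=ax)
ax.set_xlabel("Date")
ax.set_ylabel("Total weekly sales")
ax.set_title("Weekly sales over time")
plt.tight_layout()
plt.show()
```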

Algorithms and Techniques
In order to solve the problem, machine learning shall be applied. Different state-of-the-art machine learning techniques should be applied and the best-performing one should be used for the final application. Within the final application, a user should be able to enter a week and a store of interest and to get the predicted weekly sales as output. The following machine learning algorithms shall be applied and evaluated (see the comparison sketch below):
1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor
4. XGBoost Regressor
The final model should be able to follow the pattern of weekly sales. It should be able to detect the peaks around Black Friday and Christmas and the low values in January.
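A minimal local comparison sketch, assuming the preprocessed train/validation csv files created later in this article and a "Weekly_Sales" target column (in the project itself the models are trained on AWS SageMaker):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# File and column names are assumptions; adjust them to your setup.
train = pd.read_csv("train.csv")
val = pd.read_csv("validation.csv")
X_train, y_train = train.drop(columns=["Weekly_Sales"]), train["Weekly_Sales"]
X_val, y_val = val.drop(columns=["Weekly_Sales"]), val["Weekly_Sales"]

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42),
}

# Train every candidate model and compare RMSE and normalized RMSE on the validation set.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, preds))
    print(f"{name}: RMSE = {rmse:.1f}, normalized RMSE = {rmse / y_val.std():.3f}")
```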
Linear Regression
Linear Regression is a machine learning technique which tries to find a linear function of all input features that approximates the target value. It is therefore assumed that the relationship between the input features and the target value is linear.
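Assuming the usual formulation with weights w_i and input features x_i, the model has the form:

```latex
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m
```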
Decision Tree Regressor
A decision tree can represent linear and non-linear functions and tries to build a tree of rules. Each node in the tree contains a condition and each branch an outcome. It can be thought of as a simple “if condition then branch1 else branch2” structure. The decision tree algorithm tries to build these rules in a way that optimizes the prediction performance. When a prediction should be created, a “walk” through the nodes and branches of the decision tree is made, and the final node contains the prediction of the tree. Decision trees are very popular because of their easy interpretability. The structure of the tree can easily be visualized, and the user can directly see the node conditions and the branches and can therefore easily follow how the decision tree model arrives at its prediction.
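This interpretability can be illustrated with scikit-learn's export_text, which prints the learned if/else rules; the sketch reuses the training data from the comparison above:

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# A small, purely illustrative tree; max_depth is limited to keep the printout readable.
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Each line is a node condition or a leaf value, i.e. the rule set of the tree.
print(export_text(tree, feature_names=list(X_train.columns)))
```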
Random Forest Regressor
A random forest creates multiple decision trees at training time in order to prevent overfitting. At inference time, a random forest then outputs the mean of all decision tree predictions as the final prediction.
XGBoost Regressor
The XGBoost Regressor is, like the random forest, an ensemble of decision trees, but it builds the trees sequentially using gradient boosting techniques, which typically increases training speed and prediction performance.
Benchmark
The linear regression model is used as the benchmark model. It is trained using the scikit-learn library. It has an RMSE of 21469.1 weekly sales, which is very large and tells us that the weekly sales do not depend linearly on the input features. Figure 8 shows 100 predictions and their true labels. As one can see, the benchmark model is not able to follow the structure of the weekly sales. Figure 11 shows the code for training the benchmark model, deploying it and then getting the evaluation results.
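The exact code is shown in Figure 11 and in the repository; below is only a hedged sketch of what such a train/deploy/evaluate flow typically looks like with the SageMaker Python SDK. The entry-point script name, S3 paths, framework version and instance types are assumptions.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes running inside a SageMaker notebook

# Train the scikit-learn linear regression benchmark; "train_linear_regression.py"
# and the S3 locations are placeholders.
estimator = SKLearn(
    entry_point="train_linear_regression.py",
    framework_version="0.23-1",
    instance_type="ml.m5.large",
    role=role,
    sagemaker_session=session,
)
estimator.fit({"train": "s3://<bucket>/rda/train.csv"})

# Deploy an endpoint, request predictions for the validation features, then clean up.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
predictions = predictor.predict(X_val.values)
predictor.delete_endpoint()
```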

Implementation
In this chapter I want to explain the main procedures for pre-processing the data and training all estimators. I have attached code snippets for more clarity.
Data Preprocessing
The notebook “2_Create_Train_and_Test_Data.ipynb” contains the code for data preprocessing and for creating the training, testing and validation data. As a first pre-processing step, the column “IsHoliday” is converted from boolean to integer. Secondly, the store type is converted to a categorical column: store type “A” is converted to 0, “B” to 1 and “C” to 2. As the next step, the date column is converted into two separate features, the year and the week of the year. The week of the year is used because the weekly sales are to be predicted. Afterwards, the large number of missing values in the markdown columns has to be tackled. In order to fill the missing values, the iterative imputer from the scikit-learn library is used. The iterative imputer models each feature with missing values as a function of the other features and uses that estimate for imputation: the features and the available values of the markdown columns are used to train a regression model, and the missing values are then predicted with the trained regression model. The iterative imputer is only trained on the training data in order to avoid data leakage into the held-out test and validation data. Figure 9 shows the code for training an iterative imputer and then imputing the missing values.
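A sketch of these column conversions, assuming the Kaggle column names; the project's own code lives in the notebook mentioned above, and the imputation is sketched together with the split below:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Basic column conversions; column names are assumed to follow the Kaggle dataset."""
    df = df.copy()
    df["IsHoliday"] = df["IsHoliday"].astype(int)            # boolean -> integer
    df["Type"] = df["Type"].map({"A": 0, "B": 1, "C": 2})    # store type -> categorical code
    df["Date"] = pd.to_datetime(df["Date"])
    df["Year"] = df["Date"].dt.year                          # first date feature: year
    df["Week"] = df["Date"].dt.isocalendar().week.astype(int)  # second date feature: week of year
    return df.drop(columns=["Date"])
```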
In the end, the training, test and validation datasets are created and stored in separate csv files. The test set contains 10% of all data, while the validation set contains 30% of the remaining data.
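A sketch of the split and the leakage-free imputation; the split ratios follow the text, while file names are assumptions, and all columns are assumed to be numeric after the preprocessing above:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

def split_and_impute(df: pd.DataFrame):
    """10% of all data becomes the test set, 30% of the rest the validation set."""
    rest, test = train_test_split(df, test_size=0.10, random_state=42)
    train, val = train_test_split(rest, test_size=0.30, random_state=42)

    # Fit the iterative imputer on the training split only, then apply it to all splits.
    imputer = IterativeImputer(random_state=42)
    train = pd.DataFrame(imputer.fit_transform(train), columns=train.columns, index=train.index)
    val = pd.DataFrame(imputer.transform(val), columns=val.columns, index=val.index)
    test = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)

    # Store the splits in separate csv files (placeholder file names).
    train.to_csv("train.csv", index=False)
    val.to_csv("validation.csv", index=False)
    test.to_csv("test.csv", index=False)
    return train, val, test
```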