Tien Nguyen

Summary

The blog post describes the fine-tuning process of Random Forest and XGBoost machine learning models to enhance time series forecasting accuracy in retail sales using the Root Mean Square Error (RMSE) metric.

Abstract

This article focuses on optimizing machine learning models, specifically Random Forest and XGBoost, for time series forecasting in the retail sales sector. It details the preprocessing of a sales dataset, the construction of the models, and the initial performance evaluation using RMSE. The author employs Python, pandas, scikit-learn, and XGBoost libraries to structure the code systematically, from data loading to model evaluation. A significant part of the work involves using Grid Search for hyperparameter tuning, which leads to improved RMSE values. The fine-tuning process, particularly for the XGBoost model, includes adjusting the number of estimators, maximum depth, and regularization parameters to prevent overfitting. The final results show that the XGBoost model, with an RMSE of ~2014, outperforms the Random Forest model and their ensemble, making it the preferred choice for its balance of complexity and predictive power.

Opinions

  • The author suggests that XGBoost is superior for time series forecasting in sales due to its lower RMSE and good trade-off between complexity and predictive power compared to Random Forest.
  • Fine-tuning, especially regularization parameter adjustments, is crucial for the robustness and reliability of predictive models in sales forecasting.
  • The ensemble method of averaging predictions from Random Forest and XGBoost is explored but does not yield the best results, indicating the individual strength of the fine-tuned XGBoost model.
  • The complete code and data preprocessing steps are made available on the author's GitHub repository for those interested in replicating the results or using the methodology for their own projects.

Fine-Tuning Machine Learning Models for Time Series Forecasting in Retail Sales

Time series forecasting is a vital component in sales. Accurate forecasts enable businesses to make informed decisions and allocate resources more efficiently. This blog post walks through fine-tuning two machine learning models, Random Forest and XGBoost, to find the forecasting model with the lowest Root Mean Square Error (RMSE).
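
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as sales. A minimal sketch in plain Python follows; the function name `rmse` and the toy numbers are illustrative and not part of the original code, which uses scikit-learn's `mean_squared_error`.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Square each error, average the squares, then take the square root
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy numbers, not taken from the sales dataset
print(rmse([100, 200, 300], [110, 190, 330]))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.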

The Dataset and Preprocessing

The dataset used for this project involves retail sales data with features like store type, location type, region code, and discount availability. After some necessary preprocessing steps like handling duplicates and converting date fields into usable formats, additional time-related features such as week, day, month, and year were extracted to improve the model’s performance.

Model Building and Initial Results

In this project, Python was the programming language of choice, with pandas for data manipulation and scikit-learn and XGBoost for machine learning. The code was structured to progress clearly from data loading to model evaluation. Both models used the same preprocessing, which included one-hot encoding of categorical variables with scikit-learn's ColumnTransformer and OneHotEncoder. The data was split into training and test sets for model validation. Here is the Random Forest code:

# Train the Random Forest model
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
from math import sqrt

# Load the dataset
df1 = pd.read_csv('../input/sales-forecasting-womart-store/TRAIN.csv')
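# Drop duplicate rows; .copy() avoids a possible SettingWithCopyWarning on the assignments below
df = df1.drop_duplicates().copy()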

# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

# Extract time-related features
df['Week'] = df['Date'].dt.isocalendar().week
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year

# Drop the "ID" and "Date" columns, since they do not contribute to the prediction
df = df.drop(['ID', 'Date'], axis=1)

# Split the data into training and testing sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Define categorical columns to be one-hot encoded along with new time-related features
categorical_cols = ['Store_Type', 'Location_Type', 'Region_Code', 'Discount', 'Week', 'Day', 'Month', 'Year', 'Holiday']

# Create a ColumnTransformer for one-hot encoding
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False)
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)],
    remainder='passthrough'
)

# Fit the ColumnTransformer on the training data and transform both training and test data
X_train = preprocessor.fit_transform(train_data.drop('Sales', axis=1))
X_test = preprocessor.transform(test_data.drop('Sales', axis=1))
y_train = train_data['Sales']
y_test = test_data['Sales']

# Train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=50, max_depth=5)
rf_model.fit(X_train, y_train)
predictions_rf = rf_model.predict(X_test)

# Evaluate the model
rf_rmse = sqrt(mean_squared_error(y_test, predictions_rf))
print(f"Random Forest RMSE: {rf_rmse}")

Random Forest

  • Method: Ensemble of Decision Trees
  • RMSE: ~5795

XGBoost

  • Method: Gradient Boosting
  • RMSE: ~3220

Ensemble Random Forest and XGBoost

  • Method: Average of XGBoost and Random Forest
  • RMSE: ~4164
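
The simple ensemble above can be sketched as an element-wise mean of the two models' predictions. This is a minimal illustration in plain Python, not the author's code; `average_predictions` and the sample values are hypothetical.

```python
def average_predictions(preds_a, preds_b):
    """Element-wise mean of two equal-length prediction sequences."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both models must predict the same number of rows")
    return [(a + b) / 2 for a, b in zip(preds_a, preds_b)]


# Hypothetical predictions for three test rows
rf_preds = [40000.0, 52000.0, 31000.0]
xgb_preds = [42000.0, 50000.0, 33000.0]
print(average_predictions(rf_preds, xgb_preds))
```

With the NumPy arrays returned by `predict`, the same result comes from `(predictions_rf + predictions_xgb) / 2`, where `predictions_xgb` is an assumed name for the XGBoost predictions. An equal-weight average gives the weaker model as much say as the stronger one, which is consistent with the ensemble RMSE landing between the two individual scores.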

Fine-Tuning: The Road to Optimization

Each model underwent a rigorous fine-tuning process. Hyperparameters were tuned with Grid Search. To save computation time, I split the tuning into multiple steps, optimizing one parameter at a time. For Random Forest, the number of estimators and the maximum depth were optimized. For XGBoost, in addition to these parameters, the regularization terms (reg_alpha and reg_lambda) were tuned to prevent overfitting. Here is the tuning code for Random Forest:

from sklearn.model_selection import GridSearchCV

# Step 1: Tune 'n_estimators' keeping 'max_depth' at a reasonable fixed value (e.g., 5)

# Create the parameter grid for 'n_estimators'
param_grid_1 = {'n_estimators': [50, 100, 200]}

# Create a base model
rf = RandomForestRegressor(max_depth=5)

# Instantiate the grid search model
grid_search_1 = GridSearchCV(estimator = rf, param_grid = param_grid_1, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search_1.fit(X_train, y_train)

# Get the best 'n_estimators' parameter
best_n_estimators = grid_search_1.best_params_['n_estimators']

# Step 2: Tune 'max_depth' using the best 'n_estimators' value found (which is 100)

# Create the parameter grid for 'max_depth'
param_grid_2 = {
    'max_depth': [5, 10, 15]
}

# Create a base model with the best 'n_estimators'
rf = RandomForestRegressor(n_estimators=best_n_estimators)  

# Instantiate the grid search model
grid_search_2 = GridSearchCV(estimator = rf, param_grid = param_grid_2, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search_2.fit(X_train, y_train)

# Get the best 'max_depth' parameter
best_max_depth = grid_search_2.best_params_['max_depth']

# Train the Random Forest model with best parameters
best_rf_model = RandomForestRegressor(n_estimators=best_n_estimators, max_depth=best_max_depth)
best_rf_model.fit(X_train, y_train)
predictions_best_rf = best_rf_model.predict(X_test)

# Evaluate the tuned model
best_rf_rmse = sqrt(mean_squared_error(y_test, predictions_best_rf))
print(f"Tuned Random Forest RMSE: {best_rf_rmse}")

Fine-Tuned Result

Tuned Random Forest

  • RMSE: ~3813

Tuned XGBoost

  • RMSE: ~2014

Ensemble tuned models

  • Method: Weighted average of the predictions from the Random Forest and XGBoost models
  • RMSE: ~2390
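
The article does not say how the ensemble weights were chosen. One plausible approach, sketched here as an assumption, is to scan a few candidate weights and keep the one with the lowest RMSE on a held-out validation set. All function names and numbers below are hypothetical.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def weighted_average(preds_a, preds_b, weight_a):
    """Blend two prediction sequences; model B gets weight 1 - weight_a."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def best_weight(y_val, preds_a, preds_b, steps=10):
    """Try weights 0, 1/steps, ..., 1 for model A and return the lowest-RMSE one."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda w: rmse(y_val, weighted_average(preds_a, preds_b, w)),
    )


# Toy validation data: model A overshoots by 4, model B undershoots by 1
y_val = [10, 20, 30]
rf_val_preds = [14, 24, 34]
xgb_val_preds = [9, 19, 29]
print(best_weight(y_val, rf_val_preds, xgb_val_preds))
```

Choosing the weight on the test set itself would make the reported ensemble RMSE optimistic, which is why the sketch assumes a separate validation split.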

Conclusion: Why XGBoost?

After all the experiments, the XGBoost model emerged as the winner with the lowest RMSE of around 2014. It provided a good trade-off between complexity and predictive power, although it is generally more complex and less interpretable compared to Random Forest. The fine-tuning of regularization parameters contributed significantly to this success, showcasing the model’s robustness and reliability for time-series forecasting in sales.

When accuracy is the cornerstone for your business decisions, XGBoost stands as the go-to model, making it the final choice for this project.

For those interested in diving into the complete code and data preprocessing steps, you can find everything in my GitHub Repository.
