Fine-Tuning Machine Learning Models for Time Series Forecasting in Retail Sales
Time series forecasting is a vital component of sales planning. Accurate forecasts enable businesses to make informed decisions and allocate resources more efficiently. This blog post walks through fine-tuning two machine learning models, Random Forest and XGBoost, to find the best forecasting model as measured by Root Mean Square Error (RMSE).
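Because every model comparison in this post rests on RMSE, here is a minimal sketch of the metric itself. The function name `rmse` and the sample numbers are illustrative only and do not come from the project's data.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sales figures, for illustration only
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 330.0]
example_rmse = rmse(actual, forecast)
print(f"RMSE: {example_rmse:.2f}")
```

Since the errors are squared before averaging, the single large miss (30) counts for more than the two small ones (10 each), and the result is expressed in the same units as the sales figures.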

The Dataset and Preprocessing
The dataset used for this project involves retail sales data with features like store type, location type, region code, and discount availability. After some necessary preprocessing steps like handling duplicates and converting date fields into usable formats, additional time-related features such as week, day, month, and year were extracted to improve the model’s performance.
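To make the preprocessing concrete, here is a small sketch of the same steps on a toy table. The rows and the `toy` variable are invented for illustration; only the `Date` and `Sales` column names follow the project's data.

```python
import pandas as pd

# Toy rows, invented for illustration; the last row duplicates the second
toy = pd.DataFrame({
    "Date": ["2018-12-31", "2019-01-01", "2019-01-01"],
    "Sales": [7000.0, 5200.0, 5200.0],
})

# Remove exact duplicate rows, then parse the date strings
toy = toy.drop_duplicates().copy()
toy["Date"] = pd.to_datetime(toy["Date"])

# Calendar features derived from the parsed date
toy["Week"] = toy["Date"].dt.isocalendar().week.astype(int)
toy["Day"] = toy["Date"].dt.day
toy["Month"] = toy["Date"].dt.month
toy["Year"] = toy["Date"].dt.year

print(toy)
```

One detail worth knowing: ISO week numbers do not always line up with the calendar year. 2018-12-31 falls in ISO week 1 (of ISO year 2019), while its `Year` feature is still 2018.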
Model Building and Initial Results
In this project, Python was the programming language of choice, with pandas for data manipulation and scikit-learn and XGBoost for machine learning. The code was structured to progress clearly from data loading to model evaluation. The inputs to both models went through the same preprocessing, including one-hot encoding of categorical variables with scikit-learn’s ColumnTransformer and OneHotEncoder. The data was split into training and test sets for model validation. Here is a sample of the Random Forest code:
# Train the Random Forest model
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
from math import sqrt
# Load the dataset
df1 = pd.read_csv('../input/sales-forecasting-womart-store/TRAIN.csv')
df = df1.drop_duplicates().copy()
# Convert the 'Date' column to datetime format
df['Date'] = pd.to_datetime(df['Date'])
# Extract time-related features
df['Week'] = df['Date'].dt.isocalendar().week
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
# Drop the "ID" and "Date" columns since they do not contribute to prediction
df = df.drop(['ID', 'Date'], axis=1)
# Split the data into training and testing sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
# Define categorical columns to be one-hot encoded along with new time-related features
categorical_cols = ['Store_Type', 'Location_Type', 'Region_Code', 'Discount', 'Week', 'Day', 'Month', 'Year', 'Holiday']
# Create a ColumnTransformer for one-hot encoding
preprocessor = ColumnTransformer(
    # 'sparse_output' replaces the older 'sparse' argument (scikit-learn >= 1.2)
    transformers=[('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_cols)],
    remainder='passthrough'
)
# Fit the ColumnTransformer on the training data and transform both training and test data
X_train = preprocessor.fit_transform(train_data.drop('Sales', axis=1))
X_test = preprocessor.transform(test_data.drop('Sales', axis=1))
y_train = train_data['Sales']
y_test = test_data['Sales']
# Train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=50, max_depth=5)
rf_model.fit(X_train, y_train)
predictions_rf = rf_model.predict(X_test)
# Evaluate the model
rf_rmse = sqrt(mean_squared_error(y_test, predictions_rf))
print(f"Random Forest RMSE: {rf_rmse}")
Random Forest
- Method: Ensemble of Decision Trees
- RMSE: ~5795
XGBoost
- Method: Gradient Boosting
- RMSE: ~3220
Ensemble of Random Forest and XGBoost
- Method: Average of the XGBoost and Random Forest predictions
- RMSE: ~4164
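The averaging step can be sketched as follows, assuming the ensemble is a plain unweighted mean of the two models' test predictions (the post does not show this code). All arrays and names below are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equal-length arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Invented values: 'weak_preds' stands in for the less accurate model,
# 'strong_preds' for the more accurate one
y_true = np.array([100.0, 200.0, 300.0, 400.0])
weak_preds = np.array([160.0, 140.0, 360.0, 340.0])
strong_preds = np.array([110.0, 210.0, 310.0, 410.0])

# Unweighted ensemble: element-wise mean of the two prediction vectors
ensemble_preds = (weak_preds + strong_preds) / 2

rmse_weak = rmse(y_true, weak_preds)
rmse_strong = rmse(y_true, strong_preds)
rmse_ensemble = rmse(y_true, ensemble_preds)
print(f"weak: {rmse_weak:.1f}, strong: {rmse_strong:.1f}, ensemble: {rmse_ensemble:.1f}")
```

In this toy case the averaged forecast lands between the two individual errors, the same pattern as the reported ~4164 sitting between ~3220 and ~5795. When one model is much weaker, an equal-weight average can end up worse than the stronger model alone.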
Fine-Tuning: The Road to Optimization
Each model underwent a rigorous fine-tuning process using grid search. To save computation time, I split the hyperparameter tuning into multiple steps, tuning one parameter at a time instead of searching the full grid at once. For Random Forest, the number of estimators and the maximum depth were optimized. For XGBoost, along with these parameters, the regularization terms (reg_alpha and reg_lambda) were also tuned to prevent overfitting. Here is a sample of the tuning code for Random Forest:
from sklearn.model_selection import GridSearchCV
# Step 1: Tune 'n_estimators' keeping 'max_depth' at a reasonable fixed value (e.g., 5)
# Create the parameter grid for 'n_estimators'
param_grid_1 = {'n_estimators': [50, 100, 200]}
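# Create a base model
rf = RandomForestRegressor(max_depth=5)
# Instantiate the grid search model
# Note: with no 'scoring' argument, GridSearchCV ranks candidates by the
# estimator's default score (R^2 for regressors), not by RMSE
grid_search_1 = GridSearchCV(estimator=rf, param_grid=param_grid_1,
                             cv=3, n_jobs=-1, verbose=2)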
# Fit the grid search to the data
grid_search_1.fit(X_train, y_train)
# Get the best 'n_estimators' parameter
best_n_estimators = grid_search_1.best_params_['n_estimators']
# Step 2: Tune 'max_depth' using the best 'n_estimators' value found in Step 1 (100 in this run)
# Create the parameter grid for 'max_depth'
param_grid_2 = {
    'max_depth': [5, 10, 15]
}
# Create a base model with the best 'n_estimators'
rf = RandomForestRegressor(n_estimators=best_n_estimators)
# Instantiate the grid search model
grid_search_2 = GridSearchCV(estimator=rf, param_grid=param_grid_2,
                             cv=3, n_jobs=-1, verbose=2)
# Fit the grid search to the data
grid_search_2.fit(X_train, y_train)
# Get the best 'max_depth' parameter
best_max_depth = grid_search_2.best_params_['max_depth']
# Train the Random Forest model with the best parameters from both steps
best_rf_model = RandomForestRegressor(n_estimators=best_n_estimators, max_depth=best_max_depth)
best_rf_model.fit(X_train, y_train)
predictions_best_rf = best_rf_model.predict(X_test)
# Evaluate the tuned model
best_rf_rmse = sqrt(mean_squared_error(y_test, predictions_best_rf))
print(f"Tuned Random Forest RMSE: {best_rf_rmse}")
Fine-Tuned Result
Tuned Random Forest
- RMSE: ~3813
Tuned XGBoost
- RMSE: ~2014
Ensemble of tuned models
- Method: Weighted average of the predictions from the Random Forest and XGBoost models
- RMSE: ~2390
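The post does not state which weights were used for the weighted average. As one possible illustration, the sketch below uses inverse-RMSE weighting, a common heuristic that gives the more accurate model the larger weight. The function name `inverse_rmse_weights` and the prediction arrays are assumptions made for this example, not the project's actual code or data.

```python
import numpy as np

def inverse_rmse_weights(rmse_values):
    """Weights proportional to 1/RMSE, normalized to sum to 1."""
    inverse = 1.0 / np.asarray(rmse_values, dtype=float)
    return inverse / inverse.sum()

# RMSEs reported above for the tuned Random Forest and tuned XGBoost models
w_rf, w_xgb = inverse_rmse_weights([3813.0, 2014.0])

# Invented predictions for three test rows, for illustration only
pred_rf = np.array([41000.0, 52000.0, 38000.0])
pred_xgb = np.array([43000.0, 50000.0, 39500.0])

# Weighted ensemble: each model's prediction scaled by its weight
pred_weighted = w_rf * pred_rf + w_xgb * pred_xgb
print(f"w_rf={w_rf:.3f}, w_xgb={w_xgb:.3f}")
print(pred_weighted)
```

In practice, the RMSEs used to set the weights would better come from a separate validation split than from the test set, so that the final test RMSE remains an honest estimate.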
Conclusion: Why XGBoost?
After all the experiments, the tuned XGBoost model emerged as the winner with the lowest RMSE of around 2014, ahead of the tuned ensemble (~2390) and the tuned Random Forest (~3813). It delivered the best predictive power of the models tested, although it is generally more complex and less interpretable than Random Forest. Fine-tuning its hyperparameters, including the regularization terms, brought its RMSE down from ~3220 to ~2014, showcasing the model’s robustness and reliability for time-series forecasting in sales.
When accuracy is the cornerstone for your business decisions, XGBoost stands as the go-to model, making it the final choice for this project.
For those interested in diving into the complete code and data preprocessing steps, you can find everything in my GitHub Repository.





