Unleashing an End-to-End Predictive Model Pipeline: A Step-by-Step Guide
A Detailed ML Ops Pipeline for an End-to-End Predictive Model for Tabular Data
This post provides a comprehensive guide on building an end-to-end predictive model pipeline for tabular data using XGBoost. The step-by-step implementation includes all essential stages of the ML Ops pipeline, such as data preparation, feature engineering, hyperparameter tuning, model explainability, and model monitoring. With the help of code snippets, you can easily follow along and implement this pipeline in your own projects. By the end of this guide, you will have a solid understanding of how to build a robust predictive model pipeline using XGBoost for tabular data.
Outlined below are the high-level steps involved in building an end-to-end predictive ML model:
- Data preparation: Organizing the data in a suitable format for analysis and modeling. This includes sorting the data by timestamp, handling missing values and outliers, and identifying any seasonal patterns or trends.
- Data analysis and visualization: Exploring the data to understand the trends and patterns for the predictors and target variables. This includes visualizing the data, calculating descriptive statistics, and identifying any correlations or dependencies between variables.
- Feature engineering: Creating informative and relevant features and selecting only the variables in the dataset that matter for the model.
- Scaling the independent variables: Scaling the independent variables to ensure they are on the same scale improves the accuracy of the model. This can be done using techniques such as standardization or normalization.
- Hyperparameter tuning: Tuning hyperparameters before training the model, such as the learning rate or the maximum tree depth, can improve the performance of the model. The code will include methods to automate this process.
- Training and evaluation of the model: Training and evaluating different models by splitting the data into training and testing sets. The model's performance is evaluated using metrics such as mean squared error for regression, or accuracy, precision, and recall for classification.
- Model Monitoring: Ensures that the model remains accurate and reliable over time. Models can degrade in predictive performance due to changing data patterns (data drift) or shifts in the underlying relationships between features and the target variable (concept drift). If left unchecked, these changes result in degraded model performance and inaccurate predictions.
- Model Serving or Model Deployment: Deploying the machine learning model frequently and continuously to ensure optimal performance. It's important to choose an appropriate deployment strategy based on the use case's scalability, security, and performance requirements. Deployments can be web-based on a cloud platform or on an edge device, handling either real-time or batched data (a minimal serving sketch follows this list).
- Model Explainability: Interpreting the machine learning model’s workings and how it makes predictions to build trust in the model and ensure that the decisions made by the model are fair and ethical. This can be achieved through various methods such as SHAP values, Decision trees, LIME, partial dependence plots, and feature importance analysis.
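To make the serving step concrete, here is a minimal sketch, not part of the notebook code that follows: it assumes the tuned model has been saved with joblib as 'xgb_tuned.joblib' and exposes a hypothetical /predict endpoint with Flask; the endpoint name and payload shape are illustrative only.
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
# Hypothetical artifact produced earlier with joblib.dump(xgb_tuned, 'xgb_tuned.joblib')
model = joblib.load('xgb_tuned.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON object whose keys are the selected feature columns
    payload = request.get_json()
    features = pd.DataFrame([payload])
    prediction = float(model.predict(features)[0])
    return jsonify({'C6H6_prediction': prediction})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)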

High-level steps for an end-to-end ML Model
Outlined below are the high-level steps for building an end-to-end ML model using the UCI Air Quality dataset. You will predict the concentration of benzene in the air from the measurements of other pollutants and meteorological data using XGBoost.
Data Preparation
- Organize and prepare the data by analyzing it using data visualization to identify trends and patterns
- Handle any missing data and outliers and scale the data
- Split the data into train and test for the independent and target variables
Feature Selection
- Run XGBoost with default hyperparameters and use feature importance to identify the most relevant features
Hyperparameter tuning
- Perform hyperparameter tuning based on feature importance to optimize model performance.
Train the Model
- Train the XGBoost model using the selected features and the best hyperparameters found during tuning.
Evaluate Model Performance
- Evaluate the trained model to assess its performance.
Model Monitoring
- Monitor the model’s performance on live data and retrain the model based on performance to ensure that it remains accurate and reliable over time.
Full code available on GitHub
Data Preparation
Import required libraries
import numpy as np
import pandas as pd
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
from xgboost import plot_importance, plot_tree
from sklearn.metrics import mean_squared_error, mean_absolute_error
Loading Data for Analysis
Data Analysis involves exploring and understanding the data to identify trends and patterns. Organizing the data for analysis and modeling is a critical step in the data analysis process.
Data can be read from an Excel file; however, depending on the data type and volume, other file formats such as CSV or JSON may be more suitable.
df = pd.read_excel('AirQualityUCI.xlsx')
Exploratory Data Analysis
Different techniques can be applied to understand the data, such as visualizing the data, calculating descriptive statistics, identifying correlations and dependencies between variables, and detecting any outliers or missing values.
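For example, a correlation heatmap gives a quick view of how strongly the variables move together. This is a minimal sketch using the seaborn import above and the df data frame loaded earlier:
# Pairwise correlations between the numeric pollutant and weather readings
# (numeric_only skips the Date/Time columns; requires a recent pandas)
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(corr, cmap='coolwarm')
plt.title('Correlation between variables')
plt.show()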
Analyzing the data using descriptive statistics
df.describe() displays a summary of statistics for all the columns in the dataset: the count of data points, the mean, the standard deviation, and the minimum, maximum, and quartile values.
df.describe()

Understand the Data
df.info() prints a concise summary of the data frame, including the index dtype, the columns and their data types, the non-null counts, and the memory usage.
df.info()

Preparing the data for the Model
Prepare the data for the model by concatenating the date and time columns into a single timestamp and setting it as the data frame's index.
df['Datetime'] = pd.to_datetime(df.Date.astype(str) + ' ' + df.Time.astype(str))
df = df.set_index(df['Datetime'])
Visualizing the data
Visualizing the target variable to identify any seasonal patterns or trends using plots
plt.plot(df['C6H6(GT)'])
#plt.gcf().autofmt_xdate()
plt.show()

The plot shows a few data points that are significantly different from the rest of the dataset and can be considered outliers.
Outlier Detection
The code below identifies the outliers for the target variable.
df_anomaly = df.copy()
# Calculate the IQR for the target variable
q1, q3 = df['C6H6(GT)'].quantile([0.25, 0.75])
iqr = q3 - q1
# Calculate the lower and upper outlier bounds for C6H6(GT)
lower_val = q1 - (1.5 * iqr)
upper_val = q3 + (1.5 * iqr)
# Flag the outliers in C6H6(GT)
df_anomaly['anomaly_C6H6'] = ((df['C6H6(GT)'] > upper_val) | (df['C6H6(GT)'] < lower_val)).astype('float')
# Let's plot the outliers and see where they occurred in the time series
a = df_anomaly[df_anomaly['anomaly_C6H6'] == 1]  # anomalies
_ = plt.figure(figsize=(18,6))
_ = plt.plot(df_anomaly['C6H6(GT)'], color='blue', label='Normal')
_ = plt.plot(a['C6H6(GT)'], linestyle='none', marker='X', color='red', markersize=12, label='Anomaly')
_ = plt.xlabel('Date and Time')
_ = plt.ylabel('Benzene levels')
_ = plt.title('Benzene Anomalies')
_ = plt.legend(loc='best')
plt.show()

Outliers can have a significant impact on the performance of a machine-learning model. Therefore, handling outliers is an important step in data preprocessing.
The first step in handling outliers is investigating their origin, whether they are data entry errors or genuine anomalies indicating a significant event. Once the origin of the outliers is understood, there are different options to handle them:
- Remove outliers from the dataset: This can be done based on a threshold or control limits for the variable. However, this option should be chosen carefully, as losing data can negatively impact the model training process.
- Cap or transform the values: Capping (winsorizing) the outliers at a threshold, or replacing them with a statistic such as the mean or median, reduces their influence without dropping rows (a short sketch of these options follows this list).
- Investigate the root cause of the outliers: Further analysis can be done to understand if there are genuine anomalies that need to be kept in the dataset. In this case, removing the outliers may not be the best option, and it may be necessary to adjust the model or incorporate other data to address the outliers.
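As an illustration of the first two options, here is a minimal sketch using the IQR bounds computed above; the pipeline in this post instead replaces the outliers with the mean, as shown next.
# Option 1: drop the rows whose target value falls outside the IQR bounds
df_dropped = df[(df['C6H6(GT)'] >= lower_val) & (df['C6H6(GT)'] <= upper_val)]

# Option 2: cap (winsorize) the target at the IQR bounds instead of dropping rows
df_capped = df.copy()
df_capped['C6H6(GT)'] = df_capped['C6H6(GT)'].clip(lower=lower_val, upper=upper_val)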
Finding the number of rows where the target variable is out of the normal range.
df_outliers=np.where((df['C6H6(GT)'] <= lower_val) | (df['C6H6(GT)'] >= upper_val))
print("Number of rows with outlier values", len(df_outliers[0]))
We have 606 rows where the target variable is out of the normal range. The approach taken here is to replace the outlier values with the mean of the target variable.
df.loc[df["C6H6(GT)"] <=lower_val, "C6H6(GT)"] = df['C6H6(GT)'].mean()
df.loc[df["C6H6(GT)"] >=upper_val, "C6H6(GT)"] = df['C6H6(GT)'].mean()
Drop the redundant date and time columns, sort the data by timestamp, and forward-fill any missing values.
df.drop(['Date', 'Time', 'Datetime'], axis=1, inplace=True)
df = df.sort_index().ffill()
Checking if the dataset has any missing values
df[df.isna().any(axis=1)]
The dataset is now ready: missing values and outliers have been handled, and the data are sorted by DateTime.
Next, perform a seasonal decomposition of the time series stored in the "C6H6(GT)" column of the data frame.
Seasonal decomposition is a method of breaking down a time series into its trend, seasonal, and residual components. The “additive” model is used to perform this decomposition, which assumes that the seasonal component is constant over time.
import statsmodels.api as sm

plt.rc('figure', figsize=(12, 8))
plt.rc('font', size=15)
result = sm.tsa.seasonal_decompose(df["C6H6(GT)"], model='additive')
fig = result.plot()

Preprocess the data
Preprocess the data using the MinMaxScaler method for feature scaling.
The MinMaxScaler scales each input feature to the range 0 to 1 using x_scaled = (x - x_min) / (x_max - x_min).
from sklearn.preprocessing import MinMaxScaler

sc_in = MinMaxScaler(feature_range=(0, 1))
# The date and time columns were already dropped, so scale all remaining numeric columns
df_scaled = sc_in.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)
Create the training and test dataset
The target variable is C6H6(GT), and the rest of the columns are input features.
When splitting a dataset into training and testing sets, there are different strategies that can be used based on the specific use case.
- Date-based strategy for Time Series data: Splitting the data based on the date ensures that the training data only contains information that was available before the test data.
- Percentage-based strategy: Splitting the data based on percentages may be more appropriate where the goal is to build a model that performs well across different segments of the data (a sketch of this approach follows this list).
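For completeness, here is a minimal sketch of the percentage-based alternative with scikit-learn, assuming X and y are the feature matrix and target built in the next code block; it is not used in this post because the data form a time series.
from sklearn.model_selection import train_test_split

# 80/20 split by row count; shuffle=False preserves the chronological order
X_train_pct, X_test_pct, y_train_pct, y_test_pct = train_test_split(
    X, y, test_size=0.2, shuffle=False)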
The code below uses the date-based strategy for splitting the train and test datasets.
df_scaled['Datetime'] = df.index
X = df_scaled.set_index('Datetime')
# Use the scaled features as inputs and the unscaled benzene column as the target
X = X.drop(["C6H6(GT)"], axis=1)
y = df["C6H6(GT)"]
# Split the data based on the date
split_date = "11-09-2004"
X_train = X.loc[X.index <= split_date].copy()
y_train = y.loc[y.index <= split_date].copy()
X_test = X.loc[X.index > split_date].copy()
y_test = y.loc[y.index > split_date].copy()
Feature Selection
Feature selection is an important step in machine learning to
- Reduce overfitting
- Increase model interpretability
- Improve model performance
- Reduce the computational resources by simplifying the model
Feature selection identifies which input features are most relevant to the target variable in order to build a more accurate and interpretable model.
Train the XGBoost Model
Train a baseline XGBoost model on the training dataset, keeping the hyperparameters close to their defaults.
base_model = xgb.XGBRegressor(n_estimators=100, max_depth=4, early_stopping_rounds=100)
base_model.fit(X_train, y_train,
               eval_set=[(X_train, y_train), (X_test, y_test)],
               verbose=50)
Feature Importances
Feature importance is a valuable technique in the machine learning workflow: it helps us identify the most relevant input features, gain deeper insights into the data, and make better decisions about how to build and optimize our models.
Feature importance helps to explain the ML model by highlighting which features are most influential in the decision-making process.
plot_importance is a utility provided by XGBoost to visualize the relative importance of each feature used in the model.
_=plot_importance(base_model, height=0.9)

The plot is a bar chart showing the relative importance of each feature in the trained model. By default, plot_importance ranks features by weight (how often a feature is used to split the data across all trees), while the feature_importances_ attribute used in the next step reflects gain, i.e. how much a feature reduces the model's error when it is used for splitting nodes in the decision trees.
The code below keeps only the most important features, simplifying the model and removing noise so that it is more efficient and accurate.
sorted_idx = np.argsort(base_model.feature_importances_)[::-1]
X_selected_cols = pd.DataFrame()
for index in sorted_idx:
    if base_model.feature_importances_[index] > 0:
        # Keep only the features with non-zero importance
        X_selected_cols[X_train.columns[index]] = X[X.columns[index]]
        print([X.columns[index], base_model.feature_importances_[index]])
Using the most impactful input features identified above, rebuild the training and test datasets.
split_date = "11-09-2004"
X_selected_train = X_selected_cols.loc[X.index <= split_date].copy()
y_train = y.loc[y.index <= split_date].copy()
X_selected_test = X_selected_cols.loc[X.index > split_date].copy()
y_test = y.loc[y.index > split_date].copy()
Hyperparameter Tuning
Hyperparameters are parameters set before training a machine learning model; tuning them optimizes the model's performance by reducing the loss.
The Hyperopt library, a powerful tool for hyperparameter optimization, will be used here to find the best values for the hyperparameters of the XGBoost model.
Define hyperparameters for optimization
Define the hyperparameter grid using the hyperopt library in Python for hyperparameter optimization.
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
# Import 'scope' from hyperopt in order to
# obtain int values for certain hyperparameters.
from hyperopt.pyll.base import scope

hyperparameter_grid = {
    'max_depth': scope.int(hp.quniform('max_depth', 1, 15, 1)),
    'gamma': hp.uniform('gamma', 1, 9),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'min_child_weight': hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 180,
    'eta': hp.uniform('eta', 0, 1),
    'seed': 0
}
Define the objective function for Bayesian Optimization
The objective function objective(space) takes a set of hyperparameters as input and returns a dictionary with the loss and status values; hyperopt's Bayesian optimization calls it to evaluate the performance of each candidate XGBoost configuration.
The function passes the sampled hyperparameters straight to the XGBRegressor and sets the remaining settings that are not part of the search space, such as the evaluation metric and the number of early-stopping rounds.
def objective(space):
    model = xgb.XGBRegressor(**space, early_stopping_rounds=100, eval_metric="rmse")
    # Define evaluation datasets.
    evaluation = [(X_selected_train, y_train), (X_selected_test, y_test)]
    # Fit the model.
    model.fit(X_selected_train, y_train,
              eval_set=evaluation,
              verbose=False)
    pred = model.predict(X_selected_test)
    mse = mean_squared_error(y_test, pred)
    print("SCORE:", mse)
    return {'loss': mse, 'status': STATUS_OK, 'model': model}
Define hyperparameter optimization
The optimization loop will continue until it has evaluated max_evals sets of hyperparameters or until it converges on the optimal hyperparameters.
trials = Trials()
best_hyperparams = fmin(fn=objective,
                        space=hyperparameter_grid,
                        algo=tpe.suggest,
                        max_evals=500,
                        trials=trials)
Printing the best Parameters and lowest loss value
The code below retrieves the model with the lowest loss from the trials recorded during the Bayesian optimization, stores it in best_model, and prints it along with its corresponding loss value.
best_model = trials.results[np.argmin([r['loss'] for r in trials.results])]['model']
lowest_loss = trials.results[np.argmin([r['loss'] for r in trials.results])]['loss']
print(best_model, lowest_loss)
The model is now ready to be trained on the features selected based on their importance to predict the target variable using the best hyperparameters.
Train the Model
Train the model using XGBoost. The XGBRegressor is initialized with the hyperparameters optimized in the previous step to achieve the best possible model performance.
xgb_tuned = xgb.XGBRegressor(
    max_depth=int(best_hyperparams['max_depth']),
    reg_lambda=best_hyperparams['reg_lambda'],  # tuned L2 regularization term from the search space
    objective='reg:squarederror',
    tree_method='hist',
    eval_metric='rmse',
    eta=best_hyperparams['eta'],
    gamma=best_hyperparams['gamma'],
    min_child_weight=best_hyperparams['min_child_weight'],
    early_stopping_rounds=500,
    n_estimators=1000,
)
xgb_tuned.fit(X_selected_train, y_train,
              eval_set=[(X_selected_train, y_train), (X_selected_test, y_test)],
              verbose=50)
Evaluate Model Performance
Evaluate the performance of the tuned XGBoost regression model on the test dataset by generating predictions on the test set and calculating the RMSE (root mean squared error) between the predicted and actual target values.
xgb_preds_best = xgb_tuned.predict(X_selected_test)
xgb_score_best = mean_squared_error(y_test, xgb_preds_best, squared=False)
print('RMSE_Best_Model:', xgb_score_best)
Create a data frame combining the predicted and actual values with the training data for plotting.
X_best_test = X_selected_test.copy()
X_best_test['pred'] = xgb_preds_best
X_best_test['Actual'] = y_test
df_X = pd.concat([X_best_test, X_train], sort=False)
Visualize the actual and the predicted values
_=df_X[['pred', 'Actual']].plot(figsize=(10,10))

Model Explainability
Models should be interpretable, allowing an understanding of the underlying logic or decision-making process that leads to their predictions.
plot_tree(xgb_tuned, num_trees=1)
fig = plt.gcf()
fig.set_size_inches(50, 50)

XGBoost uses decision trees with a gradient-boosting mechanism that iteratively improves its predictions. To explain how the XGBoost model arrives at a prediction, you can plot one of its decision trees, as done above.
Model explainability helps interpret models, building trust and transparency in the decision-making process.
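Beyond plotting individual trees, SHAP values (mentioned at the start of this post) show how much each feature pushes an individual prediction up or down. A minimal sketch, assuming the shap package is installed:
import shap

# Explain the tuned model's predictions on the selected test features
explainer = shap.TreeExplainer(xgb_tuned)
shap_values = explainer.shap_values(X_selected_test)
# Global summary: which features drive the benzene predictions the most
shap.summary_plot(shap_values, X_selected_test)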
Monitor the Model performance
ML models require constant monitoring because their performance can degrade due to changes in the data.
The data can change for several reasons:
- Differences in how the data is collected for training and inference can result in data skew.
- Data trends can also change due to factors such as seasonality or unforeseen events like COVID, climate changes, market trends, or new technological developments.
Different techniques can be used for model monitoring, depending on the scenario:
- Tracking input data: Monitor input data to check whether it falls within the expected range and distribution using statistical measures such as mean, variance, and skewness (a minimal sketch follows this list). If the input data deviates significantly from the expected range, it may indicate a problem with the data pipeline or with the model's assumptions.
- Tracking model performance: Compare the model's predictions with the ground-truth labels; when the model's performance degrades significantly over time, the model needs retraining.
- Error analysis: Tracking the types of errors the model is making and the patterns in those errors can help identify the issues with the changing data pattern.
- End User Feedback: Collecting feedback from end users and incorporating it into the model can help improve its accuracy and relevance.
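As an illustration of the first technique, here is a minimal sketch that flags drifting input features by comparing their live distribution against the training distribution with a two-sample Kolmogorov-Smirnov test; live_df is a hypothetical batch of recent observations with the same columns as the training features.
from scipy.stats import ks_2samp

def detect_feature_drift(train_df, live_df, alpha=0.05):
    # Return the features whose live distribution differs significantly
    # from the training distribution (p-value below alpha)
    drifted = []
    for col in train_df.columns:
        stat, p_value = ks_2samp(train_df[col], live_df[col])
        if p_value < alpha:
            drifted.append((col, round(stat, 3)))
    return drifted

# Example usage with a hypothetical live batch:
# print(detect_feature_drift(X_selected_train, live_df))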
Conclusion:
In conclusion, building a high-performing ML model involves careful consideration of several critical steps, including data preparation, feature engineering, model selection, training, evaluation, hyperparameter tuning, deployment, and monitoring. These steps require experimentation with different techniques to achieve better performance and provide actual value to your organization. By paying close attention to these steps and continuously monitoring your model’s performance, you can develop an efficient ML model that delivers accurate and reliable predictions, helping you make better-informed decisions.