Data Science with Python — Ensemble Methods
This article is part of the “Data Science with Python” series.
Have you ever heard of ensemble methods? They allow us to combine the predictions of multiple models to achieve better overall performance in data science.
Today, we will explore this concept and see how to build models based on ensemble methods with Python.
What Are Ensemble Methods?
Ensemble methods in data science refer to the combination of multiple machine learning models to make predictions or decisions. Instead of relying on a single model, ensemble methods leverage the collective wisdom of several models to improve overall accuracy and performance.
There are several advantages to using ensemble methods in data science:
- Improved Accuracy: Ensemble methods can often achieve higher accuracy than individual models by reducing bias and variance. They compensate for the weaknesses of individual models and produce more reliable predictions.
- Increased Robustness: Ensemble methods are less susceptible to overfitting, a common problem in machine learning. They can reduce the impact of outliers and noise in the data, leading to more robust and generalizable models.
- Better Handling of Complex Data: Ensemble methods are particularly effective when dealing with complex datasets. Indeed, they can capture different aspects of the data and provide a more comprehensive understanding of the underlying patterns.
Most ensemble methods fall into four categories.
First, we have “bagging”, short for bootstrap aggregating, which involves training multiple models on different subsets of the training data. Each model produces its own prediction, and the final prediction is determined by combining the predictions of all models. Bagging is often used with decision trees to create random forests.
Then, we have “boosting”, an iterative ensemble method that focuses on sequentially improving the performance of individual models. Models are trained in a step-wise manner, where each subsequent model corrects the mistakes made by previous models. Some famous boosting algorithms are AdaBoost and Gradient Boosting; maybe you have already heard of them?
We also have “stacking”, also known as stacked generalization. It combines the predictions of multiple models using another model called a meta-learner or blender. The meta-learner learns to combine the predictions of base models to produce the final prediction. Stacking can be seen as a two-level learning process, where the base models learn from the data, and the meta-learner learns from the predictions of the base models.
Finally, “voting”, as the name suggests, involves aggregating the predictions of multiple models through a voting mechanism. There are different types of voting, such as majority voting, weighted voting, and soft voting, each with its own rules for combining predictions.
Prerequisites
If you’ve checked the other articles of this series, you probably already know everything covered here and can skip this section.
Otherwise, start by installing scikit-learn with pip install scikit-learn. You should also install pandas with pip install pandas.
Then, you can dive into the code. As a reminder, to import a dataset and split it into training and testing sets, we use this code:
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('dataset.csv')
X = data.drop('target', axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In this code, ‘target’ represents the target variable in your dataset that you want to predict. Adjust the test_size parameter according to your preference for the size of the testing set.
In this article, I will assume you have some prior knowledge about data science with Python (scikit-learn, pandas, data preprocessing, etc.). If you don’t have this knowledge, check my other stories about data science and come back here later.
Bagging
Bagging works by creating multiple subsets of the original dataset through a technique called bootstrapping. Bootstrapping involves randomly sampling observations from the original dataset with replacement, which means that some observations may be selected multiple times while others may not be selected at all. Each subset is then used to train a separate model.
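To make bootstrapping concrete, here is a minimal sketch of how a single bootstrapped subset could be drawn, assuming the pandas X_train and y_train from the prerequisites section:
import numpy as np
# Draw row indices with replacement: some rows appear several times, others not at all
rng = np.random.default_rng(42)
n_samples = len(X_train)
bootstrap_idx = rng.choice(n_samples, size=n_samples, replace=True)
# One bootstrapped training subset (bagging repeats this for each model in the ensemble)
X_bootstrap = X_train.iloc[bootstrap_idx]
y_bootstrap = y_train.iloc[bootstrap_idx]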
The most famous implementation of bagging is the random forest algorithm. It is based on the concept of decision trees, where each tree is built using a random subset of features and a random subset of the data.
In Python, we just have to import sklearn.ensemble, and we can then use the RandomForestClassifier or the RandomForestRegressor.
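For instance, a random forest classifier on the iris dataset could look like this (a minimal sketch, where the dataset and hyperparameters are only illustrative):
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 100 trees, each trained on a bootstrapped sample with a random subset of features per split
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))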
Another variant is the Extra Trees algorithm, short for Extremely Randomized Trees. Like Random Forest, it builds an ensemble of decision trees, but by default it trains each tree on the whole training set rather than on bootstrapped samples, and it further randomizes the construction of each tree by choosing random split thresholds for each feature instead of searching for the best split.
Implementing Extra Trees in Python follows the same pattern as Random Forest. The classes are ExtraTreesClassifier and ExtraTreesRegressor.
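Reusing the iris split from the random forest sketch above, swapping in Extra Trees could look like this:
from sklearn.ensemble import ExtraTreesClassifier
# Same interface as RandomForestClassifier; split thresholds are drawn at random
et = ExtraTreesClassifier(n_estimators=100, random_state=42)
et.fit(X_train, y_train)
print("Extra Trees accuracy:", et.score(X_test, y_test))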
Boosting: AdaBoost and Gradient Boosting
Boosting works by iteratively training models, where each subsequent model aims to correct the mistakes made by the previous models. The training data for each model is reweighted, with more emphasis given to the instances that were misclassified previously.
AdaBoost, short for Adaptive Boosting, is a boosting algorithm that assigns weights to training instances and adjusts them in each iteration based on the misclassification rate. It focuses on improving the accuracy of the instances that are difficult to classify by giving them higher weights. The final model is an aggregation of weak models, where each model contributes with a weight proportional to its performance.
Implementing AdaBoost in Python is straightforward using scikit-learn. We have AdaBoostClassifier and AdaBoostRegressor.
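Here is a minimal sketch on the same iris split as above (the number of estimators and learning rate are only illustrative values):
from sklearn.ensemble import AdaBoostClassifier
# Each boosting round gives more weight to the training instances that were misclassified
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))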
Then, Gradient Boosting is another powerful boosting algorithm that builds models in a sequential manner. Instead of adjusting instance weights, it focuses on minimizing a loss function by optimizing the model’s parameters. In each iteration, the new model is trained to minimize the residual errors of the previous models. This approach makes Gradient Boosting effective in both classification and regression tasks.
Implementing Gradient Boosting in Python can be done using libraries such as XGBoost or LightGBM. Let’s try XGBoost.
pip install xgboost
Then, for a regression, our code would look like this:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
california = fetch_california_housing()
X, y = california.data, california.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Convert the data into DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Define the parameters for the XGBoost model
params = {
    'objective': 'reg:squarederror',  # Use squared error for regression
    'max_depth': 3,                   # Maximum depth of each tree
    'eta': 0.1,                       # Learning rate
    'gamma': 0.1,                     # Minimum loss reduction required to make a further partition
    'subsample': 0.8,                 # Subsample ratio of the training instances
    'colsample_bytree': 0.8,          # Subsample ratio of features when constructing each tree
    'eval_metric': 'rmse'             # Evaluation metric to use
}
# Train the XGBoost model
num_rounds = 100 # Number of boosting rounds
model = xgb.train(params, dtrain, num_rounds)
# Make predictions on the test set
y_pred = model.predict(dtest)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
For classification, it would be the same code, except that the params dict and the metric we use to evaluate the model would be different:
params = {
    'objective': 'multi:softmax',  # Use softmax for multiclass classification
    'num_class': 3,                # Number of classes
    'max_depth': 3,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'merror'
}
# With a classification dataset (e.g. iris instead of California housing),
# we evaluate with accuracy instead of mean squared error
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
Stacking
Stacking goes beyond simple averaging or voting of individual model predictions. It involves training multiple diverse models on the same data and then combining their predictions using a meta-model. The meta-model learns to weigh the predictions of the base models, taking into account their individual strengths and weaknesses.
To implement stacking in Python, start by building a set of base models. These can be any machine learning models of your choice, such as decision trees, random forests, or support vector machines. Train each base model on the training data and obtain their predictions for both the training and test data (ideally using out-of-fold predictions on the training data, so the meta-model is not trained on predictions the base models made on rows they had already seen).
After obtaining the predictions from the base models, the next step is to create a meta-model. This meta-model takes the predictions as input features and learns to make the final predictions. It can be any machine learning algorithm, such as logistic regression, neural networks, or gradient boosting. The meta-model is trained using the training data along with the base model predictions.
For example, here is the code for a classification task, using out-of-fold predictions on the training set so the meta-model never sees the test data during training:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the base models
base_model_1 = RandomForestClassifier(random_state=42)
base_model_2 = LogisticRegression(random_state=42, max_iter=1000)  # higher max_iter avoids convergence warnings
# Out-of-fold predictions on the training set: each prediction comes from a model that did not see that row
train_pred_1 = cross_val_predict(base_model_1, X_train, y_train, cv=5)
train_pred_2 = cross_val_predict(base_model_2, X_train, y_train, cv=5)
meta_features_train = np.column_stack((train_pred_1, train_pred_2))
# Refit the base models on the full training set
base_model_1.fit(X_train, y_train)
base_model_2.fit(X_train, y_train)
# Train the meta-model on the base model predictions
meta_model = LogisticRegression(random_state=42)
meta_model.fit(meta_features_train, y_train)
# Build the meta-features for the test set and make the final predictions
meta_features_test = np.column_stack((base_model_1.predict(X_test), base_model_2.predict(X_test)))
stacked_pred = meta_model.predict(meta_features_test)
# Evaluate the performance of the stacked model
stacked_accuracy = accuracy_score(y_test, stacked_pred)
print("Stacked Model Accuracy:", stacked_accuracy)
Voting
Voting involves training multiple models on the same data and aggregating their predictions using a voting rule. There are two types of voting: hard voting and soft voting.
Hard voting counts the class labels predicted by each model and selects the class label that receives the majority of votes. Soft voting takes the average probabilities predicted by each model for each class and selects the class with the highest average probability as the final prediction.
Voting is particularly effective when the models used have different strengths and weaknesses.
To implement voting in Python, start by creating instances of the individual models that you want to include in the ensemble. These can be any machine learning models, such as logistic regression, decision trees, or support vector machines.
Next, create an instance of the voting classifier or regressor depending on the task at hand. Specify the type of voting, whether it’s hard or soft, and provide the list of models to include in the ensemble. Fit the voting classifier to the training data.
Finally, you can use the trained voting classifier to make predictions on new data.
Here is the code example:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC()
# Create the voting classifier
voting_classifier = VotingClassifier(
    estimators=[('lr', model1), ('dt', model2), ('svm', model3)],
    voting='hard'  # 'soft' averages predicted probabilities instead (SVC would then need probability=True)
)
voting_classifier.fit(X_train, y_train)
y_pred = voting_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Evaluating and Fine-tuning Ensemble Models
When evaluating ensemble models, common performance metrics include accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC).
We can combine this with cross-validation. It involves splitting the dataset into multiple folds and iteratively training and testing the models on different combinations of these folds. Cross-validation helps estimate the model’s performance on unseen data.
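For example, 5-fold cross-validation of a random forest on the iris data from the previous sections could look like this (a minimal sketch; the scoring argument can be swapped for any of the metrics above, e.g. 'f1_macro'):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Train and evaluate the model on 5 different train/validation splits
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5, scoring='accuracy')
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())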
Another way to fine-tune ensemble models is hyperparameter tuning. Two popular approaches for hyperparameter tuning are grid search and random search.
Grid search involves defining a grid of hyperparameter values and exhaustively searching through all possible combinations. It evaluates the model performance for each combination and selects the set of hyperparameters that yield the best results. Grid search is comprehensive but can be computationally expensive.
A grid of hyperparameters is something like this:
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.1, 1, 10]
}
We can then perform grid search:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
model = SVC()  # the example grid above (C, kernel, gamma) corresponds to an SVC
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best hyperparameters: ", grid_search.best_params_)
print("Best model: ", grid_search.best_estimator_)
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print("Accuracy: ", accuracy)
Random search, on the other hand, randomly samples from a predefined search space of hyperparameters. It performs multiple iterations, evaluating the model performance for each random combination of hyperparameters. Random search is more efficient than grid search and often leads to similar or even better results.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}
rf = RandomForestClassifier()
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=10,           # Number of parameter settings that are sampled
    scoring='accuracy',  # Scoring metric to evaluate the models
    cv=5,
    random_state=42
)
random_search.fit(X, y)
print("Best Parameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)
Final Note
Ensemble methods are a powerful way to improve predictive accuracy and robustness. Fortunately, implementing ensemble methods in Python is made easy by libraries such as scikit-learn and XGBoost.
In an upcoming article, we’ll see a use case of ensemble methods to put into practice what you’ve learned today!