Esteban Thilliez

Summary

The provided web content discusses ensemble methods in data science with Python, detailing their benefits, types, implementation, and evaluation for improved predictive performance.

Abstract

The article "Data Science with Python — Ensemble Methods" delves into the concept of ensemble methods, which combine multiple machine learning models to enhance overall prediction accuracy and robustness. It outlines the four main categories of ensemble methods: bagging, boosting, stacking, and voting. The author explains the advantages of ensemble methods, such as reduced bias and variance, increased robustness against overfitting, and better handling of complex data. Specific techniques like Random Forests and Extra Trees for bagging, AdaBoost and Gradient Boosting for boosting, and the use of meta-learners for stacking are described. The article also covers practical aspects, including the prerequisites for implementing these methods, such as installing sklearn and pandas, and provides code examples for bagging, boosting, and stacking using Python libraries. Additionally, it touches on the importance of evaluating ensemble models using metrics like accuracy, precision, recall, F1 score, and AUC-ROC, as well as the use of cross-validation and hyperparameter tuning for fine-tuning. The article concludes with a note on the power of ensemble methods and a hint at a future use case example.

Opinions

  • The author posits that ensemble methods are superior to individual models due to their ability to leverage collective wisdom.
  • There is an emphasis on the practicality and ease of implementing ensemble methods in Python, thanks to libraries such as sklearn, XGBoost, and LightGBM.
  • The author suggests that readers should have prior knowledge of data science with Python, including familiarity with sklearn and pandas, to fully understand the content.
  • The article conveys that stacking is more sophisticated than simple averaging or voting, as it uses a meta-model to combine predictions based on individual model strengths.
  • The author expresses a preference for random search over grid search for hyperparameter tuning due to its efficiency and often superior results.
  • There is an opinion that ensemble methods are not only powerful but also accessible for practitioners to implement and benefit from in their data science projects.

Data Science with Python — Ensemble Methods

Photo by Pietro Jeng on Unsplash

This article is part of the “Data Science with Python” series. You can find the other stories of this series below:

Have you ever heard of ensemble methods? They allow us to combine the predictions of multiple models to achieve better overall performance in data science.

We will explore this concept and discover how we can build a model based on ensemble methods today, with Python.

What Are Ensemble Methods?

Ensemble methods in data science refer to the combination of multiple machine learning models to make predictions or decisions. Instead of relying on a single model, ensemble methods leverage the collective wisdom of several models to improve overall accuracy and performance.

There are several advantages to using ensemble methods in data science:

  • Improved Accuracy: Ensemble methods can often achieve higher accuracy than individual models by reducing bias and variance. They can compensate for the weaknesses of individual models and produce more reliable predictions.
  • Increased Robustness: Ensemble methods are less susceptible to overfitting, a common problem in machine learning. They can reduce the impact of outliers and noise in the data, leading to more robust and generalizable models.
  • Better Handling of Complex Data: Ensemble methods are particularly effective when dealing with complex datasets. Indeed, they can capture different aspects of the data and provide a more comprehensive understanding of the complex underlying patterns.

Most ensemble methods fall into four categories.

First, we have “bagging”. It is short for bootstrap aggregating and involves training multiple models on different subsets of the training data. Each model produces its own prediction, and the final prediction is determined by combining the predictions of all models. Bagging is often used with decision trees to create random forests.

Then, we have “boosting”, an iterative ensemble method that focuses on sequentially improving the performance of individual models. Models are trained in a step-wise manner, where each subsequent model corrects the mistakes made by previous models. Some famous boosting algorithms are AdaBoost and Gradient Boosting; maybe you have already heard of them?

We also have “stacking”, also known as stacked generalization. It combines the predictions of multiple models using another model called a meta-learner or blender. The meta-learner learns to combine the predictions of base models to produce the final prediction. Stacking can be seen as a two-level learning process, where the base models learn from the data, and the meta-learner learns from the predictions of the base models.

Finally, “voting”, as the name suggests, involves aggregating the predictions of multiple models through a voting mechanism. There are different types of voting, such as majority voting, weighted voting, and soft voting, each with its own rules for combining predictions.

Prerequisites

If you’ve checked the other articles in this series, you probably already know everything covered here, and you can skip this section.

Otherwise, you should start by installing sklearn with pip install scikit-learn. You should also install pandas with pip install pandas.

Then, you can dive into the code. As a reminder, to import a dataset and split it into training and testing sets, we use this code:

import pandas as pd
from sklearn.model_selection import train_test_split


data = pd.read_csv('dataset.csv')

X = data.drop('target', axis=1)
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In this code, ‘target’ represents the target variable in your dataset that you want to predict. Adjust the test_size parameter according to your preference for the size of the testing set.

In this article, I will suppose you have some prior knowledge about data science with Python (sklearn, pandas, data preprocessing, etc…). If you don’t have this knowledge, check my other stories about data science and come back here later. You can find all of them below:

Bagging

Bagging works by creating multiple subsets of the original dataset through a technique called bootstrapping. Bootstrapping involves randomly sampling observations from the original dataset with replacement, which means that some observations may be selected multiple times while others may not be selected at all. Each subset is then used to train a separate model.

The most famous implementation of bagging is the random forest algorithm. It is based on the concept of decision trees, where each tree is built using a random subset of features and a random subset of the data.

In Python, we just have to import sklearn.ensemble and we can then use RandomForestClassifier or RandomForestRegressor.
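For example, here is a minimal sketch for classification, reusing the X_train/X_test split from the prerequisites section (the number of trees is only an illustrative value):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Each tree is trained on a bootstrapped sample, with random feature subsets considered at each split
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
print("Random Forest accuracy:", accuracy_score(y_test, y_pred))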

Another variant of bagging is the Extra Trees algorithm, short for Extremely Randomized Trees. Like Random Forest, it builds an ensemble of decision trees, but it pushes the randomization further: by default each tree is trained on the whole training set rather than a bootstrapped sample, and split thresholds are drawn at random for each candidate feature instead of searching for the best split.

Implementing Extra Trees in Python follows a similar pattern to Random Forest. The classes are ExtraTreesClassifier and ExtraTreesRegressor.
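The sketch below assumes the same train/test split as before; ExtraTreesClassifier is essentially a drop-in replacement for RandomForestClassifier:

from sklearn.ensemble import ExtraTreesClassifier

# Same interface as the random forest, only the tree-building strategy changes
extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=42)
extra_trees.fit(X_train, y_train)
print("Extra Trees accuracy:", extra_trees.score(X_test, y_test))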

Boosting: AdaBoost and Gradient Boosting

Boosting works by iteratively training models, where each subsequent model aims to correct the mistakes made by the previous models. The training data for each model is reweighted, with more emphasis given to the instances that were misclassified previously.

AdaBoost, short for Adaptive Boosting, is a boosting algorithm that assigns weights to training instances and adjusts them in each iteration based on the misclassification rate. It focuses on improving the accuracy of the instances that are difficult to classify by giving them higher weights. The final model is an aggregation of weak models, where each model contributes with a weight proportional to its performance.

Implementing AdaBoost in Python is straightforward using scikit-learn. We have AdaBoostClassifier or AdaBoostRegressor.
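As a minimal sketch for classification, again reusing the earlier train/test split (the number of estimators and learning rate shown are only illustrative):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Sequentially fit weak learners (decision stumps by default), reweighting misclassified samples
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)

y_pred = ada.predict(X_test)
print("AdaBoost accuracy:", accuracy_score(y_test, y_pred))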

Then, Gradient Boosting is another powerful boosting algorithm that builds models in a sequential manner. Instead of adjusting instance weights, it focuses on minimizing a loss function by optimizing the model’s parameters. In each iteration, the new model is trained to minimize the residual errors of the previous models. This approach makes Gradient Boosting effective in both classification and regression tasks.

Implementing Gradient Boosting in Python can be done using libraries such as XGBoost or LightGBM. Let’s try XGBoost.

pip install xgboost

Then, for a regression, our code would look like this:

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


california = fetch_california_housing()
X, y = california.data, california.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert the data into DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Define the parameters for the XGBoost model
params = {
    'objective': 'reg:squarederror',  # Use squared error for regression
    'max_depth': 3,                   # Maximum depth of each tree
    'eta': 0.1,                       # Learning rate
    'gamma': 0.1,                     # Minimum loss reduction required to make a further partition
    'subsample': 0.8,                 # Subsample ratio of the training instances
    'colsample_bytree': 0.8,          # Subsample ratio of features when constructing each tree
    'eval_metric': 'rmse'             # Evaluation metric to use
}

# Train the XGBoost model
num_rounds = 100  # Number of boosting rounds
model = xgb.train(params, dtrain, num_rounds)

# Make predictions on the test set
y_pred = model.predict(dtest)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

For classification, it would be the same code, except the params dict and the metric we use to evaluate the model would be different:

from sklearn.metrics import accuracy_score

params = {
    'objective': 'multi:softmax',   # Use softmax for multiclass classification
    'num_class': 3,                 # Number of classes
    'max_depth': 3,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'merror'         # Multiclass classification error rate
}

# After training on a classification dataset and predicting as above:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Stacking

Stacking goes beyond simple averaging or voting of individual model predictions. It involves training multiple diverse models on the same data and then combining their predictions using a meta-model. The meta-model learns to weigh the predictions of the base models, taking into account their individual strengths and weaknesses.

To implement stacking in Python, start by building a set of base models. These can be any machine learning models of your choice, such as decision trees, random forests, or support vector machines. Train each base model on the training data and obtain their predictions for both the training and test data.

After obtaining the predictions from the base models, the next step is to create a meta-model. This meta-model takes the predictions as input features and learns to make the final predictions. It can be any machine learning algorithm, such as logistic regression, neural networks, or gradient boosting. The meta-model is trained using the training data along with the base model predictions.

For example, here is the code for a classification:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


data = load_iris()
X, y = data.data, data.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base models
base_model_1 = RandomForestClassifier(random_state=42)
base_model_2 = LogisticRegression(random_state=42)

# Train the base models
base_model_1.fit(X_train, y_train)
base_model_2.fit(X_train, y_train)

# Make predictions using the base models
pred_1 = base_model_1.predict(X_test)
pred_2 = base_model_2.predict(X_test)

# Create a new training set with the predictions from the base models
meta_features = np.column_stack((pred_1, pred_2))

# Train the meta-model (here, for simplicity, on the base models' test-set predictions)
meta_model = LogisticRegression(random_state=42)
meta_model.fit(meta_features, y_test)

# Make predictions using the stacked model
stacked_pred = meta_model.predict(meta_features)

# Evaluate the performance of the stacked model
stacked_accuracy = accuracy_score(y_test, stacked_pred)
print("Stacked Model Accuracy:", stacked_accuracy)

Voting

Voting involves training multiple models on the same data and aggregating their predictions using a voting rule. There are two types of voting: hard voting and soft voting.

Hard voting counts the class labels predicted by each model and selects the class label that receives the majority of votes. Soft voting takes the average probabilities predicted by each model for each class and selects the class with the highest average probability as the final prediction.

Voting is particularly effective when the models used have different strengths and weaknesses.

To implement voting in Python, start by creating instances of the individual models that you want to include in the ensemble. These can be any machine learning models, such as logistic regression, decision trees, or support vector machines.

Next, create an instance of the voting classifier or regressor depending on the task at hand. Specify the type of voting, whether it’s hard or soft, and provide the list of models to include in the ensemble. Fit the voting classifier to the training data.

Finally, you can use the trained voting classifier to make predictions on new data.

Here is the code example:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


iris = load_iris()
X, y = iris.data, iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


model1 = LogisticRegression()
model2 = DecisionTreeClassifier()
model3 = SVC()

# Create the voting classifier
voting_classifier = VotingClassifier(
    estimators=[('lr', model1), ('dt', model2), ('svm', model3)],
    voting='hard'  # Use 'soft' to average predicted class probabilities instead
)

voting_classifier.fit(X_train, y_train)

y_pred = voting_classifier.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Evaluating and Fine-tuning Ensemble Models

When evaluating ensemble models, common performance metrics include accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC).

We can combine this with cross-validation. It involves splitting the dataset into multiple folds and iteratively training and testing the models on different combinations of these folds. Cross-validation helps estimate the model’s performance on unseen data.
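As a quick sketch (assuming the X and y arrays from one of the earlier examples), cross_val_score runs this loop for you; the scoring string can be swapped for 'precision_macro', 'recall_macro', 'f1_macro', or 'roc_auc_ovr' depending on the metric you care about:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validation: the model is trained and tested on 5 different splits
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print("Cross-validation accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))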

Another fine-tuning technique is hyperparameter tuning. Two popular approaches for hyperparameter tuning are grid search and random search.

Grid search involves defining a grid of hyperparameter values and exhaustively searching through all possible combinations. It evaluates the model performance for each combination and selects the set of hyperparameters that yield the best results. Grid search is comprehensive but can be computationally expensive.

A grid of hyperparameters is something like this:

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': [0.1, 1, 10]
}

We can then perform grid search. Since the grid above contains SVC hyperparameters, let's tune an SVC as an example:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The param_grid above matches an SVC, so we tune one here
model = SVC()

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best hyperparameters: ", grid_search.best_params_)
print("Best model: ", grid_search.best_estimator_)

best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print("Accuracy: ", accuracy)

Random search, on the other hand, randomly samples from a predefined search space of hyperparameters. It performs multiple iterations, evaluating the model performance for each random combination of hyperparameters. Random search is more efficient than grid search and often leads to similar or even better results.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

rf = RandomForestClassifier()

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=10,  # Number of parameter settings that are sampled
    scoring='accuracy',  # Scoring metric to evaluate the models
    cv=5,  
    random_state=42  
)


random_search.fit(X, y)


print("Best Parameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)

Final Note

Ensemble methods are powerful for improving predictive accuracy and robustness. And fortunately, implementing ensemble methods in Python is made easy with libraries such as scikit-learn or XGBoost.

In a future article, we’ll see a use case of ensemble methods to put into practice what you’ve learned today!
