A Deep Dive into Stacking Ensemble Machine Learning — Part I
How to use stacking effectively in machine learning by fully understanding what stacking is and how it works

Background
I had read about stacking in several books and had also looked up stacking examples on Kaggle and on other websites found through online searches.
It was clear from this reading and research that stacking has the potential to increase the accuracy of predictive algorithms, improving leaderboard results on Kaggle as well as the accuracy and impact of machine learning algorithms in the real world.
The main problem was that nothing I read adequately explained what stacking really is or how it works. Another issue was that the articles, blogs, books and documentation contradicted one another in their detailed explanations and implementations of stacking.
This left me wanting to know more, both to satisfy my own curiosity and to understand when stacking is an appropriate technique and when it should be avoided.
Thus began a period of reading and research together with practical experimentation using Python and Jupyter Notebooks. These are a few of the most useful sources I identified during the investigation -
- Kaggle (https://www.kaggle.com/).
- The scikit-learn documentation (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html).
- Various tutorials on “Machine Learning Mastery” (https://machinelearningmastery.com/).
- The book “Approaching (Almost) Any Machine Learning Problem” by Abhishek Thakur (https://www.amazon.co.uk/Approaching-Almost-Machine-Learning-Problem-ebook/dp/B089P13QHT).
Overview
The following steps are a visual representation of the closest thing I have found to a consensus of opinion across the various sources I have studied on stacking -
Step 1: Split the Data into Training and Testing / Validation Datasets

The first step is straightforward to visualise and replicates a common first step in machine learning. The training data will be used to build the stacking model, and the testing / validation data will be held back and used to evaluate performance.
In the diagram above the rectangle representing the data is split into two along the vertical axis. The larger section represents the features and the smaller section / column at the end represents the target.
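To make the step concrete, here is a minimal sketch using scikit-learn’s train_test_split. The dataset is a synthetic stand-in generated with make_classification, purely for illustration; the split proportion and random seed are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 10 features, binary target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold back 25% of the rows as the testing / validation dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```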
Step 2a: Train the Level 0 Models
Stacking is a two-layer model. This diagram represents visually what goes on in “Level 0”, which is the first stage -

Essentially what is going on here is an ingenious piece of feature engineering.
A traditional example of feature engineering would be the creation of a new column called “Distance Travelled” by multiplying two pre-existing columns — “Speed” and “Time”. It might be that engineering this new feature provides relevant information that improves the predictive performance.
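As a concrete (and entirely hypothetical) illustration of that traditional kind of feature engineering in pandas -

```python
import pandas as pd

# Two pre-existing columns: speed (mph) and time (hours).
df = pd.DataFrame({"Speed": [30.0, 50.0, 70.0], "Time": [2.0, 1.5, 0.5]})

# Engineer the new feature by multiplying the existing two.
df["Distance Travelled"] = df["Speed"] * df["Time"]
```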
In the Level 0 stage of stacking, a collection of predictive models is used to independently predict the values of the target from the original features of the training dataset. Those predictions are then added to the training data as new features.
Different sources are contradictory about how this stage works in detail. Abhishek Thakur states that the training data should be folded, so that each new feature value is an out-of-fold prediction, whilst the scikit-learn documentation states the opposite -
“Note that estimators are fitted on the full X” (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html)
In practice I have tried both ways, and taking the scikit-learn approach significantly improves predictive performance on the datasets I have used. I also favour the scikit-learn approach because it makes Step 4 much more intuitive.
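Continuing the sketch from Step 1, here is what Level 0 might look like, following the scikit-learn approach of fitting every model on the full training data. The three classifiers are illustrative choices only; any diverse set of models could be used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Three illustrative Level 0 models.
level0_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=42),
    KNeighborsClassifier(),
]

# Following the scikit-learn convention: each Level 0 model is
# fitted independently on the full training data (no folding).
for model in level0_models:
    model.fit(X_train, y_train)
```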
Step 2b: Fine-tuning for Step 2a
The final issue to consider is exactly what to use from the Level 0 Models to transform the training data, and there are several options.
The first decision is whether to replace the existing features entirely, so that the training data comprises just the Level 0 Model predictions, or to append the new features to the original training data.
The diagram above shows the predictions being appended to the data, and in practice I have found that retaining the original features significantly improves the performance of the completed stacking model.
The second decision is exactly what data to use in the prediction columns.
In a regression the continuous predicted values are used directly, but in a classification there are more choices available -
The first option is simply to use the predicted classes. In a binary classification, each row of the prediction columns above (shown as orange, blue and green) would contain either 1 or 0, based on the Level 0 Model prediction.
However, it turns out that if you use the predicted probability instead of the predicted class, the performance of the stacking model improves significantly. For a binary classifier the predicted probability of either class zero or class one can be used; the two are perfectly collinear (for each row they sum to 1), so it makes no difference to the outcome which one is selected.
In summary, and based on my experimentation, if the machine learning algorithm is solving a binary classification problem then fine-tune as follows (a code sketch follows the list) -
- Retain the original features and append the predictions as additional features.
- Use the predicted probabilities for class=1 as the new feature values rather than the direct class predictions.
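Continuing the sketch, this is how those two fine-tuning choices might be applied: the original features are retained and each model’s class=1 probability is appended as a new column.

```python
import numpy as np

# One new column per Level 0 model: the predicted probability
# of class 1 on the training data.
probability_columns = [
    model.predict_proba(X_train)[:, 1] for model in level0_models
]

# Append the new columns, retaining the original features.
X_train_stacked = np.column_stack([X_train] + probability_columns)

print(X_train.shape)          # (750, 10)
print(X_train_stacked.shape)  # (750, 13) - one extra column per model
```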
Step 3: Train the Level 1 Model
Now that the new features have been added by Steps 1 and 2, it is time to train the “Level 1 Model”, also referred to as the “final estimator” in some sources -

This stage is very straightforward compared to the previous steps. The Level 1 Model is simply fitted to the transformed training data, and with that we have our trained stacking model, ready to make predictions.
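In the running sketch, Step 3 really is just two lines; the choice of logistic regression as the Level 1 Model is illustrative, not prescriptive.

```python
from sklearn.linear_model import LogisticRegression

# Fit the Level 1 Model ("final estimator") to the transformed
# (i.e. augmented) training data built in Step 2.
level1_model = LogisticRegression(max_iter=1000)
level1_model.fit(X_train_stacked, y_train)
```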
Step 4: Make Predictions for the Testing / Validation Data

OK, so it looks a bit scary, but it is actually quite straightforward.
The test / validation data used as the input to the trained “Level 1 Model” must have the same shape (in terms of the number and order of features) as the training data used to fit it, and as it happens that is very easy to achieve.
The trained “Level 0 Models” are applied sequentially to the test / validation data to add the model predictions as new features, and in this way the shape of the training and test / validation data will match.
The trained “Level 1 Model” is then applied to the transformed test data to provide a final set of predictions produced by the stacking model.
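Completing the running sketch, Step 4 repeats the Step 2 transformation on the test / validation data before the Level 1 Model makes the final predictions -

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Transform the test data with the fitted Level 0 models so its
# shape matches the augmented training data...
test_probability_columns = [
    model.predict_proba(X_test)[:, 1] for model in level0_models
]
X_test_stacked = np.column_stack([X_test] + test_probability_columns)

# ...then the trained Level 1 Model produces the final predictions.
y_pred = level1_model.predict(X_test_stacked)
print(accuracy_score(y_test, y_pred))
```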
Conclusion
Stacking is conceptually difficult to grasp; at least, that was my experience before I spent a lot of time reading, researching and experimenting.
However, once you have understood stacking it is relatively straightforward to apply in practice, and that will be the subject of Parts II and III in this series of articles.
In Part II of this series we will use the scikit-learn library to implement a stacking model, to improve our understanding and to evaluate overall performance.
Part III will build a stacking algorithm from scratch and in full, to complete a deep understanding of stacking and how it works in detail.
Final Word
I certainly struggled initially to fully understand stacking, and I did not attain a good understanding until I had studied the scikit-learn implementation and then built my own stacking algorithm from scratch.
Hopefully this article, together with Parts II and III, will help others to achieve that understanding without having to carry out all of the research, and will enable informed choices about where and how to implement stacking to achieve optimised performance in predictive machine learning algorithms.
Thank you for reading!
If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/? Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to data science and data analytics.
If you would like to get in touch to discuss any of these topics please look me up on LinkedIn — https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at [email protected].
If you would like to support the author and thousands of others who contribute to article writing worldwide by subscribing, please use the following link (note: the author will receive a proportion of the fees if you sign up using this link at no extra cost to you).