Neural Ensemble: What's Better than a Neural Network? A Group of Them
Neural ensembles: how to combine different neural networks into a powerful model
Tree-based models are still the first choice for tabular data. Although a single decision tree is simple, a group of decision trees is a phenomenal model. This is why ensembles have been so successful, and why models such as XGBoost are still the most widely used for tabular data.
In the first article in this series, we discussed why tabular data are complex, what the limitations of neural networks are, and why decision trees seem to have an advantage. We also explained why neural networks would nevertheless be desirable.
In the second article, we discussed why neural networks natively struggle with categorical variables and what strategies we can use to overcome this problem.
As mentioned earlier, ensembles have proven successful. If they work so well with decision trees, why shouldn't they work with neural networks?
Check the list of references at the end of the article; I also provide some suggestions for exploring the topics in more depth.
The wisdom of the crowd
Strength lies in differences, not in similarities. — Stephen Covey
Each model has its merits and flaws, so it isn't always easy to choose the appropriate method. This is especially true for neural networks, which generally have high variance. Being nonlinear, neural networks can learn complex relationships, but this flexibility comes at a price: they are sensitive to many factors, including the random initialization of the parameters and the noise present in the training set.
So each time we train a neural network, it can learn a different function between input and output. Moreover, each run can pick up different patterns, some of which are present only in the training dataset and are therefore detrimental to the model's ability to generalize.
Not to mention that a neural network can get stuck in a local minimum and thus never reach the optimum. The network could perform well on the validation set but fail to generalize to the test set, forcing us to re-train the model:
… train many different candidate networks and then to select the best, […] and to discard the rest. There are two disadvantages with such an approach. First, all of the effort involved in training the remaining networks is wasted. Second, […] the network which had best performance on the validation set might not be the one with the best performance on new test data. (source)
Neural networks have low bias but high variance. As we saw in the previous article, training different models and combining them reduces variance. This is because each model learns something different; each is right in some respects and wrong in others. More formally:
The reason that model averaging works is that different models will usually not make all the same errors on the test set. (source)
This has also been confirmed empirically: combining the outputs of different models gives better results than a single model.
Now that we know that an ensemble is more convenient, how can we build an ensemble?
To build a strong ensemble we need a way to increase diversity (the more diverse, the better). There are three strategies, which we will address in detail in the next sections:
- Vary the training data for each model.
- Vary the models present in the ensemble.
- Combine the outputs of the various models differently.
Not everyone knows everything
When spiders unite, they can tie down a lion. — Unknown
Varying which data a model has access to is the simplest strategy for building an ensemble. K-fold cross-validation (CV) can already be seen as an ensemble: each model is trained on a different portion of the data (5-fold CV means training five models, each on a different 4/5 of the same dataset). These models can be saved and used together as an ensemble.
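To make this concrete, here is a minimal sketch of reusing the five fold-models from 5-fold CV as a small ensemble. The scikit-learn setup and the synthetic dataset are my own illustrative choices, not the setup of the articles cited here.

```python
# Minimal sketch: reuse the k models trained during 5-fold CV as an ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])   # each model sees 4/5 of the data
    models.append(model)

# Average the predicted probabilities of the five fold-models.
ensemble_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
accuracy = (ensemble_proba.argmax(axis=1) == y_test).mean()
print(f"Ensemble accuracy: {accuracy:.3f}")
```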
Since data sampling strongly influences ensemble performance, different sampling strategies have been developed. They fall mainly into two groups:
- Dependent sampling, where the subsets obtained during sampling depend on each other.
- Independent sampling, where the subsets are drawn independently of each other. The advantage is that one subset is not affected by the performance obtained on the other subsets.
For both strategies, the difficulty lies in deciding the size of the subsets and how many of them to use. In any case, I will now discuss some of these methods in detail.
Bagging or bootstrap aggregating
Bagging is one of the standard techniques for constructing subsets. The idea, in a nutshell, is to build from the initial dataset several subsets of the same size and with the same distribution as the original data. These subsets are obtained by sampling with replacement (so there can be duplicate examples). Models trained on different subsets will have different generalization errors. After training, the predictions are generally aggregated by majority voting.
- Bagging is a relatively simple technique, has the advantage of reducing variance, and works well on high-dimensional data.
- Bagging can be used for unbalanced datasets, where the subsampling strategy is adapted to obtain balanced subsets.
- It has the disadvantage of being expensive, having high bias, and reducing the interpretability of a model. Although expensive, it still has the advantage that it can be parallelized.
- Random forest is an improved form of bagging, in which each tree is trained not only on a different subset but also, at each split, on a random subset of the features (to prevent overfitting).
- Variants of bagging are dagging (disjoint subsets obtained by random sampling without replacement) and wagging (random weights assigned to the training samples using a Gaussian or Poisson distribution).
Deep learning ensembles constructed with bagging exist in the literature. Some studies have shown not only that they outperform a single neural network but also that bagging can be seen as a regularization technique for neural networks. An ensemble of convolutional neural networks has also been shown to be superior to a single base learner.
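As a minimal sketch of the idea, the snippet below bags ten small neural networks with scikit-learn's BaggingClassifier on a synthetic dataset; the data and hyperparameters are illustrative placeholders, not the settings of the cited studies.

```python
# Minimal sketch of bagging with small neural networks as base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 10 networks is trained on a bootstrap sample (drawn with replacement).
bagged_nets = BaggingClassifier(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    n_estimators=10,
    bootstrap=True,
    n_jobs=-1,        # bagging parallelizes easily
    random_state=0,
)
bagged_nets.fit(X_train, y_train)
print("Bagged ensemble accuracy:", bagged_nets.score(X_test, y_test))
```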
Boosting
Boosting was first described in 1997 as a sequential process where each subsequent model attempts to correct the errors of the previous one. In boosting, each weak learner gives greater importance to the observations that the previous weak learner misclassified.
- Boosting increases model interpretability and reduces variance and bias.
- As a disadvantage, each subsequent learner must fix the error of the previous learner. It is difficult to scale and parallelize, more prone to overfitting than bagging, and many hyperparameters exist.
- Today, several implementations have made it much more efficient, and it is the first choice for many applications.
There have been extensions of boosting to ensembles of neural networks. One of the earliest examples was boosted convolutional neural networks (CNNs), which ran into overfitting when the dataset was not large enough. Incremental boosting CNN is a later variant in which the outputs of the fully connected layers are used as input features for an incremental classifier that is selected and updated by boosting.
In any case, there are also other approaches combining CNNs, transfer learning, and boosting. For example, AdaBoost-CNN has shown excellent results on unbalanced datasets and efficient training by exploiting transfer learning.
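To make the re-weighting idea concrete, here is a minimal AdaBoost-style sketch with small networks as weak learners. It is a simplified illustration, not the method of the cited papers: since scikit-learn's MLPClassifier does not accept per-sample weights, boosting is done here by resampling, and all settings are made up.

```python
# Minimal AdaBoost-style sketch: each round re-weights the samples that the
# previous (small) network got wrong, and the final prediction is a weighted vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n = len(X_tr)
w = np.full(n, 1.0 / n)                         # start with uniform sample weights
learners, alphas = [], []

for round_ in range(5):
    idx = rng.choice(n, size=n, replace=True, p=w)           # resample by weight
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=round_)
    clf.fit(X_tr[idx], y_tr[idx])
    pred = clf.predict(X_tr)
    err = np.clip(w[pred != y_tr].sum(), 1e-10, 1 - 1e-10)   # weighted error
    alpha = 0.5 * np.log((1 - err) / err)                    # learner importance
    w *= np.exp(-alpha * y_tr * pred)                        # up-weight mistakes
    w /= w.sum()
    learners.append(clf)
    alphas.append(alpha)

# Final prediction: weighted vote of the weak learners.
score = sum(a * clf.predict(X_te) for a, clf in zip(alphas, learners))
print("Boosted accuracy:", (np.sign(score) == y_te).mean())
```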
Be different
“In order to be irreplaceable, one must always be different.” — Coco Chanel
Considering the stochastic nature of training, training the same neural network several times will produce different models. Combining these different runs into an ensemble can exploit this variability to reduce the variance and thus improve generalization.
One solution may be simply to vary the hyperparameters (the learning rate, for example); this makes the ensemble more heterogeneous. Dropout, sparsity constraints, or other regularization techniques can also be used to generate different models.
Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors. (source)
As has been pointed out by some research, dropout can be seen as a way of implicitly creating an ensemble. During training, some hidden units with a probability p (user-chosen hyperparameter) are disabled (set to zero).
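A quick way to see this implicit ensemble at work is Monte Carlo dropout: keep dropout active at inference and average several stochastic forward passes, each corresponding to a different "thinned" sub-network. The PyTorch sketch below is a minimal illustration with an untrained toy model.

```python
# Minimal PyTorch sketch: dropout as an implicit ensemble.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),   # p is the drop probability
    nn.Linear(64, 2),
)

x = torch.randn(8, 20)          # a dummy batch; in practice use real inputs

net.train()                     # keep the dropout layers active at inference
with torch.no_grad():
    # Each pass drops a different random subset of units -> a different sub-network.
    probs = torch.stack([net(x).softmax(dim=-1) for _ in range(20)])

mean_prediction = probs.mean(dim=0)    # implicit-ensemble average
uncertainty = probs.std(dim=0)         # disagreement between the sub-networks
print(mean_prediction.shape, uncertainty.shape)
```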
Because training an ensemble of neural networks can be expensive, several groups have tried to develop implicit ensembles. In addition to dropout, other methods deactivate entire layers (e.g., entire residual blocks in ResNets) or drop units and layers together at random. Another popular approach is snapshot ensembling.
Taking snapshots
Of course, you can use models with different architectures (different numbers of neurons, different layers, and so on). Alternatively, an interesting approach is to use a single model but take different snapshots during training. In this approach, a cyclic learning-rate schedule is used so that the model visits different local minima (and a snapshot is taken at each minimum). The ensemble is made up of these snapshots, and their predictions are averaged.
It is well established (Kawaguchi, 2016) that the number of possible local minima grows exponentially with the number of parameters — of which modern neural networks can have millions […] Although different local minima often have very similar error rates, the corresponding neural networks tend to make different mistakes. (source)
The advantage of this approach is that we have only one model and a single training run. A variation is simply to train the model for defined ranges of epochs (taking snapshots), then select the best-performing snapshots and average their predictions. In another approach, the authors added noise and varied the training to obtain more heterogeneous snapshots. Alternatively, since each layer learns a representation of the data, these representations can be used as inputs to a new model.
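Here is a minimal PyTorch sketch of the snapshot idea: a cosine learning-rate schedule with warm restarts, a saved copy of the weights at the end of each cycle, and an average of the snapshots' predictions at test time. The model, data, and cycle lengths are toy placeholders rather than the settings of the cited paper.

```python
# Minimal sketch of snapshot ensembling with a cyclic learning rate.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Restart the learning rate every 50 steps (one "cycle" per snapshot).
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(512, 20)                      # dummy data for illustration
y = torch.randint(0, 2, (512,))

snapshots = []
for step in range(250):                       # 5 cycles of 50 steps
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    if (step + 1) % 50 == 0:                  # end of a cycle: take a snapshot
        snapshots.append(copy.deepcopy(model.state_dict()))

def snapshot_predict(x):
    """At test time, average the predictions of the saved snapshots."""
    probs = []
    for state in snapshots:
        model.load_state_dict(state)
        model.eval()
        with torch.no_grad():
            probs.append(model(x).softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)

print(snapshot_predict(torch.randn(4, 20)))
```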
Model branching
The previous methods create a diversity of models starting from a single model. Single-model ensembles reduce cost, but they also reduce model diversity. One solution to this problem is model branching.
The information captured by the initial layers of a neural network is probably the same or similar across models. For example, CNNs have a hierarchical structure in which the first layers learn simple representations such as lines, textures, and other low-level patterns. This is the case for all CNN architectures, so training those layers separately in every ensemble member is partly a waste of resources:
Ensemble approaches likely introduce wasteful duplication of parameters in generic lower layers, increasing training time and model size. The hierarchical nature of CNNs makes them well-suited to alternative ensembling approaches where member models benefit from shared information at the lower layers while retaining the advantages of classical ensembling methods. (source)
So we can have intermediate situations between a single model and an ensemble of models.
In one of the simplest forms, it has been proposed to attach several different fully connected heads to the same CNN.
Alternatively, convolutional blocks can also be branched. These models have the advantage of reducing the computational cost and also the risk of vanishing gradients, because the gradient is propagated along a shorter path. Moreover, by sharing parameters, these models converge much more easily. Alternatively, a pre-trained model can be used in a knowledge-distillation setup, where the pre-trained model acts as the teacher and the branched models act as students.
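The sketch below shows one possible form of model branching in PyTorch: a shared backbone (the generic lower layers) feeding several independent heads whose outputs are averaged. The architecture and sizes are arbitrary illustrations, not a specific published design.

```python
# Minimal sketch of model branching: one shared backbone, several heads.
import torch
import torch.nn as nn

class BranchedNet(nn.Module):
    def __init__(self, in_dim=20, n_classes=2, n_branches=3):
        super().__init__()
        # Shared lower layers: trained once, used by every branch.
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        # Each branch acts as a (partially independent) ensemble member.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, n_classes))
            for _ in range(n_branches)
        )

    def forward(self, x):
        shared = self.backbone(x)
        logits = torch.stack([head(shared) for head in self.heads])
        return logits.mean(dim=0)             # average the branch predictions

model = BranchedNet()
print(model(torch.randn(8, 20)).shape)        # -> torch.Size([8, 2])
```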
Heterogeneous ensembling
Since training a homogeneous ensemble of neural networks is expensive (the base learner is a neural network, diversified with the techniques seen above), some authors have proposed creating ensembles of neural networks in combination with traditional machine learning models. For example, one study used an ensemble consisting of XGBoost, a neural network, and logistic regression for default prediction. Another showed that for text classification a heterogeneous ensemble (a CNN plus traditional algorithms) obtained better results than a homogeneous one.
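As a minimal sketch of a heterogeneous ensemble, the snippet below combines a small neural network, logistic regression, and gradient boosting (standing in here for XGBoost) via soft voting in scikit-learn; the data and settings are illustrative only.

```python
# Minimal sketch of a heterogeneous ensemble merged by soft voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("net", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("gbm", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",                      # average the predicted probabilities
)
ensemble.fit(X_tr, y_tr)
print("Heterogeneous ensemble accuracy:", ensemble.score(X_te, y_te))
```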
Combining the knowledge of a crowd
You don’t get unity by ignoring the questions that have to be faced. — Jay Weatherill
Once you have an ensemble of models you have to get the final prediction. To achieve this we have to combine the predictions of the various models in the ensemble. There are different strategies for aggregating the predictions of the submodels.
Majority voting
For each example, the predictions of each model are collected, and the class that gets the most votes is assigned (e.g., for an example x in binary classification we count how many votes class 0 and class 1 received, and the label is the class with the most votes); this is hard voting. Soft voting is a variant in which we sum (or average) the probabilities assigned to each class instead of counting the predicted labels (a minimal sketch follows the list below).
- The method is simple but requires that all models in the ensemble perform comparably well and generally agree on the class. It is also recommended for models with stochastic learning, such as neural networks.
- Soft voting should be used when we are interested in the probability of class membership. If soft voting is used it would be better to calibrate the models.
- Since it treats all models equally (same voting rights), it is not recommended when some models in the ensemble perform worse than others.
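Here is the promised sketch of hard versus soft voting, in plain NumPy; the probability arrays are invented solely to show how the two rules can disagree.

```python
# Minimal sketch of hard vs. soft voting over three models' outputs.
import numpy as np

# Predicted class probabilities for 4 examples from 3 models:
# shape (models, examples, classes). Values are made up for illustration.
probas = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.51, 0.49]],
    [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7], [0.52, 0.48]],
    [[0.6, 0.4], [0.7, 0.3], [0.4, 0.6], [0.05, 0.95]],
])

# Hard voting: each model casts one vote (its argmax class); the majority wins.
votes = probas.argmax(axis=2)                                   # (models, examples)
hard = np.array([np.bincount(v, minlength=2).argmax() for v in votes.T])

# Soft voting: average the probabilities first, then take the argmax.
soft = probas.mean(axis=0).argmax(axis=1)

print("hard voting:", hard)   # -> [0 0 1 0]
print("soft voting:", soft)   # -> [0 0 1 1]  (the last example flips under soft voting)
```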
Unweighted Averaging voting
The predictions of the various models are averaged to obtain the final prediction (it is very similar to majority voting). In short, the final prediction is the arithmetic mean of the submodels' outputs; the class probabilities (or softmax outputs) can be averaged as well.
- It is superior to max voting and reduces overfitting. For neural network ensembles, it increases generalization because it reduces variance.
- It has disadvantages similar to max voting because it assumes that all models are equally capable. It is also slightly more expensive computationally.
- It is not recommended when using heterogeneous ensembles.
For neural networks, an alternative to averaging the outputs of different models is to average the weights of different models with the same architecture, and thus use a single model:
[the loss surface and the net weights] …suggests it is promising to average these points in weight space, and use a network with these averaged weights, instead of forming an ensemble by averaging the outputs of networks in model space. (source)
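The sketch below shows only the mechanics of averaging in weight space with PyTorch. Note that the cited work averages points along the same training trajectory; here, untrained copies of the same architecture stand in for trained models purely to illustrate the operation.

```python
# Minimal sketch: merge several models with identical architecture by
# averaging their parameters, then use the single averaged model.
import copy
import torch
import torch.nn as nn

def make_net():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# In practice these would be checkpoints from the same training trajectory
# (or the same architecture trained with different seeds); here they are stand-ins.
members = [make_net() for _ in range(3)]

averaged = make_net()
avg_state = copy.deepcopy(members[0].state_dict())
for key in avg_state:
    avg_state[key] = torch.stack([m.state_dict()[key] for m in members]).mean(dim=0)
averaged.load_state_dict(avg_state)

# A single forward pass now replaces querying every ensemble member.
print(averaged(torch.randn(4, 20)).shape)
```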
Weighted Averaging voting
This is a slightly modified version of unweighted averaging in which a weight is assigned to each individual sub-learner (indicating its importance in the prediction). Usually, the weight assigned to each model is a measure of how much we trust that submodel. Typically the weights sum to one, and the value for each model reflects our expectations about its performance. The main problem is finding good weights for the various submodels.
Typically, part of the examples is kept aside as a holdout dataset, and performance is measured on it. A simple, exhaustive approach is a kind of grid search (between 0 and 1 for each model) to find the best combination of weights. Another approach is to use a linear solver or gradient-descent-based optimization to obtain the weights (see the sketch after the list below).
- Generally, this method gives better results but is computationally more expensive.
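Here is that sketch, taking the grid-search route: weights constrained to sum to one are scanned on a coarse grid and scored on a holdout set. The submodel probabilities and labels are synthetic stand-ins invented for illustration.

```python
# Minimal sketch of weighted averaging: grid-search the weights on a holdout set.
import itertools
import numpy as np

rng = np.random.default_rng(0)
y_holdout = rng.integers(0, 2, size=200)
# Stand-ins for the holdout probabilities of three already-trained submodels
# (the first one is made the most reliable by adding the least noise).
holdout_probas = [
    np.clip(np.eye(2)[y_holdout] + rng.normal(0, s, (200, 2)), 0, 1)
    for s in (0.3, 0.5, 0.7)
]

best_weights, best_acc = None, -1.0
grid = np.round(np.arange(0.0, 1.05, 0.1), 1)
for w1, w2 in itertools.product(grid, grid):
    w3 = round(1.0 - w1 - w2, 1)
    if w3 < 0:
        continue                     # keep the weights on the simplex (sum to one)
    blended = w1 * holdout_probas[0] + w2 * holdout_probas[1] + w3 * holdout_probas[2]
    acc = (blended.argmax(axis=1) == y_holdout).mean()
    if acc > best_acc:
        best_weights, best_acc = (w1, w2, w3), acc

print("best weights:", best_weights, "holdout accuracy:", round(best_acc, 3))
```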
Meta learner
Also called "learning to learn," it is based on the idea that the outputs of the submodels are used by a meta-learner to learn how to merge them into the final output (the learning process has two stages). The meta-learner can be any machine learning model, even a neural network. Using a neural network for stacking can be advantageous: we can first train the ensemble models alone (level 0) and then train a neural network (the meta-learner, level 1) to combine the level-0 outputs; or we can fine-tune the ensemble models during the training of the meta-learner.
- Good performance, but it is more expensive: computational time grows with the number of models and the amount of data. It is also prone to overfitting with high-dimensional data and in multi-label settings. Additionally, stacking ensembles are less interpretable than other models.
- Selecting the right combination of algorithms requires experience, and a clear rationale is missing. A principled procedure, instead of trial and error, would make the process much more efficient.
Stacked ensembles have proven successful in numerous Kaggle competitions, but it is often unclear why certain algorithms and features were used and not others. Despite the additional complexity, stacked deep learning ensembles have been used to forecast electricity demand in Spain and Australia, for cancer classification, and more.
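Below is a minimal stacking sketch with scikit-learn's StackingClassifier, using two small neural networks as level-0 learners and logistic regression as the level-1 meta-learner (all choices illustrative). StackingClassifier builds the meta-learner's training set with internal cross-validation, which helps limit the overfitting mentioned above.

```python
# Minimal sketch of stacking: two networks at level 0, a meta-learner at level 1.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("net_a", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
        ("net_b", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
    cv=5,            # level-1 features come from out-of-fold predictions
)
stack.fit(X_tr, y_tr)
print("Stacked ensemble accuracy:", stack.score(X_te, y_te))
```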
Cascade
Cascades are a special case of ensembles. In an ensemble, the models are trained in parallel and the final predictions are combined. In a cascade, the models are run sequentially, and we stop adding models as soon as the combined prediction is confident enough. The advantage is lower computation, especially when the input is simple. For complex inputs, the system may call several models (instead of a fixed number, as in ensembles), leading to a higher computational cost.
A recent Google study found ensembles to be particularly efficient in high-compute regimes (above roughly 5B parameters), so instead of one large model an ensemble can be more efficient. Cascades turn out to be even more efficient because they avoid extra computation, especially with an early-exit mechanism (we stop adding models once we are satisfied with the prediction).
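A minimal sketch of the cascade logic: models ordered from cheap to expensive are queried one at a time, and we stop as soon as the running average of their probabilities is confident enough. Models, threshold, and data are invented stand-ins.

```python
# Minimal sketch of a cascade with an early-exit confidence threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ordered from cheapest to most expensive.
cascade = [
    LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X_tr, y_tr),
    MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0).fit(X_tr, y_tr),
]

def cascade_predict(x, threshold=0.9):
    """Average the probabilities of the models used so far; stop when confident."""
    running = np.zeros(2)
    for i, model in enumerate(cascade, start=1):
        running += model.predict_proba(x.reshape(1, -1))[0]
        averaged = running / i
        if averaged.max() >= threshold:        # early exit: skip the larger models
            break
    return averaged.argmax()

preds = np.array([cascade_predict(row) for row in X_te])
print("Cascade accuracy:", (preds == y_te).mean())
```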
Mixture of experts
A mixture of experts (MOE) is a particular ensembling technique specifically used for neural networks (although it could be generalized for any model). The basic idea is to decompose a task into a series of sub-tasks. After that, a model (an expert) is trained on each subtask. Finally, a gating model learns which expert should be used to predict an input and how to combine the predictions.
MoE is a modular neural network architecture. They are the simplest and most successful modular neural network architectures. MoE consists of modules, called experts, and a gate. The experts and the gate are simple neural networks. (source)
For example, if you need to predict the class of an image, each expert focuses on a task (background, foregrounds, colors, objects, and so on). Each expert will receive the same inputs and learn how to make a prediction. Another neural network called the gating model interprets the experts’ predictions and decides which expert should be believed for an input (typically the gating model has a softmax layer that outputs a probability for each expert). Both the experts and the gating network are trained together by expectation maximization. Finally, the predictions are aggregated (the expert with the highest probability for the gating model, pooling, and so on).
Of course, MoE is not restricted to images. For tabular data, some columns or features can be assigned to each expert, or each expert can focus on a particular region of the feature space.
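Below is a minimal PyTorch sketch of the wiring described above: a gating network produces a softmax over experts, and the experts' predictions are combined with those weights. The sizes, the number of experts, and the (omitted) training procedure are illustrative assumptions, not a specific published design.

```python
# Minimal sketch of a mixture of experts: experts + a softmax gate.
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, in_dim=20, n_classes=2, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, n_classes))
            for _ in range(n_experts)
        )
        # The gate sees the same input and outputs one probability per expert.
        self.gate = nn.Sequential(nn.Linear(in_dim, n_experts), nn.Softmax(dim=-1))

    def forward(self, x):
        gate_probs = self.gate(x)                                      # (batch, experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, classes)
        # Weight each expert's prediction by how much the gate trusts it.
        return (gate_probs.unsqueeze(-1) * expert_out).sum(dim=1)      # (batch, classes)

moe = MixtureOfExperts()
print(moe(torch.randn(8, 20)).shape)          # -> torch.Size([8, 2])
```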
MOEs have several advantages:
- Training is fast and so is inference since computation is conditional (experts may not all be called at a given time).
- Transferability of sub-tasks learned by experts to other tasks (so it has been proposed for continual learning).
- Ability to solve multimodal problems when using heterogeneous experts.
- Extension to multi-task with multi-gate.
Open research questions and future challenges
You don’t get unity by ignoring the questions that have to be faced. — Jay Weatherill
Although neural ensembles have shown their effectiveness empirically, a theoretical framework is still lacking. In 1992, the description of the bias-variance trade-off for neural networks laid the theoretical foundation for neural ensembles. Subsequent works expanded this discussion, in particular by showing that some local minima are better than others in terms of generalization; this laid the groundwork for snapshot ensembling, which exploits exactly these different minima. Similarly, the theoretical description of dropout as a form of averaging inspired implicit ensembles.
There are still theoretical aspects that should be better investigated and would open the possibility for new models.
Other interesting challenges also still remain, some of which are especially associated with tabular data. For some of these challenges, ensembles might be an optimal strategy, but for others, they create additional complexity:
- Small size. Neural networks work much better with large datasets, but sometimes collecting large datasets is not possible, leading to unstable predictions and little reproducibility. Neural ensembles are more stable and struggle less with the variance of small datasets, though there is still room for improvement.
- High dimensionality. The curse of dimensionality is the prime cause of overfitting since we need numerous parameters to handle numerous features. Neural ensemble can reduce the problem.
- Class imbalance. This is another classic problem where ensembles work well.
- Noise and heterogeneity. Neural networks, especially on small datasets, may struggle to learn the optimal feature mapping. Ensembles are more stable in response to noise.
- Interpretability. Neural networks are notoriously opaque and ensembles pose additional complexity (especially in the case of stacking).
- Network architecture. The choice of architecture is crucial, especially for the domain and application. On the other hand, this choice is often derived from experience and trial and error. Every year new architectures are proposed that open up new possibilities, but at the same time we have no framework for choosing the best architectures for an ensemble. In addition, hybrid architectures sometimes give better results, making the choice even more difficult.
- Computational expense. Deep learning models are computationally expensive per se; neural ensembles are significantly more expensive. We have seen architectures and methods to reduce both the number of parameters and training time; improvements in hardware and techniques such as federated learning will bring additional benefits. In any case, more efficient ensemble deep-learning algorithms would be a definite improvement.
Parting thoughts
Talent perceives differences; genius, unity. — William Butler Yeats
As we have seen the same techniques that are used to create ensembles of classical machine learning can be used to create ensembles of neural networks. In addition, neural networks are more versatile and allow ensembles to be created with elegant solutions (such as snapshot ensembling).
Compared to the single model, ensembles improve performance, are more robust and stable, reduce bias and variance, are more reliable, and have greater capability in handling complex and noisy data.
All these advantages come at a price: high computational cost, additional training data needed to reach optimal performance, potential overfitting, increased model complexity, and the fact that an ensemble designed for one task is harder to adapt to another.
Finding a compromise may not be easy, but ensembles provide great flexibility. For example, GPT-4, one of the most advanced large language models, is rumored to be an ensemble (a mixture of experts of 8 models). This shows that research on neural ensembles is more active than ever.
What do you think? Let me know in the comments
If you have found this interesting:
You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository, which contains weekly updated ML & AI news. I am open to collaborations and projects.
Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, artificial intelligence, and more.
References
Here is the list of the principal references I consulted to write this article (only the first author of each article is cited).
- Li, 2017, Visualizing the Loss Landscape of Neural Nets, link
- Ganaie, 2021, Ensemble deep learning: A review, link
- Mohammed, 2023, A comprehensive review on ensemble deep learning: Opportunities and challenges, link
- Cao, 2020, Ensemble deep learning in bioinformatics, link
- Dr. Roi Yehoshua, Introduction to Ensemble Methods, link
- Esteban Thilliez, Data Science with Python — Ensemble Methods, link
- Saupin Guillaume, XGBoost explained: DIY XGBoost library in less than 200 lines of python, link
- Thomas A Dorfer, Bagging vs. Boosting: The Power of Ensemble Methods in Machine Learning, link
- Patrizia Castagno, An Easy Guide to Understanding Ensemble Methods, link
- Ju, 2017, The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification, link
- Han, 2017, Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition, link
- Taherkhani, 2020, AdaBoost-CNN: An adaptive boosting algorithm for convolutional neural networks to classify multi-class imbalanced datasets using transfer learning, link
- Huang, 2017, Snapshot Ensembles: Train 1, get M for free, link
- Xie, 2013, Horizontal and Vertical Ensemble with Deep Representation for Classification, link
- Loshchilov, 2016, SGDR: Stochastic Gradient Descent with Warm Restarts
- Hara, 2017, Analysis of dropout learning regarded as ensemble learning, link
- Huang, 2016, Deep Networks with Stochastic Depth, link
- Lee, 2015, Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks, link
- Izmailov, 2018, Averaging Weights Leads to Wider Optima and Better Generalization, link
- Chatzimparmpas, 2020, StackGenVis: Alignment of Data, Algorithms, and Models for Stacking Ensemble Learning Using Performance Metrics, link
- Bengio, 2009, Learning Deep Architectures for AI, link
- Yang, 2022, A Survey on ensemble learning under the era of deep learning, link
- Krishnamurthy, 2023, Improving Expert Specialization in Mixture of Experts, link