In the previous posts, we discussed the Bagging and Boosting ensemble learning techniques in ML and why they are useful. We also discussed two algorithms based on Boosting, i.e. AdaBoost and Gradient Boosting. In this part, we will discuss another ensemble learning technique known as Stacking. We will also briefly discuss Blending, a closely related ensemble learning technique.
Stacking is another way to ensemble multiple classification or regression models. We have seen that Bagging averages multiple similar high-variance models to decrease variance, and that Boosting builds multiple models incrementally to decrease bias while keeping variance small.
The purpose of stacking is to explore a space of different models for the same problem. The idea is to attack a learning problem with different types of models, each capable of learning some part of the problem but not the whole problem space. We build several different learners and use each of them to make an intermediate prediction, one prediction per learned model. We then add a new model that learns from these intermediate predictions to predict the same target.
This final model is said to be stacked on top of the others, hence the name. Stacking can thus improve overall performance, and we often end up with a model that is better than any individual intermediate model.
Stacking is a heterogeneous ensemble learning technique, i.e. we can use different kinds of ML models as base models, whereas Bagging and Boosting are typically homogeneous ensemble techniques, i.e. the same kind of ML model is used for every base model.
In the Stacking ensemble technique, an algorithm takes the outputs of sub-models as input and attempts to learn how best to combine these predictions into a better output prediction. The stacking procedure can be viewed at two levels:
- At level 0, the input is the original training data. The models at this level (the base models) learn to make predictions from this data.
- At level 1, the outputs of the level 0 models are taken as input, and a single level 1 model, or meta-learner, learns to make predictions from this data.
Base models are often complex and diverse, so it is a good idea to use a range of models that make very different assumptions about how to solve the predictive modeling task. Ensemble algorithms such as AdaBoost or Random Forests may themselves be used as base models. The meta-model, by contrast, is often simple, providing a smooth interpretation of the predictions made by the base models. Linear models are commonly used as the meta-model, though that is not a hard requirement.
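To make the two levels concrete, here is a minimal sketch using scikit-learn's `StackingClassifier`. The synthetic dataset, the particular base models, and all hyperparameters are assumptions chosen only for illustration; they are not prescribed by the technique itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Level 0: diverse base models that make different assumptions.
base_models = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("svm", SVC(random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Level 1: a simple linear meta-model trained on the base models'
# cross-validated predictions (cv=5 is an illustrative choice).
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacked model accuracy: {stack_accuracy:.3f}")
```

Note that a Random Forest, itself an ensemble, is used as one of the base models here, and the meta-model is a plain logistic regression.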
Let’s understand Stacking using an example, and the steps involved in it:
1 – Suppose we have a train set and a test set, and we split the train set into 10 parts.
2 – A decision tree is chosen as a base model. It is fitted on 9 parts and used to make predictions for the 10th part, and this is repeated until every part of the train set has been held out once. In other words, k-fold cross-validation is used to obtain out-of-fold predictions from the base model for the whole train set.
3 – The base model is then fitted on the whole train dataset and predictions are made on the test set.
4 – Steps 2 and 3 are repeated for another base model, let’s say an SVM, resulting in another set of predictions for the train set and the test set.
5 – The predictions on the train set are used as features to build a new model (the meta-model), and this model makes the final predictions using the base models' test set predictions as its input features.
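The steps above can be sketched by hand as follows. This is one possible reading of the procedure, assuming scikit-learn, a synthetic dataset, predicted class labels as the meta-features (predicted probabilities would be another option), and logistic regression as the meta-model; none of these choices come from the steps themselves.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Step 1: a train set and a test set (synthetic, for illustration only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0),
    SVC(random_state=0),
]

train_columns = []
test_columns = []
for model in base_models:
    # Step 2: 10-fold cross-validation gives an out-of-fold prediction
    # for every row of the train set.
    train_columns.append(cross_val_predict(model, X_train, y_train, cv=10))
    # Step 3: refit on the whole train set and predict on the test set.
    model.fit(X_train, y_train)
    test_columns.append(model.predict(X_test))
    # Step 4 is the next iteration of this loop (the SVM).

# One column of meta-features per base model.
train_meta = np.column_stack(train_columns)
test_meta = np.column_stack(test_columns)

# Step 5: the meta-model learns from the train set predictions and
# makes the final predictions from the test set predictions.
meta_model = LogisticRegression()
meta_model.fit(train_meta, y_train)
final_predictions = meta_model.predict(test_meta)

manual_accuracy = float(np.mean(final_predictions == y_test))
print(f"Manual stacking accuracy: {manual_accuracy:.3f}")
```

Using out-of-fold predictions as training features matters: the meta-model only ever sees predictions made on rows the base model was not fitted on.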
There is also an ensemble learning technique known as Blending. It follows the same approach as stacking, but uses only a validation (or holdout) set carved out of the train set to build the meta-model's features, i.e. instead of using predictions on the whole train set as features, it uses predictions on the validation set as the features of the final meta-model.
The general approach of Blending:
- The whole train set is split into a training set and a validation set.
- Then heterogeneous base models are trained on the training set.
- The base models make predictions only on the validation set and the test set.
- The validation predictions are used as features to build a new model.
- This model is used to make final predictions on the test set, using the base models' test set predictions as features.
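The blending approach above might look like this in code. As before, this is a sketch under assumptions: scikit-learn, a synthetic dataset, a decision tree and k-nearest neighbours as the heterogeneous base models, predicted probabilities as meta-features, and arbitrary split sizes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example is self-contained.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Split the whole train set into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_full_train, y_full_train, test_size=0.3, random_state=0
)

# Heterogeneous base models, trained on the training set only.
base_models = [
    DecisionTreeClassifier(max_depth=5, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

val_columns = []
test_columns = []
for model in base_models:
    model.fit(X_train, y_train)
    # Predict only on the validation set and the test set.
    val_columns.append(model.predict_proba(X_val)[:, 1])
    test_columns.append(model.predict_proba(X_test)[:, 1])

val_meta = np.column_stack(val_columns)
test_meta = np.column_stack(test_columns)

# The meta-model is built from the validation predictions ...
blender = LogisticRegression()
blender.fit(val_meta, y_val)

# ... and makes the final predictions from the test predictions.
blend_predictions = blender.predict(test_meta)
blend_accuracy = float(np.mean(blend_predictions == y_test))
print(f"Blending accuracy: {blend_accuracy:.3f}")
```

Compared with the cross-validated stacking procedure, each base model is fitted only once here, but the meta-model learns from fewer rows, since only the validation set supplies its training features.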
This was a quick introduction to the Stacking and Blending ensembling techniques, and it concludes our Ensemble Learning Series.