Let’s understand ensemble learning with an example. Suppose you have a startup idea and want to know whether it is worth moving ahead with. You would like some preliminary feedback before committing your money and precious time to it. You may ask one of your close friends, but it’s possible that your friend doesn’t want to demotivate you and so avoids giving negative feedback even if the idea is bad. You may instead ask 5-10 people with different skills, the people you are planning to work with, to evaluate the idea. By doing so you get a better picture, as these people are likely to give you honest feedback, since you are also asking for their time to work on the idea. Their responses will be more generalized and diversified, since you now have people with different sets of skills.
From this example, we can see that a diverse group of people is likely to make better decisions than an individual, and the same holds for a diverse set of models compared to a single model. In ML, this generalization and diversification is achieved by a technique called Ensemble Learning.
This post is the first part of the Ensemble Learning series. In this series, there will be a total of 3 parts, and we will be discussing one ensemble technique in each part: Bagging, Boosting, and Stacking.
In this part, we will discuss the bagging ensembling technique and some ML algorithms based on it.
Bagging
Bagging, or Bootstrap Aggregation, is a simple but very powerful ensemble technique. The idea behind it is to combine the results of multiple models to get a generalized and diverse result. But if we fit multiple models on exactly the same data and combine them, will that be useful? The answer is no: there is a high probability that these models will output similar results, since they are getting the same input. Bootstrapping is one technique that solves this problem. Bagging helps in reducing variance, and we will see how it is able to do that.
Bootstrapping is a sampling technique in which we create subsets of instances from the original dataset, and the sampling is done with replacement. Let’s say we have to create 2 subsets, each with 100 observations, from our data of 200 observations. We first draw 100 observations to make Subset 1; because the sampling is with replacement, every drawn observation is placed back into the data, so we then create Subset 2 by again drawing 100 observations from the full dataset. (Since an observation is placed back as soon as it is drawn, it can even appear more than once within the same subset.) The bagging technique uses these subsets to get a fair idea of the distribution of the complete dataset. The size of the subsets created may be smaller than the original set.
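As a rough illustration (the array and its size are made up for this example), here is how those two bootstrap subsets could be drawn with NumPy:

```python
import numpy as np

rng = np.random.default_rng(42)

# Original data: 200 observations (here simply the row indices 0..199).
data = np.arange(200)

# Draw two bootstrap subsets of 100 observations each, *with replacement*:
# every draw is made from the full dataset, so an observation can appear
# in both subsets and even more than once within the same subset.
subset_1 = rng.choice(data, size=100, replace=True)
subset_2 = rng.choice(data, size=100, replace=True)

print(len(np.unique(subset_1)))  # typically fewer than 100 unique values
```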
Bagging is the application of the Bootstrap procedure to high-variance ML algorithms, mainly decision trees. All the models are combined at the end, which leads to higher stability and lower variance compared to the individual models, so we can say that bagging helps to reduce variance.
Steps for bagging
- Draw a random sample of observations (and, optionally, a subset of features) and fit a model on it; at each split, the feature from that subset that gives the best split on the training data is chosen.
- Repeat this to create many models; since the models are independent of each other, they can be trained in parallel.
- Collect predictions from all the trained models.
- The final prediction is the aggregation of the predictions from all the models (e.g., majority vote for classification, average for regression), as sketched in the code below.
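To make these steps concrete, here is a minimal sketch of the procedure using scikit-learn decision trees as the base models; the dataset, the number of models, and all parameter values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification dataset, just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

rng = np.random.default_rng(0)
n_models = 25
models = []

# Steps 1-2: train many models, each on its own bootstrap sample.
# (Done sequentially here; in practice the models can be trained in
# parallel, since they are independent of each other.)
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: collect predictions from all trained models.
all_preds = np.array([m.predict(X) for m in models])  # shape: (n_models, n_samples)

# Step 4: aggregate by majority vote (averaging would be used for regression).
final_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the ensemble:", (final_pred == y).mean())
```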
Some Bagging-based ML algorithms
Bagging meta-estimator
It follows the typical bagging technique to make predictions and can be used for both classification (BaggingClassifier) and regression (BaggingRegressor) problems.
Steps
- Bootstrapping (Creating random subsets).
- Subsets include all the features.
- A user-specified base estimator is fitted on each of these subsets.
- The predictions from all the base estimators are combined to get the final result.
Some Important parameters
- base_estimator: the estimator used to fit on random subsets of the dataset.
- n_estimators: number of base estimators in the ensemble.
- max_samples: the number (or fraction) of samples to draw from the original data to train each base estimator.
- max_features: the number (or fraction) of features drawn from the dataset to train each base estimator.
- n_jobs: the number of jobs to run in parallel.
- random_state: seed that controls the random resampling, for reproducibility (a usage sketch follows this list).
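As a minimal usage sketch (with an arbitrary toy dataset and arbitrary parameter values), these parameters map onto scikit-learn’s BaggingClassifier roughly as follows; the base estimator is passed positionally below because recent scikit-learn versions renamed the base_estimator keyword to estimator:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator fitted on each random subset
    n_estimators=50,           # number of base estimators in the ensemble
    max_samples=0.8,           # fraction of samples drawn for each base estimator
    max_features=1.0,          # fraction of features drawn (all features by default)
    n_jobs=-1,                 # train the estimators in parallel
    random_state=42,           # reproducible resampling
)
bag.fit(X_train, y_train)
print("test accuracy:", bag.score(X_test, y_test))
```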
Random Forest
Random Forest is an extension of the previously mentioned bagging meta-estimator. Its base estimators are decision trees and, in addition to bootstrapping the observations, Random Forest considers only a random subset of the features at each node of each decision tree and picks the best split among them.
Steps
- Bootstrapping (Creating random subsets).
- At each node in the decision tree, instead of considering all the features, only a random subset of features is considered to decide the best split.
- Construct a decision tree for every subset.
- Predictions from all the decision trees are combined to get the final result.
Some Important parameters
- n_estimators: it is the number of decision trees to be created in a random forest.
- criterion: function to be used for splitting (default=Gini Impurity).
- max_features: the maximum number of features considered when looking for the best split at each node of a decision tree.
- max_depth: maximum depth of the decision trees.
- min_samples_split: it is the minimum number of samples a node must contain before a split is attempted (i.e., before it can be split into child nodes).
- min_samples_leaf: it is the minimum number of samples required to be at a leaf node.
- n_jobs: number of jobs to run in parallel.
- random_state: seed that controls the randomness of the forest, for reproducibility (a usage sketch follows this list).
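And a minimal usage sketch for scikit-learn’s RandomForestClassifier with the parameters above (the dataset and the parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset, just for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    criterion="gini",     # splitting criterion (default: Gini impurity)
    max_features="sqrt",  # features considered at each split
    max_depth=None,       # no depth limit: grow trees until leaves are pure
    min_samples_split=2,  # min samples a node needs before it can be split
    min_samples_leaf=1,   # min samples required at a leaf node
    n_jobs=-1,            # build the trees in parallel
    random_state=42,      # reproducibility
)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```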
This was a quick introduction to the bagging ensembling technique. In the next part, we will explore the boosting ensembling technique, how it works, how it is different from bagging, ML algorithms that are based on it, and its advantages. See you in the next part of this series!! 🙂
Author
Shubham Bindal