Improve your model with voting, bagging, boosting, and stacking

Photo by Justin Roy on Unsplash

Ensemble learning is a machine learning technique that combines multiple models into one group model, in other words, an _ensemble model_. The goal of an ensemble model is to perform better than each model individually, or, failing that, at least as well as the best single model in the group.

In this article, you will learn about the popular ensemble methods **voting**, **bagging**, **boosting**, and **stacking**, along with their Python implementations. We will use [scikit-learn](https://scikit-learn.org/stable/index.html) for voting, bagging, and boosting, and [mlxtend](http://rasbt.github.io/mlxtend/) for stacking.

In the meantime, I encourage you to check out my Jupyter notebook on GitHub for the full analysis and code. 🌻

Introduction

The intuition behind ensemble learning is often described by a phenomenon called the “wisdom of the crowd”, which means that collective decisions made by a group of people are often better than individual decisions. There are several ways to create an ensemble model, and we can divide them into heterogeneous and homogeneous ensembles.

In a heterogeneous ensemble, we combine multiple different fine-tuned models trained on the same data set to generate an ensemble model. This approach usually involves _voting_, _averaging_, or _stacking_ techniques. On the other hand, in a homogeneous ensemble, we use many instances of the **same** model, which we call a “weak model”, and transform these weak models into a strong model using techniques such as _bagging_ and _boosting_.

Let’s start with the basic ensemble learning methods for heterogeneous ensembles: voting and averaging.

1. Voting (hard voting)

A hard voting ensemble is used for classification tasks, and it combines the predictions of multiple fine-tuned models trained on the same data based on the majority voting principle. For example, if we ensemble three classifiers whose predictions are “Class A”, “Class A”, and “Class B”, the ensemble model will output “Class A” according to the majority vote, or in other words, the mode of the individual predictions. As you can see, it is preferable to use an odd number of individual models (e.g. 3, 5, or 7 models) so that we never end up with a tie.

Hard voting: multiple models predict the new instance, and the ensemble selects the final result by majority vote

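A minimal sketch of this setup with scikit-learn might look like the following, assuming `X_train`, `X_test`, `y_train`, and `y_test` have already been prepared (the variable names and default hyperparameters are only for demonstration):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Instantiate individual models
clf1 = KNeighborsClassifier()
clf2 = LogisticRegression()
clf3 = DecisionTreeClassifier()

# Combine them into a hard voting ensemble
clf_voting = VotingClassifier(
    estimators=[("knn", clf1), ("lr", clf2), ("dt", clf3)],
    voting="hard")

# Fit each model and the ensemble, then compare accuracy on the test set
for clf in (clf1, clf2, clf3, clf_voting):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, pred))
```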

Accuracy score:

KNeighborsClassifier 0.93
LogisticRegression 0.92
DecisionTreeClassifier 0.93
VotingClassifier 0.94 ✅

As we can see, the voting classifier is the most accurate one! Since the ensemble combines the predictions of the individual models, each model should already be fine-tuned and perform well. In the code above, I only initialized them with default parameters for demonstration purposes.

2. Averaging (soft voting)

Soft voting is used for both classification and regression tasks, and it combines the predictions of multiple fine-tuned models trained on the same data by averaging them. For classification it averages the predicted _probabilities_, and for regression it averages the predicted _values_. Unlike hard voting, we don’t need an odd number of individual models, but we need at least two models to build an ensemble.

Soft voting: new instances are predicted by individual models with equal weights (w), and the ensemble selects the final result by averaging – image courtesy of the author

One advantage of soft voting is that you can decide whether each model should contribute equally (a simple average) or be weighted by its importance, which is an input parameter of the ensemble. If you prefer a weighted average, the output of the ensemble model will be the prediction with the maximum sum of weighted probabilities (for classification) or the weighted average of the predicted values (for regression).

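A corresponding sketch with scikit-learn’s `VotingRegressor`, under the same assumption that the train/test variables already exist (the equal `weights` shown are only illustrative):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor
from sklearn.metrics import mean_absolute_error

# Instantiate individual models
reg1 = DecisionTreeRegressor()
reg2 = LinearRegression()

# Combine them into a voting (averaging) ensemble;
# the optional `weights` parameter controls each model's contribution
reg_voting = VotingRegressor(
    estimators=[("dt", reg1), ("lr", reg2)],
    weights=[1, 1])

# Fit each model and the ensemble, then compare mean absolute error
for reg in (reg1, reg2, reg_voting):
    reg.fit(X_train, y_train)
    pred = reg.predict(X_test)
    print(reg.__class__.__name__, mean_absolute_error(y_test, pred))
```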

Mean absolute error:

DecisionTreeRegressor 3.0

LinearRegression 3.2

VotingRegressor 2.5 ✅

It is important to understand that the performance of voting ensembles (hard and soft voting) depends heavily on the performance of the individual models. If we ensemble one well-performing model with two models that perform only moderately, the ensemble model will show results close to the moderate models. In that case, we either need to improve the moderately performing models, or we should skip the ensemble and use the well-performing model instead. 📌

Now that we know about voting and averaging, we can move on to the last heterogeneous ensemble technique: stacking.

3. Stacking

Stacking stands for “stacked generalization”, and it combines multiple individual models (or base models) with a final model (or meta-model) that is trained on the predictions of the base models. It can be used for both classification and regression tasks, with the option of using predicted values or predicted probabilities for classification tasks.

Unlike in voting ensembles, in stacking the meta-model is itself a trainable model; in fact, it is trained on the predicted values of the base models. Since these predictions are the input features of the meta-model, they are also called meta-features. We can choose to include the original data set in the meta-features or to use only the predictions.

Stacking: the predictions of the base models are used to train the meta-model, which predicts the final output – image courtesy of the author

Stacking can also be done with more than two layers: in multi-layer stacking, we define base models, aggregate them with another layer of models, and then add the final meta-model. Even though this can produce better results, the extra time cost due to the added complexity should be taken into account.

To prevent overfitting, we can use stacking with cross-validation instead of standard stacking; both are implemented in the mlxtend library. Below, I implement:

1. Standard stacking for a classification task

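A minimal stacking sketch with mlxtend, with base models and a meta-model chosen to match the results below; the train/test variables are again assumed to exist:

```python
from mlxtend.classifier import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Base models and the meta-model
clf1 = KNeighborsClassifier()
clf2 = GaussianNB()
clf3 = DecisionTreeClassifier()
meta_clf = LogisticRegression()

# Stacking ensemble: the meta-model is trained on the base models' predictions
clf_stack = StackingClassifier(
    classifiers=[clf1, clf2, clf3],
    meta_classifier=meta_clf)

# Fit every model and compare accuracy on the test set
for clf in (clf1, clf2, clf3, meta_clf, clf_stack):
    clf.fit(X_train, y_train)
    print(clf.__class__.__name__, accuracy_score(y_test, clf.predict(X_test)))
```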

KNeighborsClassifier 0.84
GaussianNB 0.83
DecisionTreeClassifier 0.89
LogisticRegression 0.85
StackingClassifier 0.90 ✅

2. Stacking with cross-validation for a regression task

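A minimal sketch with mlxtend’s `StackingCVRegressor`; which model serves as the meta-model here is an illustrative choice, and the train/test variables are assumed as before:

```python
from mlxtend.regressor import StackingCVRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Base models and the meta-model
reg1 = DecisionTreeRegressor()
reg2 = SVR()
meta_reg = LinearRegression()

# Stacking with cross-validation: out-of-fold predictions of the base
# models are used as meta-features, which helps prevent overfitting
reg_stack_cv = StackingCVRegressor(
    regressors=(reg1, reg2),
    meta_regressor=meta_reg,
    cv=5)

# Fit every model and compare mean absolute error on the test set
for reg in (reg1, reg2, meta_reg, reg_stack_cv):
    reg.fit(X_train, y_train)
    print(reg.__class__.__name__, mean_absolute_error(y_test, reg.predict(X_test)))
```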

Mean absolute error:

DecisionTreeRegressor 3.3
SVR 5.2
LinearRegression 3.2
StackingCVRegressor 2.9 ✅

4. Bagging

Bootstrap aggregation, or “bagging” for short, aggregates multiple estimators that use the _same_ algorithm but are trained on different subsets of the training data. Bagging uses the bootstrap method to create a randomly sampled training data set for each estimator.

Bootstrapping is a method of creating samples from the original data with replacement. Because it samples with replacement, each data point is equally likely to be selected in every draw; as a result, some data points may be selected multiple times, while others may never be selected. We can compute the probability that a data point is not selected in a bootstrap sample of size n with the following formula (ideally, n is a large number).
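P(a data point is never selected) = (1 − 1/n)^n, which approaches 1/e ≈ 0.37 as n grows large.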

This means that each bagging estimator is trained on approximately 63% of the training data set, and we call the remaining 37% the **out-of-bag (OOB) samples**.

In summary, bagging draws _n_ training data sets with replacement from the original training data, one for each of the _n_ estimators. Each estimator is trained in parallel on its own sampled training data set to make predictions. Bagging then aggregates these predictions using a technique such as hard or soft voting.

Bagging: estimators are trained on bootstrapped training samples and their predictions are combined with voting techniques – image by the author

In scikit-learn, we can define the parameter n_estimators, equal to _n_ (the number of estimators/models we want to generate), and oob_score, which can be set to True if we want to evaluate each estimator’s performance on its out-of-bag samples. This way, we can easily estimate how the ensemble performs on previously unseen data without using cross-validation or a separate test set. The oob_score_ attribute calculates the average of all the OOB scores; by default, it uses the accuracy score for classification and R² for regression.

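A minimal sketch, assuming the training variables exist; by default `BaggingClassifier` bags decision trees, and the number of estimators below is only illustrative:

```python
from sklearn.ensemble import BaggingClassifier

# Bagging ensemble (the default base model is a decision tree),
# with out-of-bag evaluation enabled
clf_bagging = BaggingClassifier(
    n_estimators=100,   # n: number of bootstrapped estimators
    oob_score=True)     # evaluate on the out-of-bag samples

clf_bagging.fit(X_train, y_train)

# Average accuracy over the out-of-bag samples
print("Score:", round(clf_bagging.oob_score_, 3))
```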

Score: 0.918

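And a quick comparison against the held-out test set (assuming `X_test` and `y_test` exist):

```python
from sklearn.metrics import accuracy_score

# Compare the OOB estimate with the accuracy on the test set
pred = clf_bagging.predict(X_test)
print("Accuracy score:", round(accuracy_score(y_test, pred), 3))
```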

Accuracy score: 0.916

Because each estimator is trained on a randomly sampled training data set rather than on the original data alone, the bagging technique reduces the variance of a single estimator.

A very popular bagging technique is the random forest, where the estimators are decision trees. Random forest uses the bootstrap method to create training data sets with replacement, and it also selects a subset of features (without replacement) to maximize the randomness of each training data set. Typically, the number of selected features is equal to the square root of the total number of features.
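For illustration, a minimal random forest sketch with scikit-learn, where `max_features="sqrt"` corresponds to the square-root rule mentioned above (the number of estimators is again illustrative, and the training variables are assumed as before):

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest: bagged decision trees with an extra layer of randomness,
# each split considering only sqrt(total features) candidate features
clf_rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    oob_score=True)

clf_rf.fit(X_train, y_train)
print(clf_rf.oob_score_)
```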

5. Boosting

Boosting uses sequential learning, an iterative process that focuses on minimizing the errors of the previous estimator. It is a sequential method in which each estimator depends on the previous one to improve the predictions. The most popular boosting methods are adaptive boosting (AdaBoost) and gradient boosting.

AdaBoost uses the entire training data set for each of the _n_ estimators, with some important modifications. The first estimator (a weak model) is trained on the original data set with equally weighted data points. After the first predictions are made and the errors are calculated, the wrongly predicted data points are given higher weights than the correctly predicted ones. By doing so, the next estimator focuses on these hard-to-predict instances. This process continues until all _n_ estimators (say, 1000) have been trained sequentially. Finally, the ensemble prediction is obtained by weighted majority voting or a weighted average.

AdaBoost: models are trained sequentially with updated weights on the training data – image by the author

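A minimal AdaBoost sketch with scikit-learn, assuming the train/test variables exist (the number of estimators is only illustrative):

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# AdaBoost: n estimators trained sequentially on re-weighted data
reg_ada = AdaBoostRegressor(n_estimators=1000)

reg_ada.fit(X_train, y_train)
pred = reg_ada.predict(X_test)

# Root mean squared error on the test set
print("RMSE:", round(np.sqrt(mean_squared_error(y_test, pred)), 2))
```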

RMSE: 4.18

Gradient boosting, very similar to AdaBoost, improves on the previous estimators through sequential iterations, but instead of updating the weights of the training data, it fits each new estimator to the residual errors of the previous one. XGBoost, LightGBM, and CatBoost are popular gradient boosting algorithms; XGBoost in particular has won many competitions and is popular because it is very fast and scalable.
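For comparison, a minimal gradient boosting sketch using scikit-learn’s `GradientBoostingRegressor` (the dedicated XGBoost, LightGBM, and CatBoost libraries expose very similar interfaces; the hyperparameters below are only illustrative):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Gradient boosting: each new tree is fitted to the residual errors
# of the current ensemble instead of re-weighting the data points
reg_gb = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05)

reg_gb.fit(X_train, y_train)
print(mean_absolute_error(y_test, reg_gb.predict(X_test)))
```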

Conclusion

In this article, we studied the major ensemble learning techniques for improving model performance. We covered the theoretical background of each technique, as well as the associated Python libraries used to demonstrate these mechanisms.

Ensemble learning is a big part of machine learning, and it’s important for every data scientist and machine learning practitioner. You may find there is a lot to learn, but I’m sure you’ll never regret it! 💯

If you need a refresher on bootstrap, or if you want to learn more about sampling techniques, you can check out my article below.

I hope you enjoyed reading about ensemble learning methods and found it helpful to your analysis!

If you enjoyed this post, you can read my other posts here and follow me on Medium. Please let me know if you have any questions or suggestions. ✨

References

  1. Further reading for ensemble learning
  2. Mlxtend library