1 Model Integration (Ensemble)

I have heard a saying: “Features come first, and Ensemble comes last.” Features determine the upper limit of your model, and ensembling brings you closer to that limit. Ensembling emphasizes being “good but different”, meaning that the models should each learn different aspects of the problem. For example, on a math test, if student A does better than student B on function problems while B does better than A on geometry problems, then together they usually get a higher score than either would alone.

Common Ensemble methods are Bagging, Boosting, Stacking, and Blending.

1.1 Bagging

Bagging takes a weighted average, or a majority vote, of the predictions of multiple models (base learners); the name comes from bootstrap aggregating, where each base learner is trained on a bootstrap sample of the data. The advantage of Bagging is that the base learners can be trained in parallel, and Random Forest is built on the Bagging idea. Here is a simple example:

 

The teacher gave two addition questions; the weighted average of student A’s and student B’s answers was more accurate than either A’s or B’s answers alone.
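As a minimal sketch of this idea (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical example: two "students" (base models) answer the same questions.
y_true = np.array([10.0, 20.0])  # true answers to the two addition questions

# Each model's predictions err in different directions.
pred_a = np.array([9.0, 21.5])   # model A
pred_b = np.array([11.5, 19.0])  # model B

# Bagging-style combination: a simple (or weighted) average of the predictions.
pred_avg = 0.5 * pred_a + 0.5 * pred_b

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

print(rmse(y_true, pred_a))    # error of A alone
print(rmse(y_true, pred_b))    # error of B alone
print(rmse(y_true, pred_avg))  # the average is closer to the truth than either model
```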

Bagging usually does not optimize a specific objective, but there is a method called Bagging Ensemble Selection, which uses a greedy algorithm to select and bag multiple models so as to optimize a target metric. We also used this method in this competition.
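As a rough sketch of what Bagging Ensemble Selection does, assuming a greedy forward-selection loop over validation predictions (the function name and interface here are illustrative, not the code we actually used):

```python
import numpy as np

def greedy_ensemble_selection(preds, y_true, metric, n_iter=20):
    """Greedy forward selection of models to optimize a metric (a sketch).

    preds  : list of prediction arrays, one per candidate model
    y_true : ground-truth targets for the validation data
    metric : function(y_true, y_pred) -> score to *minimize*
    Models may be selected multiple times, which implicitly weights them.
    """
    selected = []                                     # chosen model indices (with replacement)
    ensemble_sum = np.zeros_like(y_true, dtype=float)

    for _ in range(n_iter):
        best_idx, best_score = None, None
        for i, p in enumerate(preds):
            candidate = (ensemble_sum + p) / (len(selected) + 1)  # average if model i is added
            score = metric(y_true, candidate)
            if best_score is None or score < best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        ensemble_sum += preds[best_idx]

    # Convert selection counts into averaging weights over the candidate models.
    return np.bincount(selected, minlength=len(preds)) / len(selected)

# Hypothetical usage with an RMSE metric:
# rmse = lambda y, p: np.sqrt(np.mean((y - p) ** 2))
# weights = greedy_ensemble_selection([pred_a, pred_b, pred_c], y_valid, rmse)
```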

1.2 Boosting

The idea behind Boosting is a bit like learning from mistakes: each base learner is trained to make up for the errors of the previous ones. Well-known algorithms include AdaBoost and Gradient Boosting, and Gradient Boosted Trees are built on this idea.
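A minimal sketch of the gradient-boosting idea for squared loss, where each new tree is fit to the residuals of the current ensemble (the data and hyperparameters are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
base_value = y.mean()                    # initial prediction
prediction = np.full_like(y, base_value)
trees = []

for _ in range(100):
    residual = y - prediction            # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                # the new learner targets the residual
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(X_new):
    return base_value + sum(learning_rate * t.predict(X_new) for t in trees)
```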

The workflow I mentioned in 1.2.3 (Error Analysis), namely error analysis > feature extraction > model training > error analysis, is itself a process similar to Boosting.

1.3 Stacking

The idea of Stacking is to train new models (meta-learners) to learn how to combine the base learners; it comes from the paper Stacked Generalization. If Bagging can be viewed as a linear combination of base learners, then Stacking is a nonlinear combination of base learners. Stacking is flexible: learners can be stacked layer by layer to form a network structure, as shown below:

 

A more intuitive example uses the same two addition questions:

 

Here A and B can be regarded as base learners, while C, D, and E are all meta-learners.

  • Stage 1: A and B each give their own answers.
  • Stage 2: C and D look at A’s and B’s answers. C thinks A is as smart as B, while D thinks A is a little smarter than B. Each of them combines A’s and B’s answers and gives their own answer.
  • Stage 3: E looks at C’s and D’s answers, thinks D is smarter than C, and gives its own answer as the final answer (a small numeric sketch of these stages follows this list).
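For concreteness, a tiny numeric sketch of the three stages (the answers and weights are made up; in real Stacking the combinations are learned from data):

```python
# Hypothetical answers from the base learners to one addition question
answer_a, answer_b = 9.0, 11.5               # Stage 1: A and B answer on their own

# Stage 2: C and D each combine A's and B's answers with their own weights
answer_c = 0.5 * answer_a + 0.5 * answer_b   # C trusts A and B equally
answer_d = 0.6 * answer_a + 0.4 * answer_b   # D trusts A a little more

# Stage 3: E combines C and D, trusting D more, and gives the final answer
answer_e = 0.3 * answer_c + 0.7 * answer_d
```

In real Stacking, of course, these weights (or a nonlinear combination of the inputs) are learned by training C, D, and E on the base learners’ predictions.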

 

One thing to be careful about when implementing Stacking is label leakage. When training the next-level learner, the previous learner’s predictions on the Train Data are needed as features. If we train on the Train Data and then predict on that same Train Data, we cause a label leak. To avoid it, each learner should be trained with k-fold cross-validation, and the predictions of the K models on their respective Valid Sets should be stitched together (the out-of-fold predictions) to form the input of the next learner. The diagram below:

 

 

As can be seen from the figure, we also need to make predictions for the Test Data. There are two options: average the predictions of the K fold models on the Test Data, or retrain a new model on all of the Train Data and use it to predict the Test Data. In the implementation it is therefore best to save each learner’s out-of-fold predictions on the Train Data together with its predictions on the Test Data, so that later training and prediction are convenient.
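A minimal sketch of the out-of-fold procedure described above, assuming a scikit-learn-style model with fit/predict; the function name, the averaging of the K fold models’ test predictions (option 1 above), and the seed are all illustrative choices:

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def get_oof_predictions(model, X_train, y_train, X_test, n_splits=5, seed=2017):
    """Out-of-fold predictions for one base learner (a sketch; names are illustrative).

    X_train, y_train, X_test are numpy arrays; `model` is any estimator with
    fit/predict. Returns the learner's leak-free predictions on the Train Data
    and the average of the K fold models' predictions on the Test Data.
    """
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof_train = np.zeros(len(X_train))
    test_preds = np.zeros((n_splits, len(X_test)))

    for i, (fit_idx, valid_idx) in enumerate(kf.split(X_train)):
        fold_model = clone(model)                    # fresh copy for this fold
        fold_model.fit(X_train[fit_idx], y_train[fit_idx])
        oof_train[valid_idx] = fold_model.predict(X_train[valid_idx])  # rows never seen in training
        test_preds[i] = fold_model.predict(X_test)

    oof_test = test_preds.mean(axis=0)               # option 1: average the K fold models
    return oof_train, oof_test
```

The returned arrays can be saved to disk and later stacked column-wise with other learners’ outputs to form the input features of the next-level learner.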

One more point about Stacking: fixing the k-fold split helps minimize overfitting to the Valid Sets. That is, a single K-fold split should be shared globally, or among all members when working as a team. To see why a fixed k-fold split is needed, see here.
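A minimal sketch of sharing a fixed split, assuming the split is defined once with a fixed seed and reused everywhere (the seed value is arbitrary):

```python
from sklearn.model_selection import KFold

# Define the split once, with fixed parameters and a fixed seed, and reuse the
# same split for every learner (sharing the seed or the fold indices with
# teammates), so that all out-of-fold predictions are aligned on identical folds.
SHARED_KFOLD = KFold(n_splits=5, shuffle=True, random_state=2017)
```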

1.4 Blending

Blending is very similar to Stacking; the main difference is that Blending trains the higher-level learner on the base learners’ predictions for a single held-out split of the Train Data rather than on out-of-fold predictions. The two can be distinguished here.
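As a rough sketch of that difference, a Blending setup might look like this, with the base models fit on one part of the Train Data and the meta-model fit on their predictions for a held-out part (all names and the holdout fraction are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def blend(base_models, meta_model, X_train, y_train, X_test, holdout=0.2, seed=2017):
    """A minimal Blending sketch: base models are fit on part of the Train Data,
    and the meta-model is fit on their predictions for a single held-out part."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_train, y_train, test_size=holdout, random_state=seed)

    hold_preds, test_preds = [], []
    for m in base_models:
        m.fit(X_fit, y_fit)                    # base learners see only the fit split
        hold_preds.append(m.predict(X_hold))
        test_preds.append(m.predict(X_test))

    # The meta-model learns how to combine the base learners' predictions.
    meta_model.fit(np.column_stack(hold_preds), y_hold)
    return meta_model.predict(np.column_stack(test_preds))
```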