This article originated from my personal public account: TechFlow. Original content is not easy to produce, so your attention is appreciated.


Today's article is the 25th in our machine learning series, and its topic is AdaBoost.

We have studied several models so far, including three algorithms for generating decision trees. But in every case, we trained a single model and used it alone for prediction. In real life, when we make a decision, we often consult several people and weigh their opinions together. Can we apply this idea to machine learning and train multiple models, then combine their results?

Of course it can, and it’s called the ensemble method.

Ensemble methods

The ensemble method itself is not a specific model or algorithm but an idea for training machine learning models: train multiple models and aggregate their results.

Based on this idea, three specific methods have been derived in the industry, namely Bagging, Boosting and Stacking.

Bagging

Bagging stands for bootstrap aggregating, which is hard to parse literally, so let's just remember the name. In the Bagging method, we create K data sets by sampling with replacement (the "bootstrap" part of the name). Within any one data set, some samples may appear several times and others may not appear at all, but overall every sample has the same probability of being drawn.

After that, we train K models on the K sampled data sets. There are no restrictions on the models here; we can use any machine learning model. K models naturally produce K results, so we aggregate them by a democratic vote.

For example, suppose K is 25 in a binary classification problem. Ten of the models predict 0 and fifteen predict 1, so the final prediction of the whole ensemble is 1. This is equivalent to a democratic vote among the K models, with each model having the same voting power. The famous Random Forest takes this approach.
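As a rough sketch of this voting idea (my illustration, not the article's code; the data set and base model here are just placeholders), Bagging can be written in a few lines:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# toy binary classification data set, used only for illustration
X, y = make_classification(n_samples=500, random_state=0)

# K bootstrap samples -> K models
K, m = 25, X.shape[0]
models = []
for _ in range(K):
    idx = np.random.choice(m, m, replace=True)   # sampling with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# every model gets one equal vote; the majority wins
votes = np.array([model.predict(X) for model in models])
bagging_pred = (votes.mean(axis=0) > 0.5).astype(int)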

Boosting

Boosting is similar to Bagging in how it treats samples, but the difference is that in Boosting the K models are not trained simultaneously; they are trained serially. Based on the results of the previous model, each new model pays more attention to the samples the previous one judged incorrectly. To achieve this, every sample carries a weight, and samples that are misclassified more often receive higher weights.

In addition, each model is given a different weight according to its ability, and the models are combined by a weighted sum rather than an equal vote. This mechanism also affects training efficiency: since the models in Bagging are completely independent, they can be trained in a distributed way, whereas in Boosting each model depends on the result of the previous one, so the models can only be trained in sequence.

Stacking

Stacking is often used in Kaggle competitions, and the idea is very simple. We select K different models and then use cross-validation to train them and produce predictions on the training set, making sure that every model produces a prediction for every training sample. For each training sample, we therefore get K results.

Then we create a second-layer model whose training features are these K results. In other words, Stacking arranges models in layers, and the training features of the final layer are the predictions produced by the layer above it. The second-layer model learns by itself which first-layer results deserve more trust and how to combine the strengths of the different models.
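A minimal sketch of this layered setup (again my illustration; the data set and the particular first- and second-layer models are placeholder choices, not something prescribed by the article):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# toy data set as a placeholder (illustration only)
X, y = make_classification(n_samples=500, random_state=0)

# first layer: K different models, each producing out-of-fold
# predictions for every training sample via cross-validation
first_layer = [DecisionTreeClassifier(max_depth=3),
               KNeighborsClassifier(),
               LogisticRegression(max_iter=5000)]
meta_X = np.column_stack([cross_val_predict(mdl, X, y, cv=5) for mdl in first_layer])

# second layer: trained on the K columns of first-layer predictions
meta_model = LogisticRegression(max_iter=5000).fit(meta_X, y)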

AdaBoost, which we introduce today, is, as its name implies, a classical Boosting algorithm.

The idea behind the model

The core idea of AdaBoost is to build a strong classifier through some weak classifiers by Boosting.

A strong classifier is easy to understand: it is simply a model with strong performance. But what is a weak classifier? The strength of a model is defined relative to random guessing: the more a model outperforms a random result, the stronger it is. From this point of view, a weak classifier is one that is only slightly better than random guessing. Our goal is to combine the outputs of these weak classifiers into the effect of a strong classifier by carefully designing the weights of both the samples and the models, so that the best decisions can be made.
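In the conventional formulation with ±1 labels (shown here only for intuition; the implementation below uses 0/1 labels and a 0.5 threshold instead), the combined strong classifier is

$$H(x) = \operatorname{sign}\left(\sum_{k=1}^{K} \alpha_k\, h_k(x)\right)$$

where the $h_k$ are the weak classifiers and the $\alpha_k$ are their weights.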

First, we assign a weight to each training sample; at the beginning all weights are equal. A weak classifier is trained on the weighted samples and its error rate is calculated. Then another weak classifier is trained on the same data set, but with adjusted sample weights: the weights of correctly classified samples decrease, while the weights of misclassified samples increase.

Also, each classifier is assigned a weight value: the higher the weight, the greater its say in the final result. This weight is based on the classifier's error rate, which is defined as:

$$\varepsilon = \frac{|D_{\mathrm{wrong}}|}{|D|}$$

Here $D$ stands for the data set and $D_{\mathrm{wrong}}$ for the set of misclassified samples, so the error rate is simply the number of misclassified samples divided by the total number of samples.

With the error rate $\varepsilon$, we can compute the classifier's weight $\alpha$ with the following formula:

$$\alpha = \frac{1}{2}\ln\left(\frac{1-\varepsilon}{\varepsilon}\right)$$
Once we have $\alpha$, we use it to update the weight of each sample. The weight of a correctly classified sample is changed as:

$$w_i \leftarrow \frac{w_i\, e^{-\alpha}}{\sum_j w_j}$$
The weight of a misclassified sample is changed to:

$$w_i \leftarrow \frac{w_i\, e^{\alpha}}{\sum_j w_j}$$
In this way all the weights are updated, which completes one iteration. AdaBoost keeps iterating and adjusting the weights until the training error rate reaches zero or the number of weak classifiers reaches a preset limit.
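As a concrete illustration (the numbers are made up for this example only): if a stump misclassifies 3 out of 10 samples, then $\varepsilon = 0.3$ and $\alpha = \tfrac{1}{2}\ln\frac{0.7}{0.3} \approx 0.42$, so the weights of correctly classified samples are multiplied by $e^{-0.42} \approx 0.65$ and the weights of misclassified samples by $e^{0.42} \approx 1.53$, after which all weights are renormalized to sum to 1.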

Code implementation

First, we get the data. Here we use the breast cancer prediction data set that ships with sklearn. As in previous examples, we can simply import it, which is very convenient:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()
X, y = breast.data, breast.target
y = y.reshape((-1, 1))

Next, we split the data into training data and test data. This is also a common practice, and there is no difficulty:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In the AdaBoost model, the weak classifier we choose is a decision-tree stump. A stump is a decision tree with a depth of one. A tree of depth 1 obviously cannot perform particularly well no matter how the threshold is chosen, but since we still get to choose both the feature and the threshold, the result is not too bad either, at least better than random guessing. This guarantees that we get a weak classifier that works slightly better than random, and it is very simple to implement.
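As a side note (not part of this article's implementation), sklearn offers the same kind of weak learner out of the box as a depth-1 tree, assuming the X_train and y_train defined above:

from sklearn.tree import DecisionTreeClassifier

# a decision stump is simply a tree restricted to a single split
sk_stump = DecisionTreeClassifier(max_depth=1)
sk_stump.fit(X_train, y_train.ravel())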

Before we implement the model, let’s implement a few helper functions.

def loss_error(y_pred, y, weight):
    # weighted error: the total weight of the misclassified samples
    return weight.T.dot(y_pred != y).item()

def stump_classify(X, idx, threshold, comparator):
    # predict by comparing feature idx against a threshold;
    # 'lt' treats samples at or below the threshold as positive,
    # 'gt' treats samples above the threshold as positive
    if comparator == 'lt':
        return X[:, idx] <= threshold
    else:
        return X[:, idx] > threshold

def get_thresholds(X, i):
    # 10 evenly spaced candidate thresholds over the range of feature i
    min_val, max_val = X[:, i].min(), X[:, i].max()
    return np.linspace(min_val, max_val, 10)

These three functions should be easy to understand. In the first one we calculate the error of the model; since every sample carries its own weight, we take a weighted sum of the errors. The second function is the prediction function of the stump classifier. The logic is very simple: it compares one feature against a threshold. There are two possible cases, because either the samples below the threshold or the samples above it may be the positive ones, so we need a third parameter to record which case we are in. The third function generates candidate thresholds. Since we do not need the stump to perform particularly well, we do not need to try every possible threshold; we simply divide the range of the feature into 10 segments.
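To see the helpers in action, here is a quick example of my own using the split created above; the feature index and threshold are arbitrary:

# uniform initial weights, 10 candidate thresholds for feature 0,
# and the weighted error of one particular stump
weight = np.ones((y_train.shape[0], 1)) / y_train.shape[0]
thresholds = get_thresholds(X_train, 0)
pred = stump_classify(X_train, 0, thresholds[5], 'lt').reshape((-1, 1))
print(loss_error(pred, y_train, weight))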

Next comes the generating function of a single tree stump, which is equivalent to the function of data splitting by selecting features in a decision tree. The logic is basically the same, with only minor modifications.

def build_stump(X, y, weight):
    m, n = X.shape
    ret_stump, ret_pred = None, []
    best_error = float('inf')

    # enumerate features
    for i in range(n):
        # enumerate candidate thresholds
        for j in get_thresholds(X, i):
            # enumerate the two comparison directions
            for c in ['lt', 'gt']:
                # predict and calculate the weighted error
                pred = stump_classify(X, i, j, c).reshape((-1, 1))
                err = loss_error(pred, y, weight)
                # record the best stump so far
                if err < best_error:
                    best_error, ret_pred = err, pred.copy()
                    ret_stump = {'idx': i, 'threshold': j, 'comparator': c}
    return ret_stump, best_error, ret_pred

The next thing to do is to repeatedly generate stumps, calculate $\alpha$ for each of them, and update the weight of each sample. The whole process is not difficult and basically follows the formulas above:

def adaboost_train(X, y, num_stump):
    stumps = []
    m = X.shape[0]
    # initialize the sample weights so that they are all equal at first
    weight = np.ones((m, 1)) / m
    # generate num_stump stumps
    for i in range(num_stump):
        best_stump, err, pred = build_stump(X, y, weight)
        # calculate alpha from the weighted error
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        best_stump['alpha'] = alpha
        stumps.append(best_stump)

        # update the weight of each sample
        for j in range(m):
            weight[j] = weight[j] * (np.exp(-alpha) if pred[j] == y[j] else np.exp(alpha))
        weight = weight / weight.sum()
        # if the error is already essentially zero, stop early
        if err < 1e-8:
            break
    return stumps

After the stumps are generated, the final piece is the prediction. The whole prediction process is still very simple: it is a weighted sum. Note that the sample weights we maintained during training exist only to emphasize the samples that were predicted incorrectly and make the model stronger; at prediction time there are no sample weights, so we only weight the outputs of the individual stumps.

def adaboost_classify(X, stumps):
    m = X.shape[0]
    pred = np.zeros((m, 1))
    alphs = 0.0
    for stump in stumps:
        y_pred = stump_classify(X, stump['idx'], stump['threshold'], stump['comparator'])
        # alpha-weighted summation of the stump predictions
        pred += y_pred.reshape((-1, 1)) * stump['alpha']
        alphs += stump['alpha']
    # normalize by the total alpha, giving a score between 0 and 1
    pred /= alphs
    # classify as 0 or 1 according to the 0.5 threshold
    return (pred > 0.5).astype(int).reshape((-1, 1))

At this point, our whole model is completed. Let’s first look at the performance of a single tree stump on the training set:


As you can see, the accuracy is only 0.54, just a little better than random prediction.


However, when we combined the results of 20 stumps, we got an accuracy of 0.9 on the training set. On the test set it did even better, with an accuracy of nearly 0.95!
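Here is a sketch of how such figures can be reproduced (my addition; the exact numbers depend on the random split and the number of stumps):

# simple accuracy helper for the 0/1 predictions returned above
def accuracy(y_pred, y_true):
    return (y_pred == y_true).mean()

stumps = adaboost_train(X_train, y_train, 20)
print('train accuracy:', accuracy(adaboost_classify(X_train, stumps), y_train))
print('test accuracy: ', accuracy(adaboost_classify(X_test, stumps), y_test))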


This is because every classifier in AdaBoost is a weak classifier with very limited fitting capacity; after all, each one performs poorly even on the training set. This makes it hard to overfit, so what the ensemble learns is genuine generalization ability: what works on the training set is very likely to work on the test set as well. This is one of the biggest advantages of the ensemble approach.

Conclusion

Ensemble methods can be said to be a very important leap in the field of machine learning. Their emergence greatly reduces the difficulty of designing a strong classifier while still guaranteeing good model performance.

In some domains it can be very difficult to design a strong classifier directly, but much easier to design a weak one. In addition, the resulting model performs well and is not prone to overfitting. Before deep learning became popular, ensemble methods were widely used, and almost all champions of machine learning competitions relied on ensemble learning.

The specific techniques of ensemble learning differ, but the core idea is the same. Once we understand AdaBoost, it is much easier to move on to other ensemble models.

If you like this article, please give it a click below to encourage me; it also makes it easier for you to find more articles.

This article is formatted using MDNICE