The decision tree is a very common and effective machine learning algorithm. It is easy to understand and to interpret, and it can be used both as a classification algorithm and in regression models. This series is divided into three parts: the first part introduces the basic trees (ID3, C4.5 and CART), the second part introduces Random Forest, Adaboost and GBDT, and the third part introduces Xgboost and LightGBM.

Chapter two: Random Forest, Adaboost, GBDT

This part mainly introduces decision trees based on ensemble learning, which generates base learners through different learning frameworks and combines the predictions of all base learners to improve on the accuracy and generalization ability of a single base learner.

Ensemble learning

There are three common ensemble learning frameworks: Bagging, Boosting, and Stacking. The three frameworks differ in how the base learners are generated and how their results are combined, so let's start with a brief introduction to each.

1.1 Bagging

Bagging stands for Bootstrap aggregating. Its randomness comes from Bootstrap sampling: each base learner draws a sub-training set from the training set by sampling with replacement, a well-known scheme being the 0.632 bootstrap. Each base learner is trained on a different sub-training set, and the final prediction is obtained by combining the predictions of all base learners. Bagging most commonly aggregates by voting: the category with the most votes is the prediction.
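To make this concrete, here is a minimal sketch of Bagging for classification (not from the original article), assuming NumPy arrays with integer class labels, scikit-learn decision trees as base learners, and majority voting as the aggregator:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, random_state=0):
    """Train n_estimators base learners, each on a bootstrap sample."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)                 # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine base learners by majority vote (labels assumed to be 0..K-1)."""
    preds = np.stack([m.predict(X) for m in models])     # shape (n_estimators, n_samples)
    vote = lambda col: np.bincount(col.astype(int)).argmax()
    return np.apply_along_axis(vote, 0, preds)
```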

1.2 Boosting

Boosting training is a stage-wise process, and the base models are trained sequentially. Each base model learns on the basis of the previous one, and the final prediction is produced by combining the predictions of all base models, most commonly by weighting them.

1.3 Stacking

Stacking trains the base models on all of the data, and then each base model makes a prediction for every training sample. These predicted values are used as the features of that training sample, producing a new training set. A second-level model is then trained on the new training set to produce the final prediction.
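A minimal sketch of this two-level procedure, using placeholder base models from scikit-learn and a logistic regression as the second-level model (in practice, out-of-fold predictions are usually used as the new features to avoid leakage):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y):
    # Level 0: train each base model on all of the data.
    base_models = [DecisionTreeClassifier().fit(X, y),
                   KNeighborsClassifier().fit(X, y)]
    # Each base model predicts every training sample; the predictions
    # become the feature values of a new training set.
    Z = np.column_stack([m.predict(X) for m in base_models])
    # Level 1: train the meta-model on the new training set.
    meta = LogisticRegression().fit(Z, y)
    return base_models, meta

def stacking_predict(base_models, meta, X):
    Z = np.column_stack([m.predict(X) for m in base_models])
    return meta.predict(Z)
```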

So why does ensemble learning work better than a single learner? There may be three reasons:

  1. The training samples may not provide enough information to pick out the single best learner, so several learners are simply combined.
  2. Even if the best single learner exists, the limitations of the algorithm's search may prevent it from reaching the optimal solution; ensemble learning can make up for this deficiency of the algorithm.
  3. The algorithm may be unable to represent the optimal solution at all, while ensemble learning can approximate it. For example, if the optimal decision boundary is a diagonal line, a single decision tree can only produce boundaries parallel to the coordinate axes, but an ensemble of trees can fit this diagonal line.

Bias and variance

In the last section, we introduced the basic concepts of ensemble learning. In this section, we will focus on understanding ensemble learning from the perspective of bias and variance.

2.1 Bias and variance of ensemble learning

Bias describes the difference between a predicted value and a true value; Variance describes the degree of dispersion of predicted values as random variables. Here’s a classic picture:

[Figure: the bias and variance of the model, illustrated as shots at a bull's-eye]

  • Bias: describes the gap between the expected prediction of the model fitted on the sample and the true values. To get the bias low, the model must be made more complex, with more parameters, but that makes it easy to overfit; overfitting corresponds to High Variance in the figure above, where the points are scattered. Low bias means the points all land near the bull's-eye: the aim is accurate, but not necessarily steady.
  • Variance: describes how the model trained on the sample performs on the test set. To get the variance low, the model must be simplified and its complexity reduced, but that makes it easy to underfit; underfitting corresponds to High Bias in the figure above, where the points deviate from the center. Low variance means the points are tightly concentrated, but not necessarily near the bull's-eye: the hand is steady, but not necessarily aimed accurately.

It is often said that the base models in ensemble learning are weak models. Generally speaking, a weak model is a model with high bias (low accuracy on the training set) and small variance (strong ability to resist over-fitting). However, not all base models in ensemble learning frameworks are weak models: the base models in Bagging and Stacking are strong models (low bias, high variance), while the base models in Boosting are weak models (high bias, low variance).

In the Bagging and Boosting frameworks, the expectation and variance of the overall model can be derived from the expectation and variance of the base models. To simplify the derivation, assume there are m base models f_i, each with the same weight γ, expectation μ_i, variance σ², and pairwise correlation coefficient ρ. Since in both Bagging and Boosting the base models are combined linearly, we obtain the following.

Overall expectation of the model:

$$E(F) = E\Big(\sum_{i=1}^{m}\gamma f_i\Big) = \gamma\sum_{i=1}^{m}\mu_i$$

Overall variance of the model:

$$Var(F) = Var\Big(\sum_{i=1}^{m}\gamma f_i\Big) = m^2\gamma^2\sigma^2\rho + m\gamma^2\sigma^2(1-\rho)$$

2.2 Bias and variance of Bagging

For Bagging, the weight of each base model equals 1/m and the expectations are approximately equal (all sub-training sets are drawn from the same original training set). Therefore, we can further simplify to obtain:

$$E(F) = \sum_{i=1}^{m}\frac{1}{m}\mu_i \approx \mu$$

$$Var(F) = m^2\cdot\frac{1}{m^2}\cdot\sigma^2\rho + m\cdot\frac{1}{m^2}\cdot\sigma^2(1-\rho) = \sigma^2\rho + \frac{\sigma^2(1-\rho)}{m}$$

From the above equations we can see that:

  • The expectation of the overall model equals the (approximately common) expectation of the base models, which means the bias of the overall model is approximately the bias of a base model.
  • The variance of the overall model is less than or equal to that of a base model, with equality if and only if the correlation ρ equals 1. As the number of base models m increases, the variance of the overall model decreases, so the ability to resist over-fitting is strengthened and the accuracy of the model improves. But does the accuracy keep approaching 1 without limit? Not necessarily: once the number of base models is large enough, further changes in the second term of the variance formula, σ²(1−ρ)/m, have little effect on the overall variance (see the numerical sketch after this list); the ability to resist overfitting reaches its limit, and so does the accuracy.
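To see this saturation numerically, here is a small sketch that plugs hypothetical values (σ² = 1, ρ = 0.5) into the Bagging variance formula above:

```python
# Bagging variance: Var(F) = sigma2 * rho + sigma2 * (1 - rho) / m
sigma2, rho = 1.0, 0.5          # hypothetical base-model variance and correlation
for m in [1, 2, 5, 10, 50, 100, 1000]:
    var_f = sigma2 * rho + sigma2 * (1 - rho) / m
    print(f"m = {m:>4}: Var(F) = {var_f:.4f}")
# The output approaches sigma2 * rho = 0.5: beyond a certain point,
# adding more base models no longer reduces the overall variance.
```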

This also shows why the base models in Bagging must be strong models: the bias of the overall model stays at the level of the base models, so if Bagging used weak (high-bias) models, the overall model would inherit that high bias and its accuracy would be low.

Random Forest is a classic model built on the Bagging framework. On top of Bagging, it introduces feature sub-sampling in addition to sample sub-sampling to reduce the correlation ρ between base models. In the variance formula, the first term σ²ρ is significantly reduced while the second term σ²(1−ρ)/m increases only slightly, so the overall variance of the model is reduced.

2.3 Bias and variance of Boosting

For Boosting, the base models are trained on strongly correlated samplings of the training set, so the correlation coefficient between base models is approximately equal to 1. Setting ρ = 1, the Boosting formulas simplify to:

$$E(F) = \gamma\sum_{i=1}^{m}\mu_i$$

$$Var(F) = m^2\gamma^2\sigma^2\cdot 1 + m\gamma^2\sigma^2\cdot(1-1) = m^2\gamma^2\sigma^2$$

By observing the expression of the overall variance, we can easily find:

  • The variance of the overall model is governed directly by the variance of the base models (with weights of 1/m it equals the base-model variance σ² exactly). If the base models are not weak models, their variance is relatively large, which leads to a large variance of the overall model, i.e., over-fitting cannot be prevented. Therefore, the base models in the Boosting framework must be weak models.
  • In addition, the Boosting framework uses forward stage-wise addition based on a greedy strategy, and the expectation of the overall model is accumulated from the expectations of the base models. Therefore, as the number of base models increases, the expectation of the overall model gets closer to the true values and its accuracy improves.

In the Gradient Boosting Decision Tree model built on the Boosting framework, the base model is also a tree model. As with Random Forest, we can randomly sub-sample the features to reduce the correlation between base models and thereby reduce the variance.

2.4 Summary

  • We can use the bias and variance of a model to approximately describe its accuracy.
  • For Bagging, the bias of the overall model is similar to that of the base models, while the variance of the overall model decreases as the number of base models increases, so the base models need to be strong models.
  • For Boosting, the variance of the overall model is approximately equal to that of the base models, while the bias of the overall model is reduced by accumulating base models, so the base models need to be weak models.

Random Forest

Random Forest builds a forest in a random way. The RF algorithm consists of many decision trees with no correlation between them. Once the forest has been built, when a new sample arrives, each decision tree makes its own judgment, and the classification result is then given by voting.

3.1 The idea

Random Forest is an extended variant of Bagging. On top of building a Bagging ensemble with decision trees as the base learners, Random Forest further introduces random feature selection into the training of each decision tree. RF can therefore be summarized in four steps:

  1. Randomly select samples (sampling with replacement);
  2. Randomly select features;
  3. Construct a decision tree;
  4. Vote across the random forest (or average, for regression).

Random sample selection works the same way as in Bagging: Bootstrap sampling is used. Random feature selection means that at every node, a random subset of the features is considered when splitting (as opposed to selecting a random feature subset once per tree). A minimal usage sketch is shown below.
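Since the per-node feature sub-sampling is normally handled by a library, here is a minimal usage sketch with scikit-learn's RandomForestClassifier, where max_features controls the size of the random feature subset considered at each split; the data set and parameter values are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators: number of trees; max_features: size of the random feature
# subset examined at every split (the extra source of randomness);
# bootstrap=True: each tree sees a bootstrap sample of the rows.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            bootstrap=True, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```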

This randomness causes a slight increase in the bias of the random forest (compared with a single non-random tree), but thanks to the "averaging" effect its variance decreases, and the decrease in variance more than compensates for the increase in bias, so overall it is a better model.

Because two kinds of random sampling are introduced to guarantee randomness, each tree can be grown as deep as possible without pruning, and serious over-fitting still does not occur.

3.2 Advantages and disadvantages

Advantages

  1. It performs well on many data sets and has clear advantages over other algorithms;
  2. It is easy to parallelize, which is a great advantage on large data sets;
  3. It can handle high-dimensional data without explicit feature selection.

AdaBoost

AdaBoost (Adaptive Boosting) is adaptive in that the samples misclassified by the previous basic classifier are given larger weights, and the re-weighted samples are used to train the next basic classifier. At the same time, a new weak classifier is added in each round until a predetermined, sufficiently small error rate or the predetermined maximum number of iterations is reached.

4.1 The idea

The Adaboost iterative algorithm has three steps:

  1. Initialize the weight distribution of the training samples, giving every sample the same weight;
  2. Train a weak classifier; if a sample is classified correctly, its weight is decreased when constructing the next round's training set, otherwise it is increased. The next classifier is then trained with the updated sample weights;
  3. Combine all weak classifiers into a strong classifier: after each weak classifier is trained, the weight of weak classifiers with a small classification error rate is increased, and the weight of those with a large error rate is decreased. A minimal sketch of these steps follows this list.
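Here is a minimal sketch of these three steps for binary labels y ∈ {−1, +1}, using depth-1 scikit-learn trees (stumps) as the weak classifiers; the formulas for the classifier weight and the sample-weight update follow the derivation in the next subsection:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """y must be in {-1, +1}. Returns the weak classifiers and their weights."""
    n = len(X)
    w = np.full(n, 1.0 / n)                     # step 1: uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # step 2: train a weak classifier
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)   # weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)       # weight of this classifier
        w = w * np.exp(-alpha * y * pred)           # raise weights of mistakes
        w = w / np.sum(w)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # step 3: weighted vote of all weak classifiers
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(score)
```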

4.2 Details

4.2.1 Loss function

The Adaboost model is an additive model, its learning algorithm is the forward stage-wise algorithm, and its loss function is the exponential function.

Additive model: the final strong classifier is a weighted combination of several weak classifiers:

$$f(x) = \sum_{k=1}^{K}\alpha_k G_k(x)$$

Forward stage-wise algorithm: the algorithm uses the result of the previous weak learner to update the sample weights of the training set for the next weak learner. The strong learner of round k is:

$$f_k(x) = f_{k-1}(x) + \alpha_k G_k(x)$$

Loss function:

$$L\big(y, f(x)\big) = \exp\big(-y f(x)\big)$$

Using the forward stage-wise relation, the objective in round k can be written as:

$$(\alpha_k, G_k(x)) = \arg\min_{\alpha, G}\sum_{i=1}^{N}\exp\Big[-y_i\big(f_{k-1}(x_i) + \alpha G(x_i)\big)\Big]$$

Let $\bar{w}_{k,i} = \exp\big(-y_i f_{k-1}(x_i)\big)$. Its value does not depend on $\alpha$ or $G$, and therefore has nothing to do with the minimization; it depends only on $f_{k-1}(x)$ and changes with each iteration. Plugging this expression into the loss function, it is converted to:

$$(\alpha_k, G_k(x)) = \arg\min_{\alpha, G}\sum_{i=1}^{N}\bar{w}_{k,i}\exp\big(-y_i\alpha G(x_i)\big)$$

Solving for $G_k(x)$ first, we get:

$$G_k(x) = \arg\min_{G}\sum_{i=1}^{N}\bar{w}_{k,i}\,I\big(y_i \neq G(x_i)\big)$$

Substituting $G_k(x)$ back into the loss function and setting the derivative with respect to $\alpha$ to zero, we get:

$$\alpha_k = \frac{1}{2}\ln\frac{1-e_k}{e_k}$$

where $e_k$ is the weighted classification error rate mentioned above:

$$e_k = \frac{\sum_{i=1}^{N}\bar{w}_{k,i}\,I\big(y_i \neq G_k(x_i)\big)}{\sum_{i=1}^{N}\bar{w}_{k,i}}$$

Finally, update the sample weights. Using $f_k(x) = f_{k-1}(x) + \alpha_k G_k(x)$ and $\bar{w}_{k,i} = \exp\big(-y_i f_{k-1}(x_i)\big)$, we get:

$$\bar{w}_{k+1,i} = \bar{w}_{k,i}\exp\big(-y_i\alpha_k G_k(x_i)\big)$$

In this way, the sample weight update formula is obtained.

4.2.2 Regularization

To prevent Adaboost from over-fitting, we usually add a regularization term, commonly called the learning rate and denoted ν. The weak-learner iteration given above was:

$$f_k(x) = f_{k-1}(x) + \alpha_k G_k(x)$$

Adding the regularization term, this becomes:

$$f_k(x) = f_{k-1}(x) + \nu\,\alpha_k G_k(x)$$

ν takes values in (0, 1]. To reach the same fit on the training set, a smaller ν requires more weak-learner iterations. In practice, the step size ν and the maximum number of iterations are used together to control how well the algorithm fits.
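As a usage-level illustration (not part of the derivation above), scikit-learn's AdaBoostClassifier exposes these two knobs directly, the learning rate ν and the number of iterations, which are tuned together:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# A smaller learning_rate (the nu above) shrinks each weak learner's
# contribution, so more iterations (n_estimators) are usually needed
# to reach the same training fit.
clf = AdaBoostClassifier(n_estimators=500, learning_rate=0.1)
print(cross_val_score(clf, X, y, cv=5).mean())
```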

4.3 Advantages and disadvantages

Advantages

  1. High classification accuracy;
  2. The weak learners can be built from various regression or classification models, which is very flexible;
  3. It is not prone to over-fitting.

Disadvantages

  1. Sensitive to outliers, since outliers keep receiving higher weights.

GBDT

Gradient Boosting Decision Tree (GBDT) is an iterative decision-tree algorithm composed of several decision trees. As its name indicates, it belongs to the Boosting family. GBDT is widely recognized as an algorithm with strong generalization ability.

5.1 The idea

GBDT involves three concepts: the Regression Decision Tree (DT), Gradient Boosting (GB), and Shrinkage (an important refinement).

5.1.1 Regression Decision Tree

It would be a mistake to assume that the trees in GBDT are classification trees (although GBDT can be adapted for classification). For classification trees, adding or subtracting outputs is meaningless (e.g., gender), whereas for regression trees it is meaningful (e.g., age). The core of GBDT is to sum the outputs of all the trees as the final result, so the trees in GBDT are regression trees, not classification trees; this point is quite important.

When a regression tree splits a node, it exhaustively tries every threshold of every feature to find the best split point, measured by the minimum mean squared error, as sketched below.
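A minimal sketch of that exhaustive search for a single split, assuming plain NumPy arrays; real implementations add many efficiency tricks on top of this:

```python
import numpy as np

def best_split(X, y):
    """Exhaustively try every threshold of every feature and return the
    split that minimizes the total squared error of the two halves."""
    best = (None, None, np.inf)              # (feature index, threshold, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best[2]:
                best = (j, thr, err)
    return best
```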

5.1.2 Gradient Iteration (Boosting)

As mentioned above, the core of GBDT is to sum the outputs of all trees as the final result. Each tree in GBDT is fitted to the residual left by the previous trees, so that the outputs of all trees add up to GBDT's predicted value.

The predicted value of the model can be expressed as:

$$F(x) = \sum_{i=1}^{m} f_i(x)$$

where each $f_i(x)$ is a base model multiplied by its weight. The training goal of the model is to make the predicted value close to the true value, i.e., to make each base model predict the part of the true value that is still left to predict. If all the base models had to be considered simultaneously, training the whole model would become a very complicated problem, so researchers adopted a greedy solution: train one base model at a time. The overall model is therefore rewritten iteratively as:

$$F_i(x) = F_{i-1}(x) + f_i(x)$$

In this way, each iteration only has to solve the training problem of one base model: make $F_i(x)$ approximate the true value $y$.

For example, if user A is 20 years old and the first tree predicts 12, the residual is 8; the second tree then learns with 8 as its target, and if its prediction is 5, the remaining residual is 3, and learning can continue in the same way.

So where does the Gradient come in? The residual is in fact the negative gradient of the squared-error loss function with respect to the predicted value:

$$L\big(y, F(x)\big) = \frac{1}{2}\big(y - F(x)\big)^2, \qquad -\frac{\partial L\big(y, F(x)\big)}{\partial F(x)} = y - F(x)$$

In other words, if each base model fits this negative gradient, the squared-error loss decreases and the prediction accuracy improves.
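Here is a minimal sketch of this residual (negative-gradient) fitting under the squared loss, using shallow scikit-learn regression trees as base models; the shrinkage factor nu anticipates the Shrinkage idea discussed below:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, nu=0.1, max_depth=3):
    """Squared-loss gradient boosting: each tree fits the current residual,
    i.e. the negative gradient y - F(x)."""
    f0 = np.mean(y)                    # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred            # negative gradient of 1/2 (y - F(x))^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += nu * tree.predict(X)   # shrinkage: take a small step
        trees.append(tree)
    return f0, trees

def gbdt_predict(f0, trees, X, nu=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += nu * tree.predict(X)
    return pred
```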

However, it should be noted that residual-based GBDT is sensitive to outliers: if one training sample has an abnormally large error, its residual stays large, and the subsequent models will pay too much attention to that sample, which is not a good phenomenon. Therefore, for regression problems the squared loss is generally replaced with the absolute loss or the Huber loss.

The Boosting in GBDT is different from the Boosting in Adaboost: instead of re-weighting samples explicitly, each round of residual computation effectively increases the weight of the samples that are still fitted poorly, while the attention paid to samples that are already fitted well tends to zero, so that the later trees can focus on the poorly fitted samples.

5.1.3 Shrinkage

The idea behind Shrinkage is that approaching the target gradually with many small steps avoids over-fitting more easily than taking one big step at a time. In other words, it does not fully trust each individual residual tree.

$$F_i(x) = F_{i-1}(x) + \nu\, f_i(x), \qquad 0 < \nu < 1$$

Rather than correcting the error with the full residual at once, Shrinkage corrects it little by little, cutting a big step into many small ones. In essence, each tree's output is multiplied by a weight ν; as this weight decreases, the number of base models needed grows.

5.2 Advantages and disadvantages

Advantages

  1. It can automatically combine features and fit non-linear data;
  2. It can flexibly handle various types of data.

Disadvantages

  1. Sensitive to outliers.

5.3 Comparison with Adaboost

Similarities:

  1. Both are members of the Boosting family and both use weak classifiers;
  2. Both use the forward stage-wise algorithm.

Differences:

  1. The iteration ideas are different: Adaboost compensates for the model's weaknesses by increasing the weights of misclassified samples (exploiting the erroneous samples), while GBDT compensates by fitting the negative gradient (exploiting the residuals);
  2. The loss functions are different: AdaBoost uses the exponential loss, while GBDT typically uses the squared loss, the absolute loss, or the Huber loss.

Reference

What are the differences and connections between GBDT and Adaboost in machine learning algorithms? – Frankenstein’s answer – Zhihu
