This article originally appeared on my personal public account, TechFlow. Original writing is not easy; your follow is appreciated.


Today's article is the 30th in the machine learning series. We will talk about GBDT, arguably one of the most powerful models of the machine learning era.

Of course, there is no single best model in machine learning. But before deep learning emerged and became popular, GBDT was widely recognized as one of the most effective models. Even now, in the era of deep learning and artificial intelligence, GBDT has not fallen behind; it is still widely used in many scenarios and companies, and it is one of the models most frequently asked about in interviews.

Unfortunately, although there is a lot of material about GBDT available, few sources explain its core ideas clearly. Beginners are often confused by concepts such as "gradient" and "residual" at the very start, which delays their understanding of the algorithm's principles. In fact, the overall principle of GBDT is fairly intuitive and simple; as long as we find the right approach and grasp the core idea, it should not be a problem for the vast majority of readers.

GBDT basic concepts

GBDT stands for Gradient Boosting Decision Tree. From its English name we can see that GBDT is based on decision trees, which have been discussed in detail in previous articles and won't be repeated here. GBDT mainly uses the CART algorithm for its decision trees. In the CART algorithm, at each split we choose a feature and find a threshold for it; according to the threshold, the samples are divided into two parts, those less than or equal to the threshold and those greater than it. In a CART tree the same feature can be used repeatedly, while similar algorithms such as ID3 and C4.5 do not have this property.
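
As a quick illustration, here is a minimal sketch of a single CART-style split on a regression target, using made-up toy numbers rather than any data from the article: one feature and one threshold divide the samples into two groups, and for squared error each group's best constant prediction is the mean of its targets.

```python
import numpy as np

# Toy data (assumed for illustration): one feature x and a regression target y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.0, 6.0, 7.0, 20.0, 21.0, 22.0])

# A CART-style split: samples with x <= threshold go left, the rest go right.
threshold = 3.5
left_y, right_y = y[x <= threshold], y[x > threshold]

# Under squared error, each child node's best constant prediction is its mean.
print(left_y.mean(), right_y.mean())   # 6.0 21.0
```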

The other key word is Boosting. Boosting is a training method for ensemble models, which we discussed earlier when introducing the AdaBoost model. Its biggest characteristic is that it trains multiple models and reduces the bias of the overall model through repeated iteration. For example, the AdaBoost model builds multiple weak classifiers and assigns them different weights according to their performance. This design gives the better-performing classifiers as much weight as possible, thereby ensuring the fitting ability of the overall model.

GBDT's Boosting method is different: it is an additive model composed of multiple CART regression trees. We can simply understand that the final prediction of the whole model is the sum of the predictions of all the regression trees. Understanding this point is very important for understanding gradients and residuals later.

We can write the prediction formula of GBDT as:

$$f_M(x) = \sum_{i=1}^{M} T(x;\, \Theta_i)$$

In the formula, $M$ is the number of CART trees, $T(x;\,\Theta_i)$ is the prediction of the $i$-th regression tree for the sample $x$, and $\Theta_i$ denotes the parameters of each regression tree. So the whole process is just as described above: the final output of the GBDT model is the sum of the predictions of all the regression trees.
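
To make the additive idea concrete, here is a minimal sketch in which stand-in functions play the role of trained trees (the outputs are purely hypothetical): the GBDT prediction is simply the sum of the individual tree predictions.

```python
# Stand-in "trees": each one maps a sample to its own predicted value.
trees = [lambda x: 10.0, lambda x: 5.0, lambda x: 3.0]   # hypothetical outputs

def gbdt_predict(x):
    # The additive model: the final prediction is the sum over all regression trees.
    return sum(tree(x) for tree in trees)

print(gbdt_predict([1.0]))   # 18.0
```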

But then a question arises: this is fine for regression, but what about classification? Can class labels simply be added up?

It turns out this also works. Recall that in logistic regression the formula we use is $P(y=1 \mid x) = \dfrac{1}{1 + e^{-\theta^T x}}$, and its result is the probability that the sample belongs to class 1. Of course we cannot fit this probability directly by adding trees, but we can fit the score inside it (the log-odds) by summing the trees, and then obtain the probability indirectly.
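
As a brief, hedged preview of the classification case (the details are covered in the next article), the summed output of the trees can be treated as a score and passed through the sigmoid to obtain the class-1 probability; the numbers below are purely illustrative.

```python
import math

def sigmoid(z):
    # Maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

score = 1.2 + 0.5 - 0.3        # hypothetical sum of tree outputs for one sample
print(sigmoid(score))          # probability that the sample belongs to class 1
```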

In today's article we mainly cover the regression problem, because its formulas and intuition are the most straightforward. Classification will be left to the next article, so here we only mention it briefly.

Gradients and residuals

Now let us introduce the concepts of gradient and residual. First, let's review how gradient descent is used in linear regression.

In linear regression, we use gradient descent to find the best parameters and minimize the loss function. In fact, most models today do the same thing: they compute gradients in order to adjust parameters. But GBDT is different: the gradient is computed to serve as the training target of the next iteration. This sentence is very important; make sure you understand it.

For example, suppose we use linear regression to fit a value and the target y is 20. Our current prediction is 10, so we compute the gradient and adjust the parameters, obviously pushing the prediction higher to reduce the error.

But GBDT does not work that way. Again, suppose our first regression tree outputs 10, which differs from the true value by 10, and let us compute the gradient as well. In regression problems we usually use the mean squared error (MSE) as the loss function, so we can compute its gradient. Let's first write down the loss function:

$$L\big(y, f(x)\big) = \frac{1}{2}\,\big(y - f(x)\big)^2$$

The negative gradient of $L$ with respect to the model's current prediction $f(x)$ is exactly $y - f(x)$, which is exactly the value we want to predict minus the value our model predicted. This quantity is what we call the residual.

We use $r_{mi}$ to represent the training target of sample $i$ for the $m$-th regression tree, and its formula is:

$$r_{mi} = -\left[\frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)}\right]_{f = f_{m-1}} = y_i - f_{m-1}(x_i)$$

Intuitively it is very simple: we want to predict 20, the first tree predicted 10, and there is still 10 left over, so we use a second tree to approximate it. The second tree predicts 5, the remaining difference becomes 5, and we go on to build a third tree…

Training stops when the remaining difference drops below our threshold, or when the number of subtrees reaches the upper limit.

It should be noted that the residual cannot simply be understood as the difference between the target value and the prediction; it is essentially obtained by computing the negative gradient of the loss function.
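
A tiny numeric check, using the toy numbers from the example above and the squared-error loss with the 1/2 factor, shows that the negative gradient with respect to the current prediction is exactly the residual the next tree is trained on.

```python
# Squared-error loss: L = 0.5 * (y - f)^2, so dL/df = -(y - f).
y, f = 20.0, 10.0              # target and current model prediction

gradient = -(y - f)            # derivative of the loss w.r.t. the prediction f
residual = y - f               # training target for the next regression tree

print(-gradient == residual)   # True: the negative gradient equals the residual
```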

The training process

Let’s go through the whole process of model training again, tying all the pieces together.

First, let's define a few symbols: $M$ is the number of decision trees, and $f_m(x)$ represents the overall model after the $m$-th round of training, so $f_M(x)$ is the final output of the GBDT model.

  1. Initialize $f_1(x)$

    First, we create the first regression tree, that is, $f_1(x)$. In the regression problem it directly fits the target values, so:

    $f_1(x) = T(x;\, \Theta_1)$

  2. Iterate

    I. For regression trees from the 2nd to the $M$-th, we compute the training target of each tree, which is the residual of the previous model:

    $r_{mi} = y_i - f_{m-1}(x_i), \quad i = 1, 2, \dots, N$

    II. For the current $m$-th subtree, we traverse its feasible splitting features and thresholds to find the optimal predicted value $c$ for each leaf, so as to approximate the residual as closely as possible. We write the formula:

    $c_{mj} = \underset{c}{\arg\min} \sum_{x_i \in R_{mj}} L\big(y_i,\; f_{m-1}(x_i) + c\big)$

    Here $R_{mj}$ denotes the $j$-th leaf-node region under the partition of the $m$-th subtree, and $c_{mj}$ is the predicted value of that leaf, i.e., one of the values the $m$-th regression tree can output. The range of $j$ is $1, 2, 3, \dots, J$.

    Then we update:

    $f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$

    where $I$ is the indicator function: if the sample falls in the region $R_{mj}$, then $I = 1$, otherwise $I = 0$.

  3. Finally, we obtain the full regression model:

    $f_M(x) = \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$

The formula above looks a little complicated, but it is just a way of representing the regression trees by means of $c_{mj}$ and the indicator function $I$. What we really want after training is the model's parameters, and the parameter representation of a regression tree is rather cumbersome, so it may look confusing.

We can understand it simply: GBDT uses an additive model to train multiple regression trees, and the prediction is the sum of these regression trees. The training target of each regression tree is the residual of the previous model.
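
The loop above can be sketched in a few lines of Python. This is a minimal from-scratch illustration under assumptions (squared-error loss, toy synthetic data, scikit-learn's DecisionTreeRegressor as the base learner, a fixed tree depth); it is not the implementation from the repository linked at the end. Starting from a zero model means the first tree fits the targets directly, matching step 1, and each subsequent tree fits the residuals of the current model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (assumed for illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 5 + rng.normal(0, 0.3, size=200)

M = 50        # number of regression trees
nu = 0.1      # Shrinkage / learning-rate factor (see the next section)
trees = []

f = np.zeros_like(y)                     # current model prediction
for m in range(M):
    residual = y - f                     # negative gradient of the squared-error loss
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residual)                # the m-th tree is trained on the residuals
    f += nu * tree.predict(X)            # additive update of the model
    trees.append(tree)

def predict(X_new):
    # Final prediction: the (shrunk) sum of all regression trees.
    return sum(nu * t.predict(X_new) for t in trees)

print(np.mean((y - predict(X)) ** 2))    # training MSE after M rounds
```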

Shrinkage

Shrinkage is an optimization technique to keep GBDT from overfitting. Its essence is to reduce how far each iteration moves toward the residual, on the belief that approaching the target gradually with many small steps gives better results than approaching it quickly with a few large steps. Concretely, the output of each regression tree is multiplied by a parameter similar to a learning rate, and the reduced contribution is compensated by increasing the number of regression trees.

In plain English, it is similar to multiplying by the learning rate in gradient descent, except that in gradient descent we clearly know that without the learning rate the optimization can oscillate and fail to converge. In GBDT there is no rigorous proof or clear intuition behind the Shrinkage mechanism; its effect is mostly empirical.

Let's write the update equation with Shrinkage for comparison:

$f_m(x) = f_{m-1}(x) + \nu \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$

Here $\nu$ is our Shrinkage parameter, usually set between 0.001 and 0.01.
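
For comparison, scikit-learn's built-in GBDT exposes the Shrinkage parameter as learning_rate, and a smaller value is typically paired with a larger n_estimators. Below is a minimal sketch on the same kind of toy data; the settings are illustrative, not tuned.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy regression data (assumed for illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 5 + rng.normal(0, 0.3, size=200)

# Shrinkage: each tree's contribution is scaled by learning_rate, and the smaller
# contribution is compensated by training more trees (n_estimators).
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, max_depth=3)
model.fit(X, y)
print(model.score(X, y))   # R^2 on the training data
```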

Conclusion

That concludes the introduction to the basic principles of the GBDT model. If you have read the previous articles on decision trees, understanding GBDT should not be difficult. If you haven't read them or missed them, check the related reading section at the end of this article to review.

GBDT's biggest innovation is to transform the traditional process of adjusting parameters along the gradient into creating a new tree model to approximate the residual. I was deeply impressed when I first saw it. Compared with traditional models, GBDT is less likely to overfit because it aggregates the results of multiple trees, and it fits some complex scenarios better. What we covered today is only the most basic regression problem; in classification the formulas are slightly different, and we will cover that in the next article.

This is the end of today's article. If you liked it, please give me a little support (follow, share, like).

Related reading

Machine learning — Decision tree CART algorithm, one of the top ten data mining algorithms

Machine learning – Opening the door to ensemble methods: implementing the AdaBoost model step by step

GBDT open source: https://github.com/RRdmlearning/Machine-Learning-From-Scratch/tree/master/gradient_boosting_decision_tree