A few words up front

This article has been reposted in many places, and I have noticed that the important parts are often mangled on reposting sites because the formatting does not carry over. Please follow the original link to read the original post, which has the most complete explanation.

LightGBM and XGBoost are classic GBDT models, and there is no shortage of material online covering both their theory and their code. But after reading a few formulas and writing a few lines of code, I always felt a bit hollow, until an interviewer asked me: given a pile of data to process with a GBDT model, how does that data move through the model, and how do we arrive at the final answer? I managed to stammer out an answer, but I felt the interviewer had hit the nail on the head; this question is exactly what you need to understand when genuinely learning the subject. Drawing on a variety of sources and focusing mainly on XGBoost, this post addresses that question. You can click this link to see some of XGBoost's core parameters. For the theory, there are explanations of the GBDT principle, and of course explanations of how XGBoost works; I recommend starting with GBDT.

I will illustrate the process step by step.

  1. First, set some overall goals for the learner, which the general parameters support. Use booster to define whether we take a tree-based or a linear approach as the structure being trained. Use silent to define whether the system should report training details, such as tree depth and leaf counts, in real time. Use nthread to set how many threads are used to run the model.
  2. The next step is the learning task parameters. We define objective: are we using this model for regression or classification, and if classification, multi-class or binary? That is the overall goal. Then there is eval_metric, our evaluation function, which works alongside the loss function. As we all know, the model is optimized in the direction that decreases the value of the loss function, so this matters. In the original XGBoost paper, the objective function consists of two parts: the first part is the loss function, and the second part is the regularization function, more on that later. A minimal parameter sketch follows this list.
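To make this concrete, here is a minimal sketch of how the general parameters and the learning task parameters might be passed to XGBoost's native training API. The toy data, the train/validation split, and the specific values are placeholders I made up for illustration, not something from the original post.

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Hypothetical toy data, just so there is something to train on.
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    # General parameters: the big-picture choices for the learner.
    "booster": "gbtree",    # tree-based boosting, as opposed to "gblinear"
    "nthread": 4,           # number of threads used to run the model
    "verbosity": 1,         # how much the system prints (replaces the older silent flag)
    # Learning task parameters: what the model is actually optimizing.
    "objective": "reg:squarederror",  # regression; e.g. "binary:logistic" for binary classification
    "eval_metric": "rmse",            # evaluation metric reported during training
}

bst = xgb.train(params, dtrain, num_boost_round=100,
                evals=[(dvalid, "valid")], verbose_eval=10)
```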

  1. With the loss function in place, we also need to define the regularization parameters. The regularization function generally measures the complexity of the tree. The principles of regularization can be found at this link:

Three parameters are needed here: gamma, alpha, and lambda. It is worth noting that I went through the original XGBoost paper and various blogs and found no mathematical formula that describes alpha. But it does make sense and exists as an L1 regularization parameter; my guess is that it is zero most of the time, which is why the formulas pay it little attention, so I am bold enough to include it here.
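For reference, this is the regularization term as it appears in the original XGBoost paper, with the alpha (L1) term added on my own initiative as discussed above; here $T$ is the number of leaves of a tree and $w_j$ are the leaf weights:

$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert $$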

First, a tree has $T$ leaf nodes, and the values of those leaf nodes form a $T$-dimensional vector $w$; $T$ and $w$ are exactly the quantities that appear in the regularization function above.

Here I explain the gamma parameter in detail; read this paragraph if you are interested.

gamma, as a regularization parameter, exists to keep $T$ as small as possible: the larger the value of gamma, the more conservative the algorithm. The official parameter description says that gamma is the minimum loss reduction required to make a further partition on a leaf node of the tree. Let me try to explain that description without too many formulas. Once the optimal leaf weights are plugged in, the objective function (loss plus regularization) for a fixed tree structure can actually be written like this:

$$ Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T $$

where $G_j$ and $H_j$ are the sums of the first- and second-order gradients of the samples falling into leaf $j$. The $\gamma T$ term is where gamma penalizes the number of leaves.
XGBoost splits a node only if doing so lowers the objective, that is, if the sum of the objective values of the left and right child nodes is smaller than that of the parent node. Put the other way around, the parent's objective minus the sum of the two children's objectives must be greater than zero. Writing that difference out, with gamma appearing as a constant, we have:

$$ Gain = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda} - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma > 0 $$
The bracketed part of $Gain$ is the difference in the objective function, which is essentially contributed by the reduction in the loss function, and gamma is the constant subtracted from it: the split is made only if the loss reduction exceeds gamma, which is exactly what the parameter description means. I have only sketched this briefly here; the original paper explains it more clearly. If you have any questions or comments, please feel free to leave them in the comments section.
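As a small sanity check on the formula above, here is a tiny sketch of the gain computation for a single candidate split. The gradient statistics are made-up numbers; it only shows how gamma acts as a threshold on the loss reduction, not how XGBoost enumerates candidate splits internally.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into left/right children (structure-score form)."""
    def score(G, H):
        return G * G / (H + lam)
    # Children's scores minus the parent's score, halved, minus the per-leaf penalty gamma.
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Made-up gradient sums for illustration only.
gain = split_gain(G_L=12.0, H_L=30.0, G_R=-8.0, H_R=25.0, lam=1.0, gamma=0.5)
print(gain, "-> split" if gain > 0 else "-> do not split")
```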

  1. Now that the objective function is determined, let us train on the data. The training essentially uses GBDT as its core.
  • First, train a single CART tree on this batch of data. One tree's result is bound to be inaccurate, so after the evaluation function measures the error, we move on to generating the second tree.
  • The target of the second tree is not the original target, but the residual between the target and the prediction of the previous tree. The way the residual is chosen is where the name GBDT comes from: the negative gradient of the objective function is taken as the learning target of the second tree.
  • The learning target of the third tree is the residual left over after the first two trees have learned. In this way, in theory, our total error decreases step by step.
  • … and so on, tree after tree.
  • After n iterations of this, the n_estimators parameter limits the maximum number of trees generated, which is also the maximum number of iterations; once that number is reached, no new trees are generated.
  • Finally, add up the outputs of every tree trained so far to get the final result. Notice that by adding up each tree's values, we are essentially adding a negative gradient of the original objective function at each step, which is the same as subtracting a gradient. So the optimization process of GBDT is really gradient descent; this is probably the core of GBDT. A minimal sketch of this loop follows the list.
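To make the loop above concrete, here is a minimal sketch of gradient boosting with squared-error loss, where the negative gradient is simply the residual. It uses scikit-learn decision trees as base learners and made-up data; it illustrates the idea rather than XGBoost's actual implementation, which also uses second-order gradients and the regularized objective discussed earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=500)  # toy regression target

n_estimators = 100    # maximum number of trees, i.e. maximum number of iterations
learning_rate = 0.1   # step size of each gradient-descent step
max_depth = 3         # maximum depth of each tree

prediction = np.zeros_like(y)  # start from an all-zero prediction
trees = []

for _ in range(n_estimators):
    # For squared-error loss, the negative gradient is exactly the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residual)                           # the new tree learns the current residual
    prediction += learning_rate * tree.predict(X)   # one (shrunken) gradient-descent step
    trees.append(tree)

# The final prediction is the sum of every tree's (scaled) output.
print("training MSE:", np.mean((y - prediction) ** 2))
```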

In this training process, we used: learning_rate, the step size of each gradient-descent iteration; max_depth, the maximum tree depth; colsample_bytree, the fraction of columns randomly sampled when each tree is built; colsample_bylevel, a finer-grained version of the previous one, the fraction of columns sampled at each depth level within each tree; and subsample, the fraction of rows randomly sampled for each training round. Others I use less often, such as max_delta_step, which limits the maximum step size each tree's weight update is allowed to take.
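Putting the parameters from this post together, here is a sketch of a parameter dictionary for xgb.train. The specific values are placeholders chosen for illustration, not tuned recommendations, and dtrain is assumed to be a DMatrix built as in the earlier sketch.

```python
params = {
    "booster": "gbtree",
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    # Regularization (complexity control).
    "gamma": 0.1,              # minimum loss reduction required to split a leaf further
    "lambda": 1.0,             # L2 penalty on leaf weights (alias: reg_lambda)
    "alpha": 0.0,              # L1 penalty on leaf weights (alias: reg_alpha)
    # Tree growth and sampling.
    "eta": 0.1,                # learning rate / shrinkage per tree (alias: learning_rate)
    "max_depth": 6,            # maximum tree depth
    "subsample": 0.8,          # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,   # fraction of columns sampled when building each tree
    "colsample_bylevel": 1.0,  # fraction of columns sampled at each depth level
    "max_delta_step": 0,       # 0 means no constraint on each tree's weight update step
}

# num_boost_round plays the role of n_estimators: the maximum number of trees.
bst = xgb.train(params, dtrain, num_boost_round=200)
```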

These are some of the core, more commonly used parameters.

This article has focused on how the trees are generated and laid out the core ideas; hopefully the interviewer would not be dissatisfied with that.