Refer to the website

https://www.cnblogs.com/mantch/p/11164221.html

1/ What is XGBoost

XGBoost (eXtreme Gradient Boosting) is built on boosting: an iterative, tree-based, additive model. Its strengths are fast training, good accuracy, the ability to handle large-scale data, support for multiple languages, support for custom loss functions, and so on. XGBoost is an open-source machine learning project developed by Tianqi Chen et al. It is an efficient implementation of the GBDT algorithm with many algorithmic and engineering improvements, and it is widely used in Kaggle and many other machine learning competitions with good results. The foundation of XGBoost is the Gradient Boosting Decision Tree (GBDT): GBDT is a machine learning algorithm, while XGBoost is an engineering implementation of that algorithm. Because XGBoost is essentially GBDT pushed for maximum speed and efficiency, it is called X (eXtreme) Gradient Boosting. Both XGBoost and GBDT use the boosting ensemble learning framework.

2/ XGBoost: algorithm ideas and logic

<1> Definition of the XGBoost tree

For example, suppose we want to predict how much each member of a family likes to play video games. Intuitively, the young are more likely than the old to like video games, and males are more likely than females, so we first split by age into children and adults, then split by gender into male and female, and assign each person a score for how much they like video games. As shown in the figure below:

In this way two trees, tree1 and tree2, are trained, similar in principle to GBDT: the outputs of the two trees are added together to give the final conclusion. So the boy's predicted score is the sum of the scores of the leaf nodes he falls into in the two trees: 2 + 0.9 = 2.9. Likewise, grandpa's predicted score is -1 + (-0.9) = -1.9. You might jump up and exclaim: isn't this exactly GBDT? In fact, leaving aside some differences in engineering implementation and problem solving, one of the biggest differences between XGBoost and GBDT is the definition of the objective function. The objective function of XGBoost is shown below:
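The original figure is not reproduced here; as a reconstruction of the standard form it refers to, the objective minimized when the t-th tree f_t is added is:

```latex
% Objective when adding the t-th tree: training loss plus the complexity of f_t
Obj^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant}
```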

All these formulas might make some readers dizzy. I only give a brief explanation here; for the algorithmic details and the derivation of the formulas, please see this very thorough blog post: "A popular understanding of XGBoost, the big killer of Kaggle competitions".

<2> Loss function loss()

The core algorithmic idea of XGBoost is not difficult. It is basically:

<1> Keep adding trees, and keep splitting on features to grow each tree. Every time a tree is added, we are in fact learning a new function f(x), and each tree fits the residual of the previous prediction. For example, the initial learner F0(x) can simply take the mean of all the training targets to fit the true values.

<2> After training is complete and k trees have been obtained, we predict the score of a sample (unseen data) as follows: according to the sample's features, it falls into one leaf node in each tree, and each leaf node corresponds to a score.

<3> Finally, we only need to add up the scores of the corresponding leaf nodes across all trees to obtain the sample's predicted value.

Obviously, the goal is to make the final predicted value as close to the true value as possible (minimum error) while keeping the generalization ability as strong as possible (to prevent overfitting). Similar to the earlier GBDT procedure, XGBoost also accumulates the scores of multiple trees to obtain the final predicted score: in each iteration, a new tree is added on top of the existing trees to fit the residual between the previous prediction and the true value, as in the sketch below.
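A minimal sketch of this residual-fitting loop, built from scikit-learn trees with squared error (plain gradient boosting, assuming numpy and scikit-learn are available; not the actual XGBoost implementation):

```python
# A minimal sketch of additive training by residual fitting.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())   # F0(x): the mean of the targets
trees = []

for _ in range(n_trees):
    residual = y - prediction                      # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)  # add the new tree's scores

# Predicting a new sample: start from the mean and sum every tree's leaf score.
x_new = np.array([[1.0]])
y_new = y.mean() + learning_rate * sum(t.predict(x_new)[0] for t in trees)
print(y_new)
```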

So how do we choose which new function f(x) to add in each round? The answer is straightforward: pick the f that reduces our objective function as much as possible. To make this tractable, the loss is approximated with a second-order Taylor expansion around the current prediction.
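Written out (a reconstruction of the standard derivation; the original post shows this as images), with g_i and h_i the first and second derivatives of the loss with respect to the previous prediction:

```latex
% Second-order Taylor approximation of the objective around the previous prediction
Obj^{(t)} \approx \sum_{i=1}^{n}\left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right),\quad
h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\!\left(y_i, \hat{y}^{(t-1)}\right)
```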

In essence, each assignment of samples to leaves corresponds to a value of the objective obj, and the optimization process is the optimization of obj: splitting nodes yields different combinations of leaves, and different combinations correspond to different obj values. All of the optimization revolves around this idea. So far we have discussed the first part of the objective function, the training error (the loss function). Next we discuss the second part, the regularization term, which is where the complexity of a tree is defined.

<3> Regular term: defines the complexity of the tree

The complexity of a tree in XGBoost consists of two parts: <1> the number of leaf nodes in the tree, and <2> the squared L2 norm of the leaf scores w. Applying L2 regularization to the scores w is equivalent to adding L2 smoothing to the score of each leaf node, in order to avoid overfitting.
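Written out (a reconstruction of the standard XGBoost regularization term, since the original figure is not reproduced here), with T the number of leaves and w_j the score of leaf j:

```latex
% Complexity of one tree: number of leaves T plus the squared L2 norm of the leaf scores w_j
\Omega(f) = \gamma\, T + \frac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```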

Let's take a look at XGBoost's full objective function (the loss function reflects the training error, and the regularization term defines the model complexity), as shown below:
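The original image is not reproduced here; the standard form of the objective it refers to is:

```latex
% Training loss over all n samples plus the complexity of all K trees
Obj = \sum_{i=1}^{n} l\!\left(y_i,\ \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)
```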

The regularization term is the latter half of the objective function above. In that formula, ŷ_i is the output of the whole additive model, and the regularization term Σ_{k=1}^{K} Ω(f_k) measures the complexity of the trees: the smaller its value, the lower the complexity and the stronger the generalization ability.
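For completeness (a reconstruction of the standard derivation, not reproduced in the original images): plugging the Taylor approximation and Ω into the objective and grouping samples by the leaf j they fall into, with G_j = Σ_{i∈I_j} g_i and H_j = Σ_{i∈I_j} h_i, gives the optimal leaf score and the structure score used to evaluate a tree:

```latex
% Optimal score of leaf j and the resulting structure score of a fixed tree structure
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma\, T
```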

<4> How should a tree grow

We have now walked through how XGBoost is optimized and computed from beginning to end, but we have not yet seen what a tree actually looks like. Clearly, a tree is created by splitting one node into two, and then splitting again and again until the whole tree is grown. So how a tree splits is the key question. For splitting a node, the XGBoost authors give a method in the original paper: greedily enumerate the different candidate tree structures, use the scoring function to find the best one, add it to the model, and repeat. This search uses a greedy algorithm: select a feature and split point and compute the resulting loss, do the same for the other candidates, and after enumerating them all, take the split with the best value and split the node. In summary, XGBoost follows the same idea as the CART regression tree, using a greedy algorithm to traverse all split points of all features, but with a different objective function. Concretely, the split gain is the objective value of the single (unsplit) leaf node minus the objective value after splitting. To keep trees from growing too deep, a threshold is also added, and a split is made only when the gain exceeds this threshold. Splitting continues until a tree is formed, and then another tree is built, each time fitting on top of the best prediction so far. A minimal sketch of the gain computation used to score a split is given below.
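As a concrete illustration, here is a minimal sketch (my own reconstruction of the gain formula from the XGBoost paper, using the G, H, λ, γ notation above; not the library's internal code) of scoring one candidate split:

```python
# A minimal sketch of the split gain used to score a candidate split.
# g and h are the per-sample first and second derivatives of the loss;
# lam (lambda) and gamma are the regularization parameters from Omega.
import numpy as np

def leaf_objective(G, H, lam):
    # contribution of one leaf to the structure score: -1/2 * G^2 / (H + lambda)
    return -0.5 * G * G / (H + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    G, H = g.sum(), h.sum()
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = G - GL, H - HL
    # gain = objective of the single leaf minus objective after splitting,
    # minus the gamma penalty for adding one more leaf
    before = leaf_objective(G, H, lam)
    after = leaf_objective(GL, HL, lam) + leaf_objective(GR, HR, lam)
    return (before - after) - gamma

# Toy example: squared error loss, so g = prediction - y and h = 1.
y = np.array([1.0, 1.2, 3.0, 3.1])
pred = np.zeros_like(y)
g, h = pred - y, np.ones_like(y)
x = np.array([0.1, 0.2, 0.8, 0.9])
print(split_gain(g, h, left_mask=(x < 0.5)))  # positive gain: the split helps
```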

<5> When do we stop generating trees

This kind of iterative loop must have a stopping condition. When should it stop? In short: limit the maximum depth of the tree, and stop growing when the sum of sample weights falls below a set threshold, to prevent overfitting. To be specific:

<1> When the gain brought by a candidate split is less than a set threshold, we ignore that split; so not every candidate split is actually made, which is a form of pre-pruning. The threshold parameter is γ, the coefficient of the number of leaf nodes T in the regularization term.

<2> When the tree reaches its maximum depth, tree construction stops. The hyperparameter max_depth is set to avoid the overfitting caused by learning local samples with an overly deep tree.

<3> When the sum of sample weights in a node is less than a set threshold, tree building also stops. This minimum sample weight sum, min_child_weight, is similar but not identical to GBM's min_child_leaf parameter. The general idea is that a leaf node with too few samples is not split further, again to prevent overfitting.

The sketch after this list shows where these controls appear as hyperparameters in the xgboost library.
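A minimal sketch, assuming the xgboost Python package is installed (parameter values are arbitrary examples, not recommendations):

```python
# Stopping-related hyperparameters in the xgboost Python package.
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.rand(100)

model = xgb.XGBRegressor(
    n_estimators=50,       # number of trees to build
    max_depth=4,           # <2> stop splitting once a tree reaches this depth
    gamma=0.1,             # <1> minimum gain required to make a split
    min_child_weight=1.0,  # <3> minimum sum of instance weights (Hessians) in a leaf
)
model.fit(X, y)
print(model.predict(X[:3]))
```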

3/ How is XGBoost different from GBDT

In addition to some algorithmic differences from traditional GBDT, XGBoost also includes many engineering optimizations. In general, the differences and connections between the two can be summarized in the following points:

<1> GBDT is a machine learning algorithm, while XGBoost is an engineering implementation of that algorithm.

<2> When the CART regression tree is used as the base classifier, XGBoost explicitly adds a regularization term to control the complexity of the model, which helps prevent overfitting and improves the generalization ability of the model. In other words, GBDT overfits more easily because it has no regularization term controlling model complexity.

<3> GBDT uses only the first-order derivative of the loss function during training, whereas XGBoost performs a second-order Taylor expansion of the loss function and can use the first- and second-order derivatives at the same time (see the sketch after this list).

<4> Traditional GBDT uses CART regression trees as base classifiers; XGBoost supports several types of base classifier, such as linear classifiers.

<5> Traditional GBDT uses all of the data in each iteration, while XGBoost adopts a strategy similar to random forests and supports sampling of the data.

<6> Traditional GBDT has no built-in handling of missing values, while XGBoost can automatically learn a strategy for handling missing values.
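To illustrate point <3>, here is a minimal sketch of a custom objective for the xgboost Python package, which must return both the first-order gradient and the second-order Hessian for every sample (squared error is used purely as an example and is equivalent to the built-in objective):

```python
# A custom objective for xgboost: the library asks for both the gradient and
# the Hessian, reflecting its second-order Taylor expansion of the loss.
import numpy as np
import xgboost as xgb

def squared_error(predt: np.ndarray, dtrain: xgb.DMatrix):
    y = dtrain.get_label()
    grad = predt - y          # first-order derivative of 0.5 * (predt - y)^2
    hess = np.ones_like(y)    # second-order derivative is the constant 1
    return grad, hess

X = np.random.rand(100, 4)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=20, obj=squared_error)
print(booster.predict(dtrain)[:3])
```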