- By Han Xinzi @Showmeai
- Tutorial address: www.showmeai.tech/tutorials/3…
- This article's address: www.showmeai.tech/article-det…
- Statement: all rights reserved; for reproduction please contact the platform and the author, and indicate the source
Introduction
XGBoost, which stands for eXtreme Gradient Boosting, is a very powerful boosting algorithm toolkit. Its excellent performance (effectiveness and speed) has kept it at the top of data science competition leaderboards for a long time, and it is still the first choice in many production machine learning solutions at major companies.
XGBoost excels in parallel computing efficiency, missing-value handling, overfitting control, and generalization ability. In this article, we give you a detailed introduction to XGBoost, covering both "algorithm principles" and "engineering implementation".
(Parts of this XGBoost article assume machine learning basics and knowledge of decision trees, regression trees, and GBDT. Readers without this background can first refer to the ShowMeAI articles Illustrated Machine Learning | Machine Learning Basics, Decision Tree Model Explained, Regression Tree Model Explained, and Illustrated Machine Learning | GBDT Model Explained.)
1. A visual explanation of the algorithm principles
Tianqi Chen, the author of XGBoost, has given a detailed and systematic introduction to the principles of XGBoost; we draw on that material to explain them here.
1) Some important concepts in supervised learning
Before we get into XGBoost, let's review some important concepts in machine learning.
(1) Core elements of supervised learning
Notation: $x_i \in R^d$ denotes the $i$-th sample in the training set.
Model: how do we predict $\hat{y}_i$ for a given $x_i$?
For a linear model, $\hat{y}_i = \Sigma_{j} w_j x_{ij}$. The predicted value $\hat{y}_i$ can be interpreted differently depending on the task:
- Linear regression: $\hat{y}_i$ is the predicted score.
- Logistic regression: $1/(1+e^{-\hat{y}_i})$ predicts the probability of the instance being positive.
- Others: for example, $\hat{y}_i$ can be a ranking score in ranking tasks.
Parameters: the things we learn from the data.
- $\Theta = \left\{ w_j \mid j = 1, 2, \dots, d \right\}$.
Objective function: $Obj(\Theta) = L(\Theta) + \Omega(\Theta)$
- $L(\Theta)$ is the training loss function, indicating how well the model fits the training data.
- $\Omega(\Theta)$ is the regularization term, measuring the complexity of the model.
Training loss: $L = \Sigma_{i=1}^{n} l(y_i, \hat{y}_i)$
- Square loss: $l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$
- Logistic loss: $l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i)\ln(1 + e^{\hat{y}_i})$
Regularization term: describes the complexity of the model.
- L1 norm (lasso): $\Omega(w) = \lambda \|w\|_1$
- L2 norm: $\Omega(w) = \lambda \|w\|^2$
(2) More advanced supervised-learning concepts
Ridge regression: $\Sigma_{i=1}^{n}(y_i - w^T x_i)^2 + \lambda \|w\|^2$
- Ridge regression is a linear model with square loss, and the regularization term is the L2 norm.
Lasso: $\Sigma_{i=1}^{n}(y_i - w^T x_i)^2 + \lambda \|w\|_1$
- Lasso is a linear model with square loss, and the regularization term is the L1 norm.
Logistic regression: $\Sigma_{i=1}^{n}\left(y_i \ln(1 + e^{-w^T x_i}) + (1 - y_i)\ln(1 + e^{w^T x_i})\right) + \lambda \|w\|^2$
- Logistic regression is a linear model with logistic loss, and the regularization term is the L2 norm.
(3) Objective function and the bias-variance tradeoff
Recall the objective function $Obj(\Theta) = L(\Theta) + \Omega(\Theta)$. Why does the objective consist of two parts?
- Optimizing the training loss helps build the predictive model: fitting the training data well at least brings you closer to the underlying distribution of the data.
- Optimizing the regularization term helps build simple models: the simpler the model, the smaller the variance of future predictions, and the more stable the predictions.
2) Regression Tree and Ensemble
(1) Regression Tree
Regression tree, also known as CART (Classification And Regression Tree):
- Decision rules are the same as decision trees
- Each leaf node has a value
(2) Regression Tree Ensemble
As you can see from the left picture, there are five training samples.
As can be seen from the middle figure above, each leaf node has a predicted value: 2 for the first leaf node, 0.1 for the second leaf node and -1 for the third leaf node.
- The boy was assigned to the first leaf node, so the boy’s predicted value was 2;
- The youngest daughter was assigned to the second leaf node, and her predicted value was 0.1;
- The remaining three people are assigned to the third leaf node, and they all have a value of -1.
The final predicted value is the sum of the predicted values of the leaf nodes in each tree.
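The prediction rule — sum the leaf values assigned by each tree — can be sketched with two hard-coded toy trees (the splits and leaf values are illustrative, echoing the figure):

```python
# Each toy "tree" is just a function mapping a sample to its leaf value.
def tree1(person):
    # Splits on age, then gender (thresholds and leaf values are made up)
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0

def tree2(person):
    # A second tree splitting on daily computer use
    return 0.9 if person["uses_computer_daily"] else -0.9

def predict(person, trees):
    # Ensemble prediction = sum of each tree's leaf value
    return sum(t(person) for t in trees)

boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

boy_score = predict(boy, [tree1, tree2])          # 2.0 + 0.9
grandpa_score = predict(grandpa, [tree1, tree2])  # -1.0 + (-0.9)
```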
(3) Tree ensemble methods
Tree ensemble methods are used very widely, e.g., GBM and random forests (see the ShowMeAI articles Illustrated Machine Learning | GBDT Model Explained and Illustrated Machine Learning | Random Forest Classification Model Explained). As many as half of data mining competitions are won with some variant of a tree ensemble method.
- Invariant to the scale of the inputs, so careful feature standardization is unnecessary.
- Can learn higher-order interactions (crosses) between features.
- Scalable, and used across a wide range of industries.
(4) The boosted tree model and its parameters
Model: suppose we have $K$ trees: $\hat{y}_i = \Sigma_{k=1}^{K} f_k(x_i), \; f_k \in F$. Here $F$ is the function space containing all regression trees.
- A regression tree is a function that maps attributes to scores.
Parameters: include the structure of each tree and the scores in its leaf nodes. Or, treating functions as parameters: $\Theta = \left\{f_1, f_2, \dots, f_K\right\}$.
- Here we are not learning weights in $R^d$; we are learning functions (trees).
(5) Learning a tree on a single variable
A single variable means a single feature. By studying how a tree is learned on a single variable, we can get a simple picture of how tree learning works.
For example: suppose the regression tree has only one input variable $t$ (time), and we want to predict whether someone likes romantic music at time $t$.
As the top-left figure shows: when he was single, he was not interested in romantic music; after he found a girlfriend, his liking for romantic music rose with his surging hormones; after a breakup, his love for romantic music declined again. We can also see that the regression tree in the top-right figure can easily express the behavior in the top-left figure.
To build the tree on the top right, we need to learn two things:
- 1. The point of splitting;
- 2. The height on each section.
Objective for a univariate regression tree (a step function):
- Training loss: how well does the function fit the points?
- Regularization: how do we define the complexity of the function? (Perhaps the number of split points, plus an L2 norm on the segment heights?)
- Figure (1) is the scatter plot of his liking for romantic music over time;
- Figure (2) has too many partitions and a very complex model, so its $\Omega(f)$ is very high;
- Figure (3) fits the data poorly, so its $L(f)$ is very high;
- Figure (4) is the best: both fit and complexity are appropriate.
(6) The general case
First, let's review our definition of boosted trees above:
Model: suppose we have $K$ trees: $\hat{y}_i = \Sigma_{k=1}^{K} f_k(x_i), \; f_k \in F$
Objective function: $Obj = \Sigma_{i=1}^{n} l(y_i, \hat{y}_i) + \Sigma_{k=1}^{K} \Omega(f_k)$
- $\Sigma_{i=1}^{n} l(y_i, \hat{y}_i)$ is the cost (loss) term
- $\Sigma_{k=1}^{K} \Omega(f_k)$ is the regularization term, representing the complexity of the trees: the more complex the trees, the larger its value (how the regularization term is defined will be covered in detail later).
When we talk about decision trees, it is usually in heuristic terms:
- Split by information gain
- Prune the tree
- Limit the maximum depth
- Smooth the leaf values
Most heuristics map well onto the objective function; taking the formal (objective) view tells us what we are actually learning:
- Information gain → training loss
- Pruning → regularization by the number of nodes
- Maximum depth → a constraint on the function space
- Smoothing leaf values → L2 regularization on the leaf weights
A regression tree ensemble defines how the prediction is produced; it can do not only regression but also classification and ranking, depending on how the "objective function" is defined:
- With square loss: we get the gradient boosting machine for regression problems.
- With log loss: we get LogitBoost for classification.
3) Gradient Boosting
In this section we formally study gradient boosting. You can contrast XGBoost's treatment here with the GBDT model (see the ShowMeAI article Illustrated Machine Learning | GBDT Model Explained) to understand the core differences.
(1) Solution
Objective function: $Obj = \Sigma_{i=1}^{n} l(y_i, \hat{y}_i) + \Sigma_{k=1}^{K} \Omega(f_k)$
When learning GBDT-style trees, we cannot use SGD, because what we learn are trees rather than numerical vectors — that is, we move from the familiar parameter space to function space.
- Parameter space: learning the weights of the model.
- Function space: learning the function $f$, including both its structure and its leaf weights.
Solution: initialize the prediction and add a new function $f$ at each iteration: $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$
- $\hat{y}_i^{(t)}$ is the prediction at iteration $t$.
- $\hat{y}_i^{(t-1)}$ is the prediction at iteration $t-1$.
- $f_t(x_i)$ is the $t$-th tree, which is what we need to learn at iteration $t$.
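The additive scheme above can be mimicked in a few lines of plain Python: start from a constant prediction of 0 and, at each round, add a new function fitted to the current residuals. Here the new function is a depth-1 stump fitted under square loss — a deliberately minimal sketch with made-up data, not XGBoost's actual training loop:

```python
# Toy 1-D data (assumed values): x is the single feature, y the target
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.2, 3.0, 3.2]

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (a stump) to the residuals under square loss."""
    best = None
    for cut in sorted(set(xs)):
        left = [r for xi, r in zip(xs, residuals) if xi < cut]
        right = [r for xi, r in zip(xs, residuals) if xi >= cut]
        if not left or not right:
            continue
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, wl, wr)
    _, cut, wl, wr = best
    return lambda xi: wl if xi < cut else wr

preds = [0.0] * len(y)                    # y_hat^(0) = 0
for t in range(3):                        # three boosting rounds
    residuals = [yi - pi for yi, pi in zip(y, preds)]
    f_t = fit_stump(x, residuals)         # the new function added this round
    preds = [pi + f_t(xi) for pi, xi in zip(preds, x)]  # y_hat^(t) = y_hat^(t-1) + f_t(x_i)

loss = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))  # shrinks round by round
```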
(2) Objective function transformation
Step 1: According to the above formula, the objective function can be transformed as follows
Here we consider the squared loss, at which point the objective function can be transformed into:
Step 2: our goal is to find the $f_t$ that minimizes the objective function. However, after the initial transformation above the objective is still very complex and contains quadratic terms.
Here we introduce the Taylor expansion formula:
- Let $f(x) = \Sigma_{i=1}^{n} l(y_i, \hat{y}_i^{(t)})$
- Let $\Delta x = f_t$
Using Taylor’s expansion, the objective function can be written as:
Step 3: taking out the constant terms, the objective function can be simplified to:
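After this simplification the objective depends on the data only through $g_i$ and $h_i$ — the first and second derivatives of the loss at the previous round's prediction. A small sketch for the two losses discussed earlier:

```python
import math

def grads_square(y, y_prev):
    """g_i, h_i for square loss l = (y - y_hat)^2, derivatives taken w.r.t. y_hat."""
    g = 2.0 * (y_prev - y)   # first derivative
    h = 2.0                  # second derivative (constant)
    return g, h

def grads_logistic(y, y_prev):
    """g_i, h_i for logistic loss, y_prev being the raw (pre-sigmoid) score."""
    p = 1.0 / (1.0 + math.exp(-y_prev))  # sigmoid of the current margin
    g = p - y
    h = p * (1.0 - p)
    return g, h

g, h = grads_square(y=3.0, y_prev=2.5)   # g = 2*(2.5-3.0) = -1.0, h = 2
```

Swapping the loss only changes how $g_i$ and $h_i$ are computed; everything downstream stays the same.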
Think about it: why make all these transformations instead of just growing trees directly?
- Theoretical benefit: we know exactly what we are learning, and when it converges.
- Engineering benefit: it recasts the problem in terms of the elements of supervised learning.
- $g_i$ and $h_i$ come entirely from the definition of the loss function.
- The learning of the function depends on the objective only through $g_i$ and $h_i$.
- Think about how you would separate code modules when asked to implement boosted trees for both square loss and logistic loss.
(3) Redefine the tree
Previously, we used $f_t(x)$ to represent a tree. In this section, we redefine the tree: a tree is defined by a vector of scores on its leaves plus an index-mapping function that maps each instance to a leaf (a bit abstract; see the figure below).
- $w$ is the vector of leaf weights of the tree
- $q$ represents the tree structure
As can be seen from the figure above, there are three leaf nodes. The weight of the first leaf node is +1, the weight of the second leaf node is 0.1, and the weight of the third leaf node is -1. The little boy belongs to the first leaf, and the grandmother belongs to the third leaf.
(4) Define the complexity of the tree
The complexity of the tree is defined by the following formula (the definition is not unique): $\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \Sigma_{j=1}^{T} w_j^2$
- $\gamma T$ penalizes the number of leaves $T$ of the tree
- $\frac{1}{2}\lambda\Sigma_{j=1}^{T} w_j^2$ is the L2 norm of the leaf scores
(5) Re-examine the objective function
The set of instances falling into leaf node $j$ is defined as: $I_j = \{ i \mid q(x_i) = j \}$
Redefine the objective function for each leaf:
- This is the sum of T independent quadratic functions
(6) Calculate the value of leaf node
Some background knowledge first. For the univariate quadratic function $Gx + \frac{1}{2}Hx^2$ (with $H > 0$), we can easily obtain its minimum value and the minimizing $x$.
- The minimizing $x$ is obtained by taking the derivative of $Gx + \frac{1}{2}Hx^2$ and setting it to zero: $G + Hx = 0$, hence $x = -G/H$.
- At $x = -G/H$, the corresponding value of $Gx + \frac{1}{2}Hx^2$ is $-\frac{1}{2} \cdot G^2/H$.
That is:
How do we find the optimal leaf values? We continue transforming the objective function:
- Define $G_j = \Sigma_{i \in I_j} g_i$
- Define $H_j = \Sigma_{i \in I_j} h_i$
Assuming the tree structure $q(x)$ is fixed, the optimal weight of each leaf node is $w_j^* = -\frac{G_j}{H_j + \lambda}$
The optimal value of the objective function is $Obj^* = -\frac{1}{2}\Sigma_{j=1}^{T}\frac{G_j^2}{H_j + \lambda} + \gamma T$
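Both the optimal leaf weight and the tree score are one-liners once $G_j$ and $H_j$ are known; the numbers below are toy values, not from any real dataset:

```python
def leaf_weight(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return -G / (H + lam)

def tree_score(Gs, Hs, lam, gamma):
    # Obj* = -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    T = len(Gs)
    return -0.5 * sum(G * G / (H + lam) for G, H in zip(Gs, Hs)) + gamma * T

w = leaf_weight(G=4.0, H=2.0, lam=1.0)                      # -4/3
obj = tree_score([4.0, -2.0], [2.0, 2.0], lam=1.0, gamma=0.5)  # two-leaf toy tree
```

Note how a larger $\lambda$ shrinks the leaf weight toward zero, and each extra leaf costs $\gamma$.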
Below is a practical example of the formula explained above.
To summarize again: we have transformed the objective so that it depends only on the known quantities $G$, $H$, $\gamma$, $\lambda$ and $T$, eliminating the earlier variable $f_t$ and the need to enumerate the score of every leaf. The question now: all of this assumed a fixed tree structure, but there are countless possible structures — how do we determine one?
(7) Growing the tree with a greedy algorithm
In the last section we assumed the structure of the tree was fixed, but in fact there are infinitely many possible tree structures. In this section we use a greedy algorithm to grow the tree:
- Start with a tree of depth 0 (a single root node, which is also a leaf)
- For each leaf of the tree, try to split it (creating two new leaves; the original leaf is no longer a leaf). We want the objective after the split to be smaller than before, so the gain is the objective before the split minus the objective after:

$Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$
- $\frac{G_L^2}{H_L+\lambda}$ is the score of the left leaf
- $\frac{G_R^2}{H_R+\lambda}$ is the score of the right leaf
- $\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}$ is the score before the split
- $\gamma$ is the complexity cost introduced by the extra leaf
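The gain can be evaluated directly from these four statistics; a minimal sketch with toy values:

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain = 1/2 [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Toy statistics: opposite-signed gradients on each side make a worthwhile split
gain = split_gain(GL=-4.0, HL=2.0, GR=4.0, HR=2.0, lam=1.0, gamma=0.5)
```

If the gain came out negative, the split would be worse than paying the $\gamma$ cost of the extra leaf.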
The next thing to consider is how to find the optimal split point.
For example, if $x_j$ is age, what is the gain when the split point is $a$?
All we need to do is compute $G$ and $H$ on each side and evaluate the gain.
A linear left-to-right scan of the sorted instances is sufficient to determine the optimal split point for the feature.
So, the way to split a node is:
- For each node, enumerate all features
- For each feature, sort the instances by feature value
- Use a linear scan to find the best split point for that feature
- Take the best split over all features
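The steps above can be sketched for a single feature: sort once, then sweep left to right while maintaining running sums $G_L$, $H_L$ (toy statistics; $\gamma = 0$ by default):

```python
def best_split(xs, gs, hs, lam=1.0, gamma=0.0):
    """Exact greedy scan over one feature: sort, then sweep left-to-right."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    G, H = sum(gs), sum(hs)
    GL = HL = 0.0
    best_gain, best_cut = 0.0, None
    for k in order[:-1]:                  # a split must leave both sides non-empty
        GL += gs[k]                       # running left-side sums
        HL += hs[k]
        GR, HR = G - GL, H - HL           # right side by subtraction
        gain = 0.5 * (GL * GL / (HL + lam) + GR * GR / (HR + lam)
                      - G * G / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_cut = gain, xs[k]   # split right after this value
    return best_gain, best_cut

# Toy data: negative gradients on the left, positive on the right
g, c = best_split([1.0, 2.0, 3.0, 4.0], [-1.0, -1.0, 1.0, 1.0],
                  [1.0, 1.0, 1.0, 1.0], lam=0.0)
```

One sort plus one linear pass per feature is exactly what gives the $O(n \log n)$ per-level cost discussed next.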
Time complexity:
- For a tree with $d$ features and depth $K$, the time complexity is $O(dKn\log n)$: each level takes $O(n\log n)$ for sorting.
- This can be optimized further (e.g., by using approximation, or caching the sorted features).
- It scales to very large datasets.
(8) How to handle categorical variables
Some tree-learning algorithms handle categorical and continuous variables separately, but the scoring formula we derived is easy to apply to categorical variables as well. In fact, there is no need to treat categorical variables specially:
We can handle categorical variables with one-hot encoding:
If there are many categories the matrix becomes very sparse, and the algorithm handles sparse data efficiently.
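A minimal one-hot encoder in plain Python (an illustration, not what the XGBoost library does internally):

```python
def one_hot(values):
    """Map a list of categorical values to one-hot rows."""
    categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)   # all zeros...
        row[index[v]] = 1             # ...except the column for this category
        rows.append(row)
    return categories, rows

cats, X = one_hot(["red", "green", "red", "blue"])
# each row has exactly one 1; with many categories, mostly zeros (sparse)
```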
(9) Pruning and regularization
Recall the gain above: when the reduction in training loss is smaller than the regularization cost, the gain may be negative:
This is the tradeoff between simplicity and predictability of the model.
- Pre-stopping: stop as soon as the best split has negative gain; however, a split may enable a good future split;
- Post-pruning: grow the tree to its maximum depth, then recursively prune away all leaf splits with negative gain.
2. XGBoost core principles: summary and analysis
1) Objective function and Taylor expansion
XGBoost is also a boosting additive model; each iteration optimizes only the submodel of the current step. At step $m$, we have: $\hat{y}_i^{(m)} = F_{m-1}(x_i) + f_m(x_i)$
- $f_m(x_i)$ is the submodel of the current step.
- $F_{m-1}(x_i)$ denotes the first $m-1$ submodels, which are already trained and fixed.
The objective function is designed as "empirical risk" + "structural risk" (the regularization term):
- The regularization term $\Omega(f)$ represents the complexity of the submodel $f$ and is used to control overfitting.
Mathematically, Taylor's formula lets us approximate $f(x)$ around a point; XGBoost approximates the loss function with a second-order expansion.
For more math, read ShowMeAI’s basic AI Math: From Beginner to Master series.
Treating $F_{m-1}(x_i)$ as $x_0$, $f_m(x_i)$ as $\Delta x$, and $l(y_i, \hat{y}_i)$ as a function of $\hat{y}_i$, we get:
Since the first $m-1$ submodels are already determined, every part of the expression above except $f_m(x)$ is a constant and does not affect the minimization over $f_m(x)$. The objective function can be transformed into:
- Here $l$ denotes the loss function, which measures how good a prediction is
- Given $F_{m-1}(x)$, $g_i$ and $h_i$ can be computed easily for each sample point $i$
2) Regularization of XGBoost
In fact, XGBoost's base learner supports both decision trees and linear models; here we discuss only the more common tree-based case. To prevent overfitting, XGBoost sets the tree's complexity as the regularization term: $\Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2$
- $T$ is the number of leaf nodes of the tree $f$
- $w$ is the vector of output values of all leaf nodes; $\|w\|^2$ is the squared L2 norm (length) of that vector
- $\gamma$ and $\lambda$ are hyperparameters
For a regression tree, the more leaf nodes and the larger the output regression values, the higher the complexity of the tree.
The final objective function is as follows:
The following mathematical transformation merges the regularization term and the empirical-risk term: we convert the empirical risk, summed at the sample level, into a sum at the leaf level.
Define the set of samples on leaf node $j$ as $I(j) = \{x_i \mid q(x_i) = j\}$, where $q(x_i)$ is the function mapping a sample to its leaf node index; the output value of leaf node $j$ is $w_j = f_m(x_i), \; i \in I(j)$.
To simplify the expression further, let $\sum_{i \in I(j)} g_i = G_j$ and $\sum_{i \in I(j)} h_i = H_j$; note that $G$ and $H$ are functions of $j$:
In this form we can see that once the structure of a tree is determined, the samples $(x_i, y_i, g_i, h_i)$ in each node are also determined, i.e., $G_j$, $H_j$ and $T$ are determined. The regression value output by each leaf node should minimize the expression above; the extremum of the quadratic is: $w_j^* = -\frac{G_j}{H_j + \lambda}$
With the leaf values set by this rule, the resulting objective value, namely the score of the tree, is as follows; the smaller the value, the better the tree structure. Observing the equation below, the tree's score can also be understood as the sum of the scores of all its leaf nodes:
3) Node splitting criterion
In our previous article Detailed Explanation of the Decision Tree Model, we explained that a decision tree grows by recursion; XGBoost's submodel trees are the same, relying on a greedy node-splitting criterion to grow the tree.
(1) Greedy criterion
XGBoost grows its subtrees with the same basic approach as CART: sort the feature values and traverse the candidate split points, taking the best split gain as that feature's gain; then select the feature with the best gain as the node's split feature, and split into left and right subtrees at the optimal point.
The figure above shows a node-splitting process. Naturally, the split gain is the score of tree A minus the score of tree B. The scores of the leaf nodes outside the dashed boxes (the nodes not involved in the split) cancel out, leaving only the post-split L and R nodes and the pre-split S node to compare. The split gain can therefore be expressed as:
(2) Approximation algorithm
For performance reasons, XGBoost also provides an approximate version of the greedy criterion; in a nutshell, it "uses feature quantiles as split candidates". The candidate set thus shrinks from all sample values to a handful of quantiles.
In terms of when candidates are proposed, there are two strategies, global and local:
- Global is selected from the eigenvalues of all samples, which can be conducted once before the root node splitting.
- Local is selected from the eigenvalues of the samples contained in the nodes to be split, which should be carried out before each node is split.
Comparing the two: since global proposes candidates only once, it needs finer granularity.
The original XGBoost paper ran experiments on the Higgs boson dataset, comparing test-set AUC under three configurations: the exact greedy criterion, global approximation, and local approximation, with eps denoting the sketch granularity ($eps = 0.25$ means the data is divided into $1/0.25 = 4$ buckets). It found that global with $eps = 0.05$ and local with $eps = 0.3$ both achieve almost the same performance as the exact greedy criterion.
The XGBoost toolkit supports the three types of configuration mentioned above.
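The eps granularity can be sketched with NumPy: $eps = 0.25$ gives $1/0.25 = 4$ buckets, i.e. the three interior quantiles as candidate split points. This is a plain, unweighted sketch on made-up data; XGBoost's actual sketch is weighted, as discussed in the next subsection:

```python
import numpy as np

def candidate_splits(feature_values, eps):
    """Propose interior quantiles as split candidates: roughly 1/eps buckets."""
    n_buckets = int(round(1.0 / eps))
    qs = np.arange(1, n_buckets) / n_buckets  # e.g. [0.25, 0.5, 0.75]
    return np.quantile(feature_values, qs)

x = np.arange(1, 101, dtype=float)            # toy feature: values 1..100
cands = candidate_splits(x, eps=0.25)         # three interior quantiles
```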
(3) Weighted quantile
Looking at the objective function $Obj = \sum_{i=1}^{N}\left[g_i f_m(x_i) + \frac{1}{2} h_i f_m^2(x_i)\right] + \Omega(f_m)$, setting the partial derivative to zero gives $f_m^*(x_i) = -\frac{g_i}{h_i}$. This objective can thus be understood as a squared loss with $h_i$ as the weight and $-\frac{g_i}{h_i}$ as the label:
So when the approximate algorithm picks quantiles, XGBoost actually uses quantiles weighted by the second derivative $h_i$ — a weighted quantile sketch, as shown in the figure below.
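The idea can be sketched by choosing candidates so that each bucket carries roughly equal total $h$-weight, rather than an equal number of samples (a simplified cumulative-weight scan on toy data, not XGBoost's actual sketch data structure):

```python
def weighted_candidates(values, h, n_buckets):
    """Pick candidate points so each bucket carries roughly equal h-weight."""
    pairs = sorted(zip(values, h))
    total = sum(w for _, w in pairs)
    step = total / n_buckets              # h-mass per bucket
    cands, acc, next_cut = [], 0.0, step
    for v, w in pairs:
        acc += w                          # cumulative h-weight
        if acc >= next_cut and len(cands) < n_buckets - 1:
            cands.append(v)
            next_cut += step
    return cands

# Toy data: most of the h-mass sits on the small values,
# so the candidates are pulled toward them
vals = [1, 2, 3, 4, 5, 6]
h    = [3.0, 3.0, 3.0, 1.0, 1.0, 1.0]
cands = weighted_candidates(vals, h, n_buckets=3)
```

With uniform weights the candidates would fall at equal sample counts; the $h$-weighting shifts them toward regions where the loss curvature is large.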
4) Column sampling and learning rate
XGBoost includes two further designs to improve its results:
- Column sampling
  - As in random forests, each node split considers not all features as candidates but only a subset.
  - This controls overfitting better and also reduces computational overhead.
- Learning rate
  - Also called step size or shrinkage. Concretely, a coefficient is multiplied onto each submodel (i.e., onto the regression value of each leaf node) so that a single tree does not fit too aggressively, leaving room for more stable later iterations. XGBoost's default is 0.3.
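In the XGBoost toolkit these knobs map to parameters such as eta (alias learning_rate) and the colsample_by* family. The dict below is an illustrative configuration sketch — the names follow the XGBoost parameter docs, but the values are example choices, not recommendations:

```python
# Illustrative parameter dict, e.g. for xgboost.train (names per the XGBoost
# parameter docs; values are examples only)
params = {
    "eta": 0.1,               # learning rate / shrinkage (default 0.3)
    "colsample_bytree": 0.8,  # fraction of features sampled per tree
    "colsample_bynode": 0.8,  # fraction of features sampled per split
    "max_depth": 6,           # maximum tree depth
    "lambda": 1.0,            # L2 regularization on leaf weights
    "gamma": 0.0,             # minimum split gain (complexity cost per leaf)
}
```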
5) Missing values and sparsity
XGBoost can also handle missing values, as well as feature sparsity (features dominated by zeros, e.g. from one-hot encoding). XGBoost uses a "sparsity-aware" strategy to handle both problems at once:
- Simply put, it treats sparse zero values like missing values and "bundles" them together; the split-point traversal skips this whole group, which greatly improves computational efficiency.
Whether a 0 is treated as a numeric value or as NA in XGBoost depends on the settings of the specific platform; preprocessing should distinguish zeros that are genuine numeric values (not to be treated as NA) from zeros that merely encode sparsity (to be treated as NA).
Split points are still found by traversal, with two possible directions for the NAs; on top of this, the non-missing values are traversed for the split.
As shown in the figure above, if a feature takes the values 1, 2, 5 plus a large number of NAs, XGBoost traverses the six cases above (three split points over the non-missing values × two directions for the missing values); the maximum split gain becomes this feature's gain, and in this example the NAs end up assigned to the right node.
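The sparsity-aware choice of default direction can be sketched by scoring both options for the accumulated NA statistics and keeping the better one (toy gradient statistics; a simplification of XGBoost's internals):

```python
def gain(GL, HL, GR, HR, G, H, lam):
    # Unregularized-by-gamma split gain from the derivation above
    s = lambda gg, hh: gg * gg / (hh + lam)
    return 0.5 * (s(GL, HL) + s(GR, HR) - s(G, H))

def split_with_default(G_left, H_left, G_right, H_right, G_na, H_na, lam=1.0):
    """Try sending the missing-value (NA) statistics left, then right."""
    G = G_left + G_right + G_na
    H = H_left + H_right + H_na
    gain_na_left = gain(G_left + G_na, H_left + H_na, G_right, H_right, G, H, lam)
    gain_na_right = gain(G_left, H_left, G_right + G_na, H_right + H_na, G, H, lam)
    direction = "left" if gain_na_left >= gain_na_right else "right"
    return direction, max(gain_na_left, gain_na_right)

# The NA gradients resemble the right child's here, so NA defaults right
d, g = split_with_default(G_left=-3.0, H_left=2.0, G_right=3.0, H_right=2.0,
                          G_na=2.0, H_na=1.0, lam=1.0)
```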
3. XGBoost engineering optimizations
1) Block-parallel column design
XGBoost pre-sorts each feature column and stores it in cache as a block, with an index from feature values to gradient statistics; the sorted blocks are reused at every node split. Moreover, different features are stored in separate blocks, so the computation can be distributed or multithreaded.
2) Cache access optimization
After the feature values are sorted, the gradients $g_i, h_i$ are fetched by index, which leads to non-contiguous memory access, lowering the cache hit rate and hurting efficiency. To solve this, XGBoost allocates a separate contiguous cache for each thread to store the gradient information.
3) Out-of-core block computation
A large amount of data cannot be loaded into memory at the same time. XGBoost stores data in the hard disk blocks and uses a separate thread to read data from the disk into memory, allowing computing and reading to take place simultaneously. To further improve disk read performance, XGBoost also uses two methods:
- ① Compress the block, and exchange the cost of decompression for the cost of disk reading.
- ② Store blocks on multiple disks to improve disk throughput.
4. XGBoost vs GBDT
Here’s a summary of the comparison between GBDT and XGBoost:
- GBDT is a machine learning algorithm; XGBoost adds a set of engineering optimizations on top of that algorithm.
- GBDT uses the first derivative of the loss function, equivalent to gradient descent in function space; XGBoost also uses the second derivative, equivalent to Newton's method in function space.
- Regularization: XGBoost explicitly adds a regularization term to control model complexity, effectively preventing overfitting.
- Column sampling: XGBoost adopts the random forest practice of randomly sampling rows and columns at each node split.
- Missing values: XGBoost handles missing values with its sparsity-aware strategy, while GBDT has no missing-value strategy.
- Parallel efficiency: XGBoost's column-block design effectively supports parallel computation and is more efficient.
For a summary of more supervised learning algorithms and models, see ShowMeAI's AI Knowledge and Skills Quick Reference | Machine Learning — Supervised Learning.
ShowMeAI related articles recommended
- 1. Machine learning basics
- 2. Model evaluation methods and criteria
- 3.KNN algorithm and its application
- 4. Detailed explanation of logistic regression algorithm
- 5. Detailed explanation of naive Bayes algorithm
- 6. Detailed explanation of decision tree model
- 7. Detailed explanation of random forest classification model
- 8. Detailed analysis of regression tree model
- 9. Detailed explanation of GBDT model
- 10. The XGBoost model is fully resolved
- 11. Detailed explanation of LightGBM model
- 12. Detailed explanation of support vector machine model
- 13. Detailed explanation of clustering algorithm
- 14.PCA dimension reduction algorithm in detail
ShowMeAI series tutorials recommended
- Illustrated Python programming: From beginner to Master series of tutorials
- Illustrated Data Analysis: From beginner to master series of tutorials
- The mathematical Basics of AI: From beginner to Master series of tutorials
- Illustrated Big Data Technology: From beginner to master
- Illustrated Machine learning algorithms: Beginner to Master series of tutorials