Author: Byte Mobile Technology — Xu Yaokun

First of all, XGBoost is a learning method for training decision tree models, or more precisely, tree ensemble models.

For example

Here is a simple data set: the horizontal axis is the dose of a drug, the green points are doses at which the drug is effective, and the red points are doses at which it is ineffective.

First, here is the decision tree model obtained after training.

Here, each non-leaf node is a decision condition. Because the data set is simple, there is only one feature to split on; in practice there will be multiple features. Each leaf node carries a weight. Note that the initial predicted probability is set to 0.5, which is the probability that a given dose of the drug is effective when we know nothing at all; the tree then gives an offset from this initial guess. For example, when Dosage is less than 5, the leaf node weight is -0.5, so the predicted probability is 0.5 + (-0.5) = 0, meaning the model judges the drug to be useless at this dose.

With multiple trees, you have several such decision trees, and the predicted result is the sum of the outputs of all the decision trees.

So with that simple example in mind, let’s do some simple math.

Theory

Tree ensemble model

In simple terms, a tree ensemble model is composed of multiple decision trees. For a given input, the prediction of the tree ensemble model is the sum of the predictions of the individual decision trees.

Here is the formal definition from the original paper.

Given a data set $\mathcal{D}$ containing $n$ training examples, each with $m$ features,


$$\mathcal{D} = \{(x_{i}, y_{i})\} \qquad (|\mathcal{D}|=n,\ x_{i}\in\mathbb{R}^{m},\ y_{i}\in\mathbb{R})$$

a tree ensemble model that combines $K$ decision trees can be expressed as


$$\hat{y}_{i} = \phi(x_{i})=\sum_{k=1}^{K}f_{k}(x_{i}), \qquad f_{k}\in\mathcal{F}$$

$f_{k}(\cdot)$ denotes the prediction of one decision tree for an input, and $\mathcal{F}$ denotes the space of all possible decision trees. It is important to note that a decision tree is defined by two things. One is the shape of the tree: what the tree looks like, how many branches it has, and how deep it is. The other is the weights of the tree's leaf nodes. During inference, the decision tree routes a test sample to the corresponding leaf node, and the weight of that leaf node is the score of the test sample under this decision tree, namely $f_{k}(x_{i})$.
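To make the definition concrete, here is a minimal Python sketch (not XGBoost's actual implementation) in which each "tree" is just a hard-coded function that routes a dosage value to a leaf and returns that leaf's weight; the thresholds and weights below are made up purely for illustration:

```python
# Minimal sketch of a tree ensemble: each "tree" routes the input to a leaf and
# returns that leaf's weight; the ensemble prediction is the base score plus the
# sum of all tree outputs. Thresholds and leaf weights are illustrative only.

def tree_1(dosage: float) -> float:
    # A depth-1 tree: a single split on Dosage < 15
    return -0.3 if dosage < 15 else 0.2

def tree_2(dosage: float) -> float:
    # A depth-2 tree: split on Dosage < 5, then on Dosage < 12
    if dosage < 5:
        return -0.2
    return 0.3 if dosage < 12 else -0.1

def predict(dosage: float, trees, base_score: float = 0.5) -> float:
    # y_hat = base_score + sum_k f_k(x), matching the ensemble formula above
    return base_score + sum(tree(dosage) for tree in trees)

print(predict(4.0, [tree_1, tree_2]))   # 0.5 + (-0.3) + (-0.2) = 0.0
print(predict(10.0, [tree_1, tree_2]))  # 0.5 + (-0.3) + 0.3 = 0.5
```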

How to train a tree ensemble model

Training a model means determining its parameters, which in the context of XGBoost means determining the shape of each decision tree and the weights of its leaves.

The loss function measures the difference between the model's predicted values and the true values. If the loss function reaches a sufficiently small value, the model is considered trained. Training the model is the process of adjusting its parameters to make the loss function smaller and smaller.

In general, the loss function of a tree ensemble model is defined as follows:


$$\mathcal{L}(\phi)=\sum_{i}l(\hat{y}_{i},y_{i})+\sum_{k}\Omega(f_{k})$$

$$\Omega(f)=\gamma T+\frac{1}{2}\lambda\|\omega\|^{2}$$

$\mathcal{L}(\cdot)$ is called the loss function. As you can see, it is a function of $\phi$; note that $\phi$ here is the tree ensemble model. The first term $\sum_{i}l(\hat{y}_{i},y_{i})$ measures the gap between the model's predictions and the true values. The second term $\sum_{k}\Omega(f_{k})$ is the sum of the complexities of all the trees. In the expansion of $\Omega(f)$, $T$ is the number of leaves in a tree, and $\omega$ is the vector of weights of all the leaf nodes of the tree, so $\|\omega\|^{2}$ is the sum of their squares.

Simply put, for a given input $x_{i}$, the model produces a corresponding output $\phi(x_{i})$. Given a loss function, the value $\mathcal{L}(\phi)$ can be computed, and the parameters of $\phi$ can, in principle, be obtained by minimizing this value.
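As a rough sketch of how this objective could be evaluated, the snippet below assumes squared error for $l$ and treats $\gamma$ and $\lambda$ as illustrative hyperparameter values; none of these choices come from the article itself:

```python
# Sketch of the regularized objective: L(phi) = sum_i l(y_hat_i, y_i) + sum_k Omega(f_k).
# Squared error is assumed for l; gamma and lam are illustrative hyperparameters.

def squared_error(y_hat: float, y: float) -> float:
    return (y_hat - y) ** 2

def omega(num_leaves: int, leaf_weights, gamma: float, lam: float) -> float:
    # Omega(f) = gamma * T + 0.5 * lambda * ||w||^2
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

def ensemble_loss(y_hat, y, trees_meta, gamma=1.0, lam=1.0):
    # trees_meta: list of (num_leaves, leaf_weights) pairs, one per tree
    data_term = sum(squared_error(p, t) for p, t in zip(y_hat, y))
    reg_term = sum(omega(T, w, gamma, lam) for T, w in trees_meta)
    return data_term + reg_term

# Example: two perfect predictions, one tree with two leaves of weights -0.5 and 0.5
print(ensemble_loss([0.0, 1.0], [0.0, 1.0], [(2, [-0.5, 0.5])]))  # 0 + (2 + 0.25) = 2.25
```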

As you can imagine, the approach above tries to solve for the shapes and weight parameters of all decision trees at the same time, which is difficult to do.

Gradient Tree Boosting

The Gradient Tree Boosting method is an algorithm for training decision trees. The idea is simple: since it is difficult to solve for all the parameters at once, solve for only one tree at a time. Furthermore, each time the parameters of a new tree are solved, that tree is used to correct the errors of the preceding trees.

The loss function of the gradient boosting tree algorithm is:


$$\mathcal{L}^{(t)} = \sum_{i=1}^{n}l\left(y_{i},\ \hat{y}_{i}^{(t-1)}+f_{t}(x_{i})\right)+\Omega(f_{t})$$

The loss function here is defined recursively. The difference from the loss function in the previous section is that there the predicted value was $\hat{y}_{i}$ and the parameters being optimized were those of all the decision trees, whereas here the predicted value is $\hat{y}_{i}^{(t-1)}+f_{t}(x_{i})$, i.e. the prediction of the first $t-1$ trees plus the prediction of the $t$-th tree, and the parameters being optimized are those of the $t$-th tree only.

The overall training process of the gradient boosting tree method can be summarized as solving for the parameters of each tree in turn (mainly determining the structure of the tree and the weights of its leaves). During training, thresholds are set in advance, such as the maximum depth of a tree; if splitting a tree further reduces the loss by less than a certain threshold, training of this tree is stopped and training of the next tree begins.
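The loop below is a minimal sketch of this additive training process, assuming squared-error loss and using single-split stumps as stand-in trees; it illustrates the idea of each new tree correcting the previous ones, not XGBoost's actual tree learner:

```python
# Sketch of gradient tree boosting: each round fits one new tree to the residual
# errors of the trees trained so far. Single-split stumps stand in for real trees;
# this is not XGBoost's actual tree-building code.
from dataclasses import dataclass
from typing import List

@dataclass
class Stump:
    threshold: float
    left_weight: float
    right_weight: float

    def predict(self, x: float) -> float:
        return self.left_weight if x < self.threshold else self.right_weight

def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    # Pick the split that minimizes the squared error of the residuals.
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        if not left or not right:
            continue
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, Stump(thr, wl, wr))
    return best[1]

def boost(xs, ys, n_rounds=3, base_score=0.5):
    trees, preds = [], [base_score] * len(xs)
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + tree.predict(x) for p, x in zip(preds, xs)]
    return trees, preds

xs, ys = [4, 9, 11, 17], [0, 1, 1, 0]
trees, preds = boost(xs, ys)
print(preds)  # predictions move from the initial 0.5 toward the labels each round
```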

XGBoost

Building on ordinary gradient boosting trees, XGBoost uses a Taylor expansion to approximate the recursive loss function and then solves for the parameters of the tree ensemble model.

$\mathcal{L}^{(t)}$ is a function of $f_{t}(x_{i})$. Taking the second-order Taylor expansion of $\mathcal{L}^{(t)}$ gives:


$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[l(y_{i},\hat{y}^{(t-1)})+g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\Omega(f_{t})$$

where $g_{i} = \frac{\partial l(y_{i}, \hat{y}^{(t-1)})}{\partial \hat{y}^{(t-1)}}$ and $h_{i} = \frac{\partial^{2} l(y_{i}, \hat{y}^{(t-1)})}{\partial \left(\hat{y}^{(t-1)}\right)^{2}}$ are the first and second derivatives of $l(y_{i}, \hat{y}^{(t-1)})$ with respect to $\hat{y}^{(t-1)}$. With the Taylor expansion, the loss function becomes a quadratic function, and since the coefficient of the quadratic term is positive, the minimum of the function can be conveniently obtained.
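For instance (a standard result, not specific to this article's example), with squared error $l(y,\hat{y})=\frac{1}{2}(\hat{y}-y)^{2}$ these derivatives are $g_{i}=\hat{y}^{(t-1)}-y_{i}$ and $h_{i}=1$, while for the logistic loss on raw scores they are $g_{i}=p_{i}-y_{i}$ and $h_{i}=p_{i}(1-p_{i})$ with $p_{i}=\sigma(\hat{y}^{(t-1)})$. A small sketch of both:

```python
import math

# Sketch: first and second derivatives (g_i, h_i) of two common losses with
# respect to the previous prediction y_hat. Standard results, not tied to the
# article's worked example.

def grad_hess_squared_error(y_hat: float, y: float):
    # l = 0.5 * (y_hat - y)^2  ->  g = y_hat - y, h = 1
    return y_hat - y, 1.0

def grad_hess_logistic(y_hat: float, y: float):
    # l = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(y_hat), y_hat a raw score
    # ->  g = p - y, h = p * (1 - p)
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return p - y, p * (1.0 - p)

print(grad_hess_squared_error(0.5, 1.0))  # (-0.5, 1.0)
print(grad_hess_logistic(0.0, 1.0))       # (-0.5, 0.25), since sigmoid(0) = 0.5
```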

$f_{t}(x_{i})$ denotes the output of the $t$-th tree for input $x_{i}$. The upper limit $n$ of the summation is the total number of samples in the training data set.

At this point, our goal is to solve for $f_{t}(\cdot)$, which plays the role of the variable in the quadratic function above, so the minimum lies at the axis of symmetry. For simplicity, the constant terms that have nothing to do with $f_{t}(\cdot)$ are dropped:


$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\Omega(f_{t})$$

$f_{t}(\cdot)$ is determined by two main aspects: one is the shape of the tree, and the other is the weights of the tree's leaves. The mathematical expression is:


$$\sum_{i=1}^{n}f_{t}(x_{i}) = \sum_{j=1}^{T}\sum_{i\in I_{j}}\omega_{j}$$

The left hand side of the equation is the sum of the scores of all samples on the $t$-th tree, and the right hand side expresses the same value in a different way: the outer summation runs over all leaves, and the inner summation runs over the set of samples assigned to each leaf node. $I_{j}$ denotes the set of samples assigned to the $j$-th leaf node, and $\omega_{j}$ is the weight of the $j$-th leaf node. One way to think about it: which leaf node each sample is assigned to is exactly what encodes the structure of the tree.

Plugging this reformulation into the Taylor-approximated loss function gives:


$$\mathcal{L}^{(t)} \simeq \sum_{j=1}^{T}\left[\Big(\sum_{i\in I_{j}}g_{i}\Big)w_{j}+\frac{1}{2}\Big(\sum_{i\in I_{j}}h_{i}+\lambda\Big)w_{j}^{2}\right]+\gamma T$$
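To see where this comes from, note that every sample $i$ that falls into leaf $j$ satisfies $f_{t}(x_{i})=w_{j}$, and expand $\Omega(f_{t})=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w_{j}^{2}$; grouping the per-sample sum by leaves then gives:

$$\sum_{i=1}^{n}\left[g_{i}f_{t}(x_{i})+\frac{1}{2}h_{i}f_{t}^{2}(x_{i})\right]+\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w_{j}^{2} = \sum_{j=1}^{T}\left[\Big(\sum_{i\in I_{j}}g_{i}\Big)w_{j}+\frac{1}{2}\Big(\sum_{i\in I_{j}}h_{i}+\lambda\Big)w_{j}^{2}\right]+\gamma T$$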

Given the tree structure, we can derive the approximately optimal weight of each leaf node. The condition that the tree structure is fixed is implied by the sets $I_{j}$ in the formula: these sets must be fixed before the next step of the calculation can be carried out.

The optimal solution is:


$$w^{\ast}_{j}=-\frac{\sum_{i\in I_{j}}g_{i}}{\sum_{i\in I_{j}}h_{i}+\lambda}$$

So far, the solution for the tree's leaf weights has been derived. Substituting this solution back into the loss function gives the minimum loss.
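Substituting $w^{\ast}_{j}$ back into the loss gives the minimum loss achievable for a fixed tree structure (the 'structure score' in the original paper), and from it the reduction in loss obtained by splitting one leaf into a left child $I_{L}$ and a right child $I_{R}$ (with $I=I_{L}\cup I_{R}$):

$$\mathcal{L}^{(t)}_{\min}=-\frac{1}{2}\sum_{j=1}^{T}\frac{\left(\sum_{i\in I_{j}}g_{i}\right)^{2}}{\sum_{i\in I_{j}}h_{i}+\lambda}+\gamma T$$

$$\mathcal{L}_{\mathrm{split}}=\frac{1}{2}\left[\frac{\left(\sum_{i\in I_{L}}g_{i}\right)^{2}}{\sum_{i\in I_{L}}h_{i}+\lambda}+\frac{\left(\sum_{i\in I_{R}}g_{i}\right)^{2}}{\sum_{i\in I_{R}}h_{i}+\lambda}-\frac{\left(\sum_{i\in I}g_{i}\right)^{2}}{\sum_{i\in I}h_{i}+\lambda}\right]-\gamma$$

This split gain is exactly what the split-finding procedure described next maximizes.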

So how do you determine the structure of the tree?

As the original paper puts it: "One of the key problems in tree learning is to find the best split. In order to do so, a split finding algorithm enumerates over all the possible splits on all the features."

The answer is enumeration. When splitting a node, to decide which samples go to the left child and which go to the right, XGBoost sorts all the samples by a given feature and then enumerates the candidate split points, computing the loss for each candidate and keeping the split with the minimum loss as the optimal structure.
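A minimal Python sketch of this exact greedy split finding on a single feature, simplified from Algorithm 1 of the original paper (the sample data, gradients, and Hessians below are hypothetical, chosen only to show the mechanics):

```python
# Sketch of exact greedy split finding on one feature: sort the samples by the
# feature value, enumerate every split point, and keep the one with the largest
# gain. Simplified from Algorithm 1 of the original paper.

def leaf_score(g_sum: float, h_sum: float, lam: float) -> float:
    # Structure-score contribution of one leaf: G^2 / (H + lambda)
    return g_sum * g_sum / (h_sum + lam)

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = sorted(range(len(x)), key=lambda i: x[i])
    g_total, h_total = sum(g), sum(h)
    gl = hl = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i = order[pos]
        gl, hl = gl + g[i], hl + h[i]          # statistics of the left child
        gr, hr = g_total - gl, h_total - hl    # statistics of the right child
        gain = 0.5 * (leaf_score(gl, hl, lam) + leaf_score(gr, hr, lam)
                      - leaf_score(g_total, h_total, lam)) - gamma
        if gain > best_gain:
            # Place the threshold halfway between two consecutive feature values
            best_gain = gain
            best_threshold = (x[i] + x[order[pos + 1]]) / 2.0
    return best_threshold, best_gain

# Hypothetical per-sample gradients and Hessians for four samples
x = [1.0, 2.0, 3.0, 4.0]
g = [0.9, 0.8, -1.0, -1.1]
h = [1.0, 1.0, 1.0, 1.0]
print(best_split(x, g, h, lam=0.0))  # -> (2.5, 1.805): split between the 2nd and 3rd samples
```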

Training example

Putting the data from the figure at the beginning of the article into a table:

| Drug Dosage | Effectiveness | $g_{i}$ | $h_{i}$ |
| --- | --- | --- | --- |
| 4 | 0 | 0.5 | 0.75 |
| 9 | 1 | 0.5 | 0.25 |
| 11 | 1 | 0.5 | 0.25 |
| 17 | 0 | 0.5 | 0.75 |

First, note that the initial predicted value for all samples is 0.5, corresponding to $\hat{y}^{(t-1)}$ in the formulas above. The label of negative samples is 0 and the label of positive samples is 1, corresponding to $y_{i}$ in the formulas above. At the beginning of training, all samples are in a single leaf node.

The figure shows the state of this leaf node at the start of training: all the training samples sit in one leaf node, and the leaf node is represented by its feature values (Drug Dosage):

The data is substituted into the formula for $w^{\ast}_{j}$ to compute the weight of the leaf node, which is then substituted into the corresponding formula for $\mathcal{L}^{(t)}$ to obtain the loss at this point. To keep the calculation simple, the regularization term is dropped, i.e. $\gamma = 0$. This gives $\mathcal{L}^{(t)} = 0$.

We first try splitting the feature at Dosage < 15, which gives this tree:

In this case, the loss of this tree is equal to the loss of the left subtree plus the loss of the right subtree. By substituting the data, we can get $\mathcal{L}^{(t)}=-1.33$. The loss is reduced.

We then enumerate in turn, calculating the loss reduction for Dosage < 10 and Dosage < 5, and take the split with the maximum loss reduction as the way the node is split. Each subtree is then split recursively, until every leaf node contains only one sample or the depth of the tree reaches a preset limit; at that point training of this tree stops, and the decision tree is complete.

The remaining trees are trained in the same way, except that the starting prediction is no longer the initial guess of 0.5 but 0.5 plus the sum of all the previous trees' predictions.

Notes

  1. Most of the formulas in this article come from the original paper, but the following formula was added by the author, in the hope that it helps with the derivation and understanding of the formulas:

$$\sum_{i=1}^{n}f_{t}(x_{i}) = \sum_{j=1}^{T}\sum_{i\in I_{j}}\omega_{j}$$
  2. The derivation of the first and second derivatives of the loss function for the classification problem is omitted in this article.
  3. Besides using the second-order Taylor expansion to find the extreme value, XGBoost includes a number of algorithm-level and hardware-level optimizations in its implementation, some of which are listed below:
  • Parallel split finding per feature: each feature is sorted first, and since the split evaluations at different positions of a feature are independent of each other (such as Dosage < 15 and Dosage < 10 in the example above), the computation can be carried out with parallel threads to speed up training.

  • Gradient data caching: in the raw data, samples are not stored in order of feature value; in other words, for a given feature the samples are not stored contiguously. As a result, fetching the corresponding first and second derivatives while computing the loss function leads to non-contiguous memory access. XGBoost places the gradient statistics in a separate buffer and uses prefetching and caching to improve the cache hit rate, thereby improving data I/O speed.

  • Out-of-core computation: when the amount of data is too large to fit on a single machine, the data is split into blocks and all the blocks are stored on disk. During computation, a separate thread prefetches blocks from disk, ensuring that computation and data loading can happen concurrently.

Reference

  1. The original paper: Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. KDD 2016.
  2. Example source

About the Byte Mobile Platform team

The Bytedance Client Infrastructure team is an industry leader in big front-end infrastructure. It is responsible for building big front-end infrastructure across the entire China region of Bytedance and for improving the performance, stability, and engineering efficiency of the company's whole product line. The supported products include, but are not limited to, Douyin, Toutiao, Xigua Video, and Huoshan Video, and the team conducts in-depth research across mobile, Web, Desktop, and other platforms.

Now is the time! We are recruiting globally for client, front-end, server, intelligent algorithm, and test development roles. Let's use technology to change the world! If you are interested, please contact [email protected]. Email subject: Resume - Name - Job Objective - Expected City - Phone Number.