In machine learning, deep learning is dominated by neural networks, while shallow learning is still the domain of tree models. Although deep learning is strong on large-scale data, its performance on small-scale data is often unsatisfactory. Ensemble tree algorithms (RF, GBDT, XGBoost, etc.), on the other hand, offer high interpretability, easy hyperparameter tuning, fast training, and require almost no feature engineering. They clearly outperform deep learning on small and medium-sized datasets, and for heterogeneous tabular data (e.g. age, income, city in risk-control scenarios) ensemble tree models beat deep learning even on large datasets. In practice, Facebook, Alibaba, and other large companies use GBDT combined with LR as the technical backbone of CTR estimation, advertising recommendation, and other core businesses, and XGBoost has repeatedly shone in Kaggle competitions in recent years.

This post covers Boosting in ensemble learning and its family members AdaBoost and GBDT. If you are not familiar with basic decision tree algorithms, I recommend first reading another blog post: an introduction to the ID3, C4.5, and CART decision tree algorithms.





Ensemble learning

  • The goal of traditional machine learning is to train a single "strong learner" (a decision tree, neural network, etc.) with prediction accuracy as high as possible. However, it is difficult to find an optimal strong learner in practice, which motivated "ensemble learning". Ensemble learning holds that multiple "weak learners" with low prediction accuracy (each only required to predict slightly better than random guessing) can, if combined in an appropriate way, be promoted into a strong learner with much higher prediction accuracy. Based on this idea, Schapire constructed the first polynomial-time algorithm in 1990, proving the claim constructively; that algorithm was the original Boosting. Ensemble learning has since developed into a mature and widely used family of machine learning methods, including RF, GBDT, XGBoost, and many other excellent algorithms.

  • There are three main types of ensemble learning: Bagging, Boosting, and Stacking. This post focuses on the Boosting method and its related algorithms (AdaBoost, GBDT).


Boosting methods

  • Unlike Bagging's parallel training, Boosting is serial and stage-wise. It starts from a first base learner; each subsequent base learner adjusts the weight distribution of the samples according to whether the previous learner predicted each sample correctly, and each base learner's own weight is set according to its prediction accuracy. This yields a series of learners with different weights, and their weighted combination is the final learner. The specific steps are as follows:

  • 1. Give every sample the same weight (1/n; this weight is used to compute error or accuracy rates, and in some classification algorithms it never changes) and build a weak classifier.

  • 2. Change the sample weight distribution: increase the weights of samples the classifier got wrong, and decrease the weights of samples it got right. Change the classifier's weight: compute the classifier's accuracy; the higher the accuracy, the larger its weight.

  • 3. Construct a new weak classifier according to the sample data after changing the weight distribution.

  • 4. Repeat steps 2 and 3 until a predetermined number of learners or a predetermined prediction accuracy is reached.

  • Intuitive understanding:

    • At each step, Boosting focuses more on the misclassified samples, making them easier to classify correctly in the next step;
    • Classifiers with higher accuracy receive larger weights, giving better classifiers more say in the final linear combination. This differs from Bagging, where every classifier has the same weight.
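The reweighting scheme described above is implemented by, for example, scikit-learn's AdaBoostClassifier. A minimal usage sketch on a synthetic dataset (the dataset and hyperparameters are illustrative, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification problem (illustrative parameters).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 50 weak learners (depth-1 trees by default), combined by weighted vote.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The ensemble of 50 decision stumps should score well above chance on this separable synthetic problem.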


AdaBoost (Adaptive Boosting)

  • AdaBoost (Adaptive Boosting) is the introductory algorithm of the Boosting family. The original AdaBoost algorithm was designed for binary classification, so for a training set

T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \ldots,\left(x_{n}, y_{n}\right)\right\}
  • where $x_{i} \in \mathcal{X} \subseteq \mathbb{R}^{n}$, $y_{i} \in \mathcal{Y}=\{-1,+1\}$, first initialize the weights of the training set

\begin{aligned} D_{1} &=\left(w_{11}, w_{12}, \ldots, w_{1 n}\right) \\ w_{1 i} &=\frac{1}{n}, \quad i=1,2, \ldots, n \end{aligned}
  • In each round $m$, the training data $T_{m}$ is obtained by weighting (or resampling) the training set according to $D_{m}$, and the round's base learner $h_{m}$ is trained on $T_{m}$. The error of the base learner $h_{m}$ is $\epsilon_{m}$, and the learner's weight coefficient in the final learner is computed from this error:

\alpha_{m}=\frac{1}{2} \ln \frac{1-\epsilon_{m}}{\epsilon_{m}}
  • Update the weights of the training set

\begin{aligned} D_{m+1} &=\left(w_{m+1,1}, w_{m+1,2}, \ldots, w_{m+1, n}\right) \\ w_{m+1, i} &=\frac{w_{m, i}}{Z_{m}} \exp \left(-\alpha_{m} y_{i} h_{m}\left(x_{i}\right)\right) \end{aligned}
  • where $Z_{m}$ is the normalization factor

Z_{m}=\sum_{i=1}^{n} w_{m, i} \exp \left(-\alpha_{m} y_{i} h_{m}\left(x_{i}\right)\right)
  • which guarantees that $D_{m+1}$ is a probability distribution. Finally, from the $M$ constructed base learners, the final learner is obtained:

h_{f}(x)=\operatorname{sign}\left(\sum_{m=1}^{M} \alpha_{m} h_{m}(x)\right)
  • The following legend gives an intuitive picture of AdaBoost:

  • 1. In the left figure, all data points initially have the same weight (the plus and minus signs are the same size). The first base learner splits the data into two parts; the three plus-sign points are misclassified, so we give those three points larger weights. At the same time, the learner itself is assigned a weight according to its classification accuracy.

  • 2. In the middle figure, the three plus-sign points misclassified in the previous round now carry larger weights. The second base learner is trained on this reweighted data and produces the second classification model; since it misclassifies three minus-sign points, those three points are given larger weights for the next round. Again, the learner is assigned a weight according to its accuracy.

  • 3. Repeat the same steps for the third base learner on the right.

  • 4. Finally, the weighted linear combination of the three base learners is used as the final strong learner, which performs better than any of the individual weak classifiers.
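The whole AdaBoost loop above can be sketched from scratch in a few lines, using depth-1 trees (decision stumps) as the weak learners. This is a minimal sketch on made-up synthetic data, not a production implementation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, relabeled to {-1, +1} as the derivation assumes.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y - 1

n = len(y)
w = np.full(n, 1.0 / n)                        # D_1: uniform sample weights
stumps, alphas = [], []
for m in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)      # learner weight alpha_m
    w *= np.exp(-alpha * y * pred)             # up-weight misclassified samples
    w /= w.sum()                               # normalize by Z_m
    stumps.append(stump)
    alphas.append(alpha)

# Final learner: sign of the weighted vote of all stumps.
H = np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
ensemble_acc = (H == y).mean()
stump_acc = (stumps[0].predict(X) == y).mean()
```

On the training data, the weighted combination of 20 stumps should beat the first stump alone, mirroring the legend above.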



GBDT (Gradient Boosting Decision Tree)


Forward stagewise algorithm and gradient boosting

  • GBDT (Gradient Boosting Decision Tree) and other boosting tree models are based on the "forward stagewise algorithm". The forward stagewise algorithm can be understood as follows: suppose we want to use decision trees to predict a person's age. At initialization, the model directly outputs a prediction $f_{0}(x)$; note that this value requires no tree training and need not be accurate (for example, the model may initially predict 0 years, or some reasonable value based on the age distribution of the population). We then train the first decision tree on top of this prediction, so the model's output becomes the initial prediction plus the first tree's output. We then add a second decision tree, trained on top of the model built so far, to minimize the overall loss. This process continues until the number of trees reaches the requirement or the total loss falls below a threshold.

  • At step $m$ of the forward stagewise algorithm, the prediction function can be expressed as:


    f_{m}(x)=f_{m-1}(x)+\beta_{m} T\left(x ; \Theta_{m}\right)
  • where $f_{m-1}(x)$ is the prediction function after step $m-1$, $T\left(x ; \Theta_{m}\right)$ is the $m$-th decision tree we currently need to construct, and $\beta_{m}$ is the learning rate (also called the step size). Since the previous $m-1$ decision trees are already trained and their parameters fixed, for a fixed $\beta_{m}$ we only need to train the parameters $\Theta_{m}$ of the $m$-th tree at step $m$ to minimize the current total loss:


\hat{\Theta}_{m}=\underset{\Theta_{m}}{\arg \min } \sum_{i=1}^{N} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+\beta_{m} T\left(x_{i} ; \Theta_{m}\right)\right)
  • where $N$ is the total number of samples and $L$ is the loss function. With the forward stagewise algorithm in place, let's review how loss functions are minimized in machine learning.
  • Suppose we have a loss function $J(\theta)$. To minimize it over the parameters $\theta$, we can simply adjust $\theta$ in the direction of the negative gradient of the loss: the gradient is the direction of fastest increase, so the function decreases fastest along the negative gradient. With learning rate $\rho$, $\theta$ is updated as follows:

\theta_{i}:=\theta_{i}-\rho \frac{\partial J}{\partial \theta_{i}}
  • By the same token, for the forward stagewise algorithm to minimize the overall loss $J=\sum_{i} L\left(y_{i}, f_{m-1}\left(x_{i}\right)\right)$, we need to differentiate with respect to $f_{m-1}(x)$. Since $f_{m-1}(x)$ is evaluated on $N$ samples, we take the partial derivative with respect to the predicted value of each sample:

f_{m}\left(x_{i}\right):=-\rho \frac{\partial J}{\partial f_{m-1}\left(x_{i}\right)}
  • Taking $\beta_{m}=\rho=1$: the $m$-th decision tree no longer fits the $y_{i}$ of the original data $\left(x_{i}, y_{i}\right)$ but the negative gradient, because this makes the overall loss function decrease fastest. Therefore, for the $m$-th decision tree, the sample data set to be fitted is updated to:

\left\{\left(x_{1},-\frac{\partial J}{\partial f_{m-1}\left(x_{1}\right)}\right),\left(x_{2},-\frac{\partial J}{\partial f_{m-1}\left(x_{2}\right)}\right), \cdots,\left(x_{n},-\frac{\partial J}{\partial f_{m-1}\left(x_{n}\right)}\right)\right\}, \quad m=1,2, \cdots, M
  • After enough iterations the loss function becomes small enough, and the final GBDT prediction is the sum of the outputs of all previous trees, which will be very close to the true value $y$.
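For the squared error loss, this function-space gradient step can be checked numerically. A tiny sketch (all values are made up) showing that the negative gradient with respect to each prediction is exactly the residual, so a gradient step with $\rho=1$ amounts to adding the residual:

```python
import numpy as np

# For squared loss J = 1/2 * sum_i (y_i - f(x_i))^2, the partial derivative
# w.r.t. each prediction f(x_i) is f(x_i) - y_i, so the negative gradient
# is exactly the residual y_i - f(x_i).
y = np.array([3.0, -1.0, 2.0])
f_prev = np.array([2.5, 0.0, 2.0])       # predictions after m-1 trees (made up)

neg_grad = y - f_prev                    # what the m-th tree is trained to fit
f_next = f_prev - 1.0 * (f_prev - y)     # gradient step with rho = 1
```

With $\rho=1$ the updated predictions land exactly on the targets; in practice the step is shrunk by the learning rate and approximated by a tree.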



GBDT regression algorithm

  • For regression problems, we take the squared error loss function as an example to introduce GBDT's treatment. The squared error loss function is:

L(y, f(x))=\frac{1}{2}(y-f(x))^{2}
  • When constructing the $m$-th decision tree, the loss on each sample $i$ is $J=L\left(y_{i}, f_{m-1}\left(x_{i}\right)\right)$, and its gradient (partial derivative) with respect to $f_{m-1}\left(x_{i}\right)$ is:

\frac{\partial J}{\partial f_{m-1}\left(x_{i}\right)}=\frac{\partial \sum_{i} L\left(y_{i}, f_{m-1}\left(x_{i}\right)\right)}{\partial f_{m-1}\left(x_{i}\right)}=\frac{\partial L\left(y_{i}, f_{m-1}\left(x_{i}\right)\right)}{\partial f_{m-1}\left(x_{i}\right)}=f_{m-1}\left(x_{i}\right)-y_{i}
  • Then the corresponding negative gradient is:

-\frac{\partial J}{\partial f_{m-1}\left(x_{i}\right)}=y_{i}-f_{m-1}\left(x_{i}\right)

$y_{i}-f_{m-1}\left(x_{i}\right)$ is exactly the residual the current decision tree needs to fit. Therefore, for the boosting tree model on regression problems, each decision tree only needs to fit the residual.

Note: $f_{m-1}(x)$ is not the $(m-1)$-th decision tree but the prediction function after $m-1$ trees; likewise, $f_{m-1}\left(x_{i}\right)$ is that function's predicted value on sample $x_{i}$. The $(m-1)$-th decision tree only fits the negative gradient, while the prediction function $f_{m-1}(x)$ is the fitted negative gradient plus the predictions of the previous trees.

  • The GBDT procedure for regression problems:

    • Input: training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, $x_{i} \in \mathcal{X} \subseteq \mathbb{R}^{n}$, $y_{i} \in \mathcal{Y} \subseteq \mathbb{R}$, and the learning rate $\beta_{m}$ (also called the step size)

    • Output: the prediction function $f_{M}(x)$

    1. Initialize $f_{0}(x)=\frac{1}{N} \sum_{i=1}^{N} y_{i}$

    2. For each decision tree $m=1,2, \cdots, M$:

      (a) Calculate the residual of each sample $i$:


      r_{m i}=y_{i}-f_{m-1}\left(x_{i}\right), \quad i=1,2, \cdots, N

      (b) Fit a regression tree to the residuals $r_{m i}$, obtaining the leaf regions $R_{m j}, j=1,2, \cdots, J$ of the $m$-th tree. Note: the decision tree (a CART regression tree) also uses squared error as the loss function for its internal splits.

      (c) For each leaf node $j=1,2, \cdots, J$, compute the optimal output value:


      c_{m j}=\arg \min _{c} \sum_{x_{i} \in R_{m j}} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+c\right)

      This step computes the value to assign to the samples in each leaf node region so as to minimize the model's loss, so it must be computed under the current loss function. Since the CART regression tree uses the squared error loss, the result of this step is:


      c_{m j}=\frac{1}{N_{m j}} \sum_{x_{i} \in R_{m j}} r_{m i}

      Namely, the mean of residuals in each leaf node region.

      (d) Update the prediction function: $f_{m}(x)=f_{m-1}(x)+\beta_{m} \sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right)$.

    3. Finally, the gradient boosting tree for the regression problem is obtained:


      f_{M}(x)=\sum_{m=1}^{M} \beta_{m} \sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right)

      Note: $\beta_{m}$ ranges over $(0,1)$ and plays the same role as the learning rate (step size) in gradient descent: it helps the loss function converge to a smaller value. A smaller $\beta_{m}$ makes the model tend to train more decision trees, which makes the model more stable and less prone to overfitting. Therefore $\beta_{m}$ can also serve as a regularization parameter in GBDT.
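The regression procedure above can be sketched directly, using scikit-learn's CART regression trees as base learners. A minimal sketch on a made-up toy problem (all hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # toy regression target

M, beta = 100, 0.1                  # number of trees and learning rate
f0 = y.mean()                       # step 1: initialize with the mean
pred = np.full_like(y, f0)
trees = []
for m in range(M):
    r = y - pred                    # step 2(a): residuals (negative gradient)
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    # steps 2(b)-(c): with squared error, leaf values are residual means c_mj
    pred += beta * tree.predict(X)  # step 2(d): shrunken update
    trees.append(tree)

mse_baseline = np.mean((y - f0) ** 2)   # loss of the constant f_0
mse_gbdt = np.mean((y - pred) ** 2)     # loss after M boosting rounds
```

The training MSE after 100 rounds should be far below that of the constant initialization, illustrating how each round of residual fitting shrinks the total loss.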



GBDT binary classification algorithm

  • Since GBDT's base decision tree is a CART regression tree with squared error as the splitting criterion, its leaf outputs range over all real numbers. We can borrow the trick logistic regression uses, nesting a sigmoid (logistic) function over the linear regression output, so that GBDT can handle classification problems. This section covers GBDT's binary classification treatment.

  • We can define the loss function of binary GBDT as:


L(y, f(x))=-y \log \hat{y}-(1-y) \log (1-\hat{y}), \quad \hat{y}=\frac{1}{1+e^{-f(x)}}
  • $\hat{y}$ is the sigmoid of the decision trees' output, representing the predicted probability that the sample's class is 1, i.e. $P(y=1 \mid x)$; $y$ is the actual class value of the sample (0 or 1).

  • When constructing the $m$-th decision tree, with loss $J=\sum_{i} L\left(y_{i}, f_{m-1}\left(x_{i}\right)\right)$, the gradient with respect to each sample's predicted value can be calculated as:


    \begin{aligned} \frac{\partial J}{\partial f_{m-1}\left(x_{i}\right)} &=\frac{\partial L\left(y_{i}, f_{m-1}\left(x_{i}\right)\right)}{\partial f_{m-1}\left(x_{i}\right)} \\ &=\frac{\partial\left[-y_{i} \log \frac{1}{1+e^{-f_{m-1}\left(x_{i}\right)}}-\left(1-y_{i}\right) \log \left(1-\frac{1}{1+e^{-f_{m-1}\left(x_{i}\right)}}\right)\right]}{\partial f_{m-1}\left(x_{i}\right)} \\ &=\frac{\partial\left[y_{i} \log \left(1+e^{-f_{m-1}\left(x_{i}\right)}\right)+\left(1-y_{i}\right)\left[f_{m-1}\left(x_{i}\right)+\log \left(1+e^{-f_{m-1}\left(x_{i}\right)}\right)\right]\right]}{\partial f_{m-1}\left(x_{i}\right)} \\ &=\frac{\partial\left[\left(1-y_{i}\right) f_{m-1}\left(x_{i}\right)+\log \left(1+e^{-f_{m-1}\left(x_{i}\right)}\right)\right]}{\partial f_{m-1}\left(x_{i}\right)} \\ &=\frac{1}{1+e^{-f_{m-1}\left(x_{i}\right)}}-y_{i} \\ &=\hat{y}_{i}-y_{i} \end{aligned}
  • The calculated result is the difference between the category prediction probability value of the decision tree and the actual category value, then the negative gradient is:


-\frac{\partial J}{\partial f_{m-1}\left(x_{i}\right)}=y_{i}-\hat{y}_{i}
  • Therefore, for binary classification GBDT, each decision tree fits the difference between the actual class value and the predicted class probability. The GBDT procedure for binary classification problems is as follows:

    • Input: training data set $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, $x_{i} \in \mathcal{X} \subseteq \mathbb{R}^{n}$, $y_{i} \in\{0,1\}$, and the learning rate $\beta_{m}$
    • Output: the prediction function $f_{M}(x)$
    1. Initialize $f_{0}(x)=\log \frac{p_{1}}{1-p_{1}}$, where $p_{1}$ is the proportion of samples with $y=1$.

    2. For $m=1,2, \cdots, M$:

      (a) Calculate the negative gradient for each sample $i$:


      r_{m i}=y_{i}-\hat{y}_{i}, \quad i=1,2, \cdots, N

      where $\hat{y}_{i}=\frac{1}{1+e^{-f_{m-1}\left(x_{i}\right)}}$ represents the predicted probability that $y_{i}=1$.

      (b) Fit a regression tree to the negative gradients $r_{m i}$, obtaining the leaf regions $R_{m j}, j=1,2, \cdots, J$ of the $m$-th tree. Note that this step also splits nodes by squared error loss.

      (c) For each leaf $j=1,2, \cdots, J$, calculate:


      c_{m j}=\arg \min _{c} \sum_{x_{i} \in R_{m j}} L\left(y_{i}, f_{m-1}\left(x_{i}\right)+c\right)

      (d) Update $f_{m}(x)=f_{m-1}(x)+\beta_{m} \sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right)$

    3. Finally, the binary gradient boosting tree is obtained:


      f_{M}(x)=\sum_{m=1}^{M} \beta_{m} \sum_{j=1}^{J} c_{m j} I\left(x \in R_{m j}\right)
    4. Finally, the prediction function is nested inside the logistic function, so the predicted probability can be expressed as:


      P(y=1 \mid x)=\frac{1}{1+e^{-f_M(x)}}
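The binary procedure can be sketched like the regression one, with one simplification: instead of the per-leaf arg-min of step (c), the sketch below keeps the leaf values from the plain squared-error fit to the negative gradient (a common shortcut, not the exact step (c)). Data and hyperparameters are made up:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # y in {0,1}

p1 = y.mean()
F = np.full(len(y), np.log(p1 / (1 - p1)))  # step 1: log-odds initialization
beta = 0.1
trees = []
for m in range(100):
    r = y - sigmoid(F)                       # negative gradient: y_i - y_hat_i
    tree = DecisionTreeRegressor(max_depth=2).fit(X, r)
    F += beta * tree.predict(X)              # simplified update (see note above)
    trees.append(tree)

proba = sigmoid(F)                           # step 4: nested logistic function
train_acc = ((proba > 0.5).astype(int) == y).mean()
```

Even with the simplified leaf values, fitting the probability residual `y - sigmoid(F)` each round drives the training accuracy well above the base rate.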



GBDT multi-class classification algorithm

  • Just as GBDT nests a logistic regression function over the regression tree model for binary classification, for multi-class problems GBDT adopts "multinomial logistic regression" (Softmax Regression). Here is a brief introduction to multinomial logistic regression.

  • "Multinomial logistic regression" (Softmax Regression) can be seen as a generalization of logistic regression, and conversely, logistic regression can be regarded as a special case of multinomial logistic regression. Logistic regression solves for two probabilities: $P(y=1 \mid x ; \theta)$ and $P(y=0 \mid x ; \theta)$. Softmax Regression instead produces $k$ probabilities, where $k$ is the number of classes. The Softmax Regression prediction function is:


    \begin{array}{c} h_{\theta}(x) =P\left(Y=y_{i} \mid x\right)= {\left[\begin{array}{c} P(Y=1 \mid x ; \theta) \\ P(Y=2 \mid x ; \theta) \\ \cdot \\ \cdot \\ \cdot \\ P(Y=k \mid x ; \theta) \end{array}\right]} =\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x}}\left[\begin{array}{c} e^{\theta_{1}^{T} x} \\ e^{\theta_{2}^{T} x} \\ \cdot \\ \cdot \\ \cdot \\ e^{\theta_{k}^{T} x} \end{array}\right] \end{array}
  • where $\theta_{1}^{T}, \theta_{2}^{T}, \ldots, \theta_{k}^{T}$ are the parameter vectors of the $k$ models; the final factor $\frac{1}{\sum_{j=1}^{k} e^{\theta_{j}^{T} x}}$ normalizes each class probability into $[0,1]$ so that the probabilities sum to 1. Softmax Regression assigns input $x_{i}$ to class $j$ with probability:


p\left(y_{i}=j \mid x_{i} ; \theta\right)=\frac{e^{\theta_{j}^{T} x_{i}}}{\sum_{l=1}^{k} e^{\theta_{l}^{T} x_{i}}}
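A minimal numeric sketch of this mapping (the score values, standing in for $\theta_{j}^{T} x$, are made up). Subtracting a constant from all scores leaves the result unchanged, which is exactly the parameter-redundancy property:

```python
import numpy as np

def softmax(scores):
    """Map k raw scores theta_j^T x to k class probabilities that sum to 1."""
    e = np.exp(scores - scores.max())  # shifting by a constant changes nothing
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])     # made-up values of theta_j^T x for k = 3
p = softmax(scores)
```

Subtracting the max inside `softmax` is the standard numerical-stability trick, and it is legal precisely because of the shift invariance shown in the derivation below.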
  • The above formula can be visually interpreted as follows:



  • In general, Softmax Regression has the property of parameter redundancy: since $P(Y=1 \mid x)+P(Y=2 \mid x)+\ldots+P(Y=k \mid x)=1$, we have $P(Y=1 \mid x)=1-P(Y=2 \mid x)-\ldots-P(Y=k \mid x)$. Suppose we subtract a vector $\psi$ from every parameter vector $\theta_{j}^{T}$, so that each $\theta_{j}^{T}$ becomes $\theta_{j}^{T}-\psi \ (j=1,2, \ldots, k)$. The prediction function then becomes:


    \begin{aligned} P\left(Y=y_{j} \mid x ; \theta\right) &=\frac{e^{\left(\theta_{j}^{T}-\psi\right) x}}{\sum_{i=1}^{k} e^{\left(\theta_{i}^{T}-\psi\right) x}} \\ &=\frac{e^{\theta_{j}^{T} x} \times e^{-\psi x}}{\sum_{i=1}^{k} e^{\theta_{i}^{T} x} \times e^{-\psi x}} \\ &=\frac{e^{\theta_{j}^{T} x}}{\sum_{i=1}^{k} e^{\theta_{i}^{T} x}} \end{aligned}
  • It can be seen from the above that subtracting $\psi$ from $\theta_{j}^{T}$ has no effect on the hypothesis function's prediction. In particular, when the number of classes $k$ is 2:


    h_{\theta}(x)=\frac{1}{e^{\theta_{1}^{T} x}+e^{\theta_{2}^{T} x}}\left[\begin{array}{l} e^{\theta_{1}^{T} x} \\ e^{\theta_{2}^{T} x} \end{array}\right]
  • Using the parameter redundancy property, we subtract $\theta_{1}$ from all the parameters above, and the expression becomes:


\begin{aligned} h_{\theta}(x) &=\frac{1}{e^{0 \cdot x}+e^{\left(\theta_{2}^{T}-\theta_{1}^{T}\right) x}}\left[\begin{array}{c} e^{0 \cdot x} \\ e^{\left(\theta_{2}^{T}-\theta_{1}^{T}\right) x} \end{array}\right] \\ &=\left[\begin{array}{c} \frac{1}{1+e^{\theta^{T} x}} \\ 1-\frac{1}{1+e^{\theta^{T} x}} \end{array}\right] \end{aligned}
  • where $\theta=\theta_{2}-\theta_{1}$. As can be seen, the simplified formula is exactly consistent with logistic regression. Thus binary logistic regression can indeed be regarded as the special case of multinomial logistic regression with $k=2$.

  • Back to GBDT's multi-class treatment:

    • Just as the logistic regression model underlies GBDT's binary classification, the Softmax model underlies GBDT's multi-class classification, as follows:

    \begin{array}{c} P(y=1 \mid x)=\frac{e^{F_{1}(x)}}{\sum_{i=1}^{k} e^{F_{i}(x)}} \\ P(y=2 \mid x)=\frac{e^{F_{2}(x)}}{\sum_{i=1}^{k} e^{F_{i}(x)}} \\ \ldots \ldots \\ P(y=k \mid x)=\frac{e^{F_{k}(x)}}{\sum_{i=1}^{k} e^{F_{i}(x)}} \end{array}
    • Here $F_{1} \ldots F_{k}$ are $k$ ensembles of different CART regression trees; each training round actually trains $k$ trees to fit the negative gradient of each branch of the Softmax model. The single-sample loss function of the Softmax model is:

    \text { loss }=-\sum_{i=1}^{k} y_{i} \log P\left(y_{i} \mid x\right)=-\sum_{i=1}^{k} y_{i} \log \frac{e^{F_{i}(x)}}{\sum_{j=1}^{k} e^{F_{j}(x)}}
    • where $y_{i}\ (i=1 \ldots k)$ is the one-hot encoding of the sample's label over the $k$ classes: if the sample belongs to class $i$, then $y_{i}=1$ and the rest are 0. From this expression it is not difficult to derive:

    -\frac{\partial loss}{\partial F_{i}}=y_{i}-\frac{e^{F_{i}(x)}}{\sum_{j=1}^{k} e^{F_{j}(x)}}=y_{i}-p\left(y_{i} \mid x\right)
    • So each of the $k$ trees also fits the difference between the sample's true label and its predicted probability, closely mirroring the binary GBDT procedure.
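scikit-learn's GradientBoostingClassifier implements this scheme: for $k$ classes it fits $k$ regression trees per boosting round, one per Softmax branch. A quick check on the 3-class iris data (the hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 random_state=0)
clf.fit(X_tr, y_tr)

# estimators_ has shape (n_estimators, k): one regression tree per class
# per boosting round, matching the k-branch Softmax scheme above.
n_rounds, k = clf.estimators_.shape
acc = clf.score(X_te, y_te)
```

The `(50, 3)` shape of `estimators_` makes the "k trees per round" structure directly visible.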



GBDT vs. BDT

  • Although the gradient boosting tree (GBDT) is an improved version of the boosting decision tree (BDT, Boosting Decision Tree), we can also regard BDT as the special case of the gradient boosting tree where the loss function is the squared error loss, so that each decision tree in BDT only needs to fit the residual of the previous step.

  • However, using the residual (squared error) as the loss actually has a major drawback: it is too sensitive to outliers. Let's look at an example:

  • Obviously, subsequent models would pay too much attention to the fourth value, which is not a good sign. Therefore, regression losses generally replace the squared loss with the absolute loss or the Huber loss function:
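To make the outlier sensitivity concrete, here is a small numeric sketch (the residual values are made up): under squared loss the gradient magnitude equals the residual, so one outlier dominates the next tree's fitting targets, while the Huber gradient is clipped at $\delta$:

```python
import numpy as np

# Residuals on five samples; the fourth is an outlier (hypothetical numbers).
residuals = np.array([0.5, -0.3, 0.2, 8.0, -0.1])
delta = 1.0

# Gradient magnitude per sample:
# squared loss: |d/df 1/2 (y-f)^2| = |r|, grows linearly with the residual
grad_squared = np.abs(residuals)
# Huber loss: quadratic inside |r| <= delta, linear outside, so clipped
grad_huber = np.where(np.abs(residuals) <= delta, np.abs(residuals), delta)

# Share of the total gradient contributed by the outlier under each loss.
share_squared = grad_squared[3] / grad_squared.sum()
share_huber = grad_huber[3] / grad_huber.sum()
```

Under squared loss the outlier accounts for most of the total gradient, whereas under Huber loss its influence is bounded, which is exactly why Huber (or absolute) loss is preferred when outliers are present.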










If you have any questions, please leave a message.

Finally, if you’re interested in Python, data mining, machine learning, and more, please check out my blog.