XGBoost Overview

XGBoost: simply put, a model that combines a number of base learners, such as CART decision trees. Boosting is a classic ensemble learning method, and XGBoost is a powerful tool widely used in industry and in competitions.

Ensemble learning is a method that combines several base learners to improve generalization ability and robustness. There are two main ensemble strategies:

  1. Bagging: train base learners independently and (weighted) average their predictions, e.g. Random Forests;
  2. Boosting: train base learners sequentially, each one correcting the errors of the preceding ones, e.g. AdaBoost, GBDT and XGBoost;

(Note: there are also stacking methods, but stacking is usually considered more of a fusion strategy.)

Base learner: the decision tree

Decision trees are nonlinear, have strong fitting ability, and can be adjusted quickly by pruning, so ensemble learning usually chooses decision trees as base learners. (Note: the base learner in XGBoost can be a CART regression tree (gbtree) or a linear model (gblinear); this article focuses on the tree model.)

A decision tree is a simple machine learning method for regression/classification. It is composed of if-then decision rules arranged in a tree, with leaf nodes representing the final predicted value or category. Typical decision tree models are ID3, C4.5 and CART. A decision tree algorithm can be summarized in two stages:

  • Tree growth: build the tree recursively from top to bottom, a process of feature selection and splitting based on some criterion (information gain, information gain ratio, Gini index, squared error).

  • Tree pruning: a decision tree easily overfits the data, i.e. it grows into a model with an overly complex structure. A pruning algorithm reduces this complexity and thus the risk of overfitting.

The fundamental goal of decision tree pruning is to minimize a loss function of the form structural loss + empirical loss, commonly written as Loss = C(T) + α|T|, where C(T) is the empirical loss (model bias) on the training data, |T| is the number of leaf nodes, and α is a coefficient controlling model complexity. There are two basic pruning strategies: (1) Pre-pruning: during tree generation, limit the maximum depth, the number of leaf nodes, the minimum number of samples per node, etc., to avoid building an unnecessarily complex model; (2) Post-pruning: first generate a complete decision tree from the training set, then examine the non-leaf nodes from bottom to top on a validation set; if replacing the subtree rooted at a node with a single leaf (pruning) improves the generalization performance of the tree, i.e. makes the objective function above smaller, the subtree is replaced by a leaf node.
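As a hedged illustration (using scikit-learn's DecisionTreeRegressor rather than XGBoost itself, with made-up toy data), pre-pruning corresponds to growth limits such as max_depth and min_samples_leaf, while post-pruning corresponds to cost-complexity pruning, where ccp_alpha plays the role of the coefficient α above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (hypothetical): y = x^2 plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# Pre-pruning: limit depth and minimum samples per leaf while the tree grows.
pre_pruned = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10).fit(X, y)

# Post-pruning: grow a full tree, then prune with cost-complexity parameter
# alpha, which corresponds to the coefficient in Loss = C(T) + alpha * |T|.
full_tree = DecisionTreeRegressor()
path = full_tree.cost_complexity_pruning_path(X, y)   # candidate alpha values
post_pruned = DecisionTreeRegressor(ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```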

Regression trees: from CART to GBDT

The CART regression tree is a decision tree with a binary tree structure; GBDT, XGBoost and other gradient boosting methods use the CART regression tree as the base learner. The tree grows by choosing split features and split points according to the squared-error criterion: traverse all candidate split points of all features, minimize the objective, and select the split feature (j) and feature threshold (s) that give the optimal split, finally obtaining a regression tree.

For example, the tree node in the figure below is split on the feature (age): samples whose feature value is less than the threshold (20) go to the left subtree and the other samples go to the right subtree.
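A minimal sketch of this split search for a single feature (brute force over candidate thresholds; function and variable names are illustrative, and real implementations sort each feature once and scan it):

```python
import numpy as np

def best_split(x, y):
    """Find the threshold s on a single feature x that minimizes
    the total squared error of the two resulting children."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_s, best_err = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:          # no valid threshold between equal values
            continue
        left, right = y[:i], y[i:]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err, best_s = err, (x[i - 1] + x[i]) / 2
    return best_s, best_err

# Example: split on an "age" feature, as in the figure above.
age = np.array([12, 15, 18, 25, 30, 45], dtype=float)
y   = np.array([1.0, 1.2, 1.1, 2.0, 2.2, 2.4])
print(best_split(age, y))             # threshold lands between 18 and 25
```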

GBDT (Gradient Boosting Decision Tree): XGBoost improves efficiency on the basis of GBDT. The principle of GBDT's sequential learning can be divided into three steps:

  1. Initialize by fitting the first CART regression tree on the data set. Tree1 in the figure below learns to predict the true value y, and its prediction output is y1.
  2. Fit the next CART regression tree on the residual of the previous tree (the residual is the error between the predicted value and the true value; the key idea in GBDT is to use the negative gradient of the loss function as an approximation of the residual), and the following base learners are learned sequentially in the same way. Tree2 in the figure below learns on the negative gradient of Tree1's loss, i.e. on (y - y1).
  3. The final model prediction is the sum of the outputs of all the sequentially trained CART regression trees, as in the sketch after this list.
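A minimal GBDT sketch following those three steps, assuming squared-error loss (so the negative gradient is exactly the residual y - prediction) and using scikit-learn's DecisionTreeRegressor as the CART base learner; the names and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=50, lr=0.1, max_depth=3):
    """Sequentially fit CART regression trees on the residuals (negative
    gradient of the squared-error loss) of the current ensemble."""
    pred = np.full(len(y), y.mean())      # step 1: initial prediction
    trees = []
    for _ in range(n_trees):
        residual = y - pred               # step 2: negative gradient for L2 loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lr * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees, lr

def gbdt_predict(model, X):
    base, trees, lr = model
    return base + lr * sum(t.predict(X) for t in trees)   # step 3: sum the trees

# Toy usage on made-up data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
model = gbdt_fit(X, y)
print(np.mean((gbdt_predict(model, X) - y) ** 2))          # training MSE
```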

XGBoost (eXtreme Gradient Boosting)

XGBoost: similar to GBDT's sequential learning, Tree1 and Tree2 are learned as shown in the figure below, and the prediction is the sum of the outputs of Tree1 and Tree2.

The main differences between XGBoost and GBDT are as follows:

  • The loss function adds a regularization term Ω

The regularization term penalizes the number of leaf nodes and the weights of the leaf nodes to reduce overfitting. The loss function is shown in the figure below:
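For reference, the regularized objective as given in the XGBoost paper (reconstructed here since the figure is not reproduced):

```latex
\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
```

where T is the number of leaf nodes, w_j are the leaf weights, and γ, λ are regularization coefficients.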

  • A second-order Taylor expansion links tree growth directly to the loss function

XGBoost applies a second-order Taylor expansion to the loss function, so any custom loss that is twice differentiable can be used; the objective obj is approximated by the Taylor expansion terms.
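The second-order approximation from the paper (reconstructed; g_i and h_i are the first and second derivatives of the loss at the current prediction):

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big)
  + g_i f_t(\mathbf{x}_i) + \tfrac{1}{2} h_i f_t^{2}(\mathbf{x}_i) \Big] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}^{(t-1)}\big),\;
h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\big(y_i, \hat{y}^{(t-1)}\big)
```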

It is clear that the final objective function depends only on the first derivative g_i and second derivative h_i of the loss at each data point. Setting the derivative of obj with respect to the leaf weights to zero gives the optimal leaf weight w*:
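With G_j and H_j the sums of g_i and h_i over the instances falling into leaf j, the optimal weight (as derived in the paper) is:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
G_j = \sum_{i \in I_j} g_i,\;
H_j = \sum_{i \in I_j} h_i
```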

Substituting w* back into obj gives the value of the objective for a fixed tree structure:
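Which evaluates to (reconstructed from the paper):

```latex
\text{obj}^{*} = -\tfrac{1}{2}\sum_{j=1}^{T}\frac{G_j^{2}}{H_j + \lambda} + \gamma T
```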

Each term in the objective obj measures how much a leaf node contributes to the current model loss. Combining the terms for the candidate left and right children gives the expression for the split Gain:
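The gain of splitting a leaf into left (L) and right (R) children, as in the paper:

```latex
\text{Gain} = \tfrac{1}{2}\left[
  \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```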

During tree growth, this expression is used as the split criterion. For each feature, a left-to-right scan over its sorted values enumerates every candidate split point together with the gradient sums G_L and G_R; the Gain formula then scores each candidate split, the split with the maximum Gain is chosen, and after splitting, the leaf weights w* of the resulting nodes are computed.
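A hedged sketch of that scan for a single feature (following the exact greedy split-finding idea in the paper; array and function names are illustrative):

```python
import numpy as np

def best_split_gain(x, g, h, lam=1.0, gamma=0.0):
    """Scan the sorted values of one feature and return the threshold with
    maximum Gain, given per-sample gradients g and hessians h."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    GL = HL = 0.0
    best_gain, best_thr = -np.inf, None
    for i in range(len(x) - 1):
        GL += g[i]; HL += h[i]
        if x[i] == x[i + 1]:          # no valid threshold between equal values
            continue
        GR, HR = G - GL, H - HL
        gain = 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2
    return best_thr, best_gain

# For squared-error loss, g = prediction - y and h = 1 for every sample.
x = np.array([12., 15., 18., 25., 30., 45.])
y = np.array([1.0, 1.2, 1.1, 2.0, 2.2, 2.4])
g, h = (0.0 - y), np.ones_like(y)     # current prediction assumed to be 0
print(best_split_gain(x, g, h))
```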

  • Efficiency improvements at the feature granularity

One of the most time-consuming steps in decision tree learning is sorting the feature values to determine the optimal split point. XGBoost sorts the data in advance, before training, and saves it in a block structure that is reused in subsequent iterations. In addition, when a node is split, multiple threads compute the split gains of the different features in parallel, which greatly improves computational efficiency.
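As a hedged usage sketch (parameter names are from xgboost's scikit-learn wrapper; the data is made up), the regularization, split-gain threshold, and thread count discussed above map onto reg_lambda, gamma, and n_jobs:

```python
import numpy as np
from xgboost import XGBRegressor

# Hypothetical toy data.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

model = XGBRegressor(
    n_estimators=200,     # number of boosted trees
    max_depth=4,          # pre-pruning: limit on tree depth
    learning_rate=0.1,    # shrinkage applied to each tree's output
    reg_lambda=1.0,       # L2 penalty lambda on leaf weights
    gamma=0.0,            # minimum Gain required to make a split
    n_jobs=4,             # threads used for parallel split finding
)
model.fit(X, y)
print(model.predict(X[:5]))
```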

Resources

XGBoost paper

XGBoost slides (PPT)
