Original author: Jack Stack
Original text: zhuanlan.zhihu.com/p/81368182
Among non-deep-learning machine learning models, XGBoost and LightGBM, both based on the GBDT algorithm, perform extremely well and come up very frequently in campus-recruitment algorithm interviews. There is already a great deal of material on this topic, so this article is not original; it summarizes GBDT-related questions by drawing on many classic papers and explanatory articles.
-
The principle of XGBoost
I suggest reading Tianqi Chen's slides and the original paper directly. Because the theory is involved, you will need to go back and reread them from time to time.
-
Here’s how XGBoost works
XGBoost is an algorithm, or rather an engineering implementation, based on GBDT.
GBDT is an additive model built on the boosting ensemble idea. Training uses the forward stagewise algorithm for greedy learning: in each iteration, a CART regression tree is learned to fit the residuals between the predictions of the previous t-1 trees and the true values of the training samples.
The basic idea of XGBoost is the same as GBDT, but it adds a number of optimizations, such as default handling of missing values, use of second-order derivative information, regularization terms, column sampling, and parallel computation.
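As a rough illustration of the residual-fitting idea, here is a minimal gradient-boosting sketch for squared loss, assuming scikit-learn's DecisionTreeRegressor as the base learner (a toy version, not XGBoost's actual implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Toy GBDT for squared loss: each new tree fits the current residuals."""
    f0 = y.mean()                       # F_0: constant initial prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred             # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)           # learn a CART-style tree on the residuals
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

# Tiny usage example on synthetic data
X = np.random.rand(200, 3)
y = 2 * X[:, 0] + np.random.rand(200) * 0.1
f0, trees = gbdt_fit(X, y)
```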
-
Differences between XGBoost and GBDT:
- GBDT is a machine learning algorithm and XGBoost is an engineering implementation of this algorithm.
- When using CART as a base classifier, XGBoost explicitly adds regularization terms to control the model’s complexity, which helps prevent overfitting and thus improves the model’s generalization ability.
- GBDT uses only first-order derivative information of the cost function during training, while XGBoost performs a second-order Taylor expansion of the cost function and can use first- and second-order derivatives simultaneously. (Benefit: compared with GBDT's first-order expansion, XGBoost's second-order expansion approximates the true loss function more accurately. Note that the loss function must be twice differentiable.)
- Whereas traditional GBDT uses CART as a base classifier, XGBoost supports many types of base classifiers, such as linear classifiers.
- Whereas traditional GBDT uses all of the data in each iteration, XGBoost adopts a strategy similar to random forest and supports subsampling of the data (for example, column sampling).
- Traditional GBDT has no built-in handling of missing values, while XGBoost can automatically learn a strategy for handling them.
- Parallelization along the feature dimension. XGBoost pre-sorts each feature by its values in advance and stores the result as a block structure. When splitting a node, multiple threads can search for the best split point of each feature in parallel, which greatly speeds up training. (Several of these differences surface directly as hyperparameters; see the sketch after this list.)
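A hedged illustration of how some of these differences show up as hyperparameters in the standard xgboost Python package (parameter names come from that library; the values are only placeholders):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    booster="gbtree",        # base learner; "gblinear" switches to a linear booster
    reg_lambda=1.0,          # L2 regularization on leaf weights
    gamma=0.0,               # complexity penalty per additional leaf
    colsample_bytree=0.8,    # column (feature) subsampling, similar to random forest
    missing=float("nan"),    # missing values get a learned default split direction
    n_jobs=4,                # multithreaded split finding across features
)
```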
-
GBDT gradient-related questions
In GBDT, the gradient of what is taken, and with respect to what?
The gradient of the current loss function L(yi, F(xi)) with respect to the current model output F(xi) (see the sketch after these questions).
Given a data set with m samples and n-dimensional features, if we use LR (logistic regression), what is the dimension of the gradient?
The gradient has n dimensions for the weights w plus 1 dimension for the bias, so n+1 dimensions in total.
For an m-by-n data set, if we use GBDT, what is the dimension of the gradient? m? n? m*n? Or does it depend on the depth of the tree, or on the number of leaves in the tree?
I'm not sure yet; to be added later.
-
How does XGBoost parallelism work?
Since XGBoost is a boosting algorithm, the trees are trained serially and cannot be built in parallel. Parallelism here refers to parallelism along the feature dimension. Before training, each feature is pre-sorted according to its values across the samples and stored as a block structure, which can be reused repeatedly in the later search for split points. Because the features are already stored as blocks, the search for the optimal split point of each feature can be carried out in parallel over the blocks with multiple threads.
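A rough sketch of the pre-sorting idea (the names and data structures here are purely illustrative, not XGBoost's actual internals):

```python
import numpy as np

def build_blocks(X):
    # One "block" per feature: the sample indices sorted by that feature's value.
    # These sorted orders are computed once and reused for every split search.
    return [np.argsort(X[:, j]) for j in range(X.shape[1])]

X = np.random.rand(1000, 20)
blocks = build_blocks(X)
# Candidate split points for feature j can now be scanned in sorted order via
# blocks[j], and different features can be handled by different threads.
```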
-
What are the ways in which XGBoost prevents overfitting?
- Regularization is added to the objective function: a penalty on the number of leaf nodes plus L2 regularization of the leaf weights.
- Column subsampling: only a subset of the features is used in training.
- Row subsampling: each round can be trained without using all of the samples, similar to bagging.
- Early stopping. If there is no improvement on the validation set after a fixed number of iterations, stop the training process.
- Shrinkage: lower the learning rate and increase the number of trees, leaving more room for later training.
-
When training a model with XGBoost, how should the parameters be adjusted if it overfits?
- Control the complexity of the model, via max_depth, min_child_weight, and gamma.
- Add randomness so the model is less sensitive to noise during training, via subsample and colsample_bytree.
- Reduce the learning rate while increasing the number of estimators (see the sketch after this list).
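A hedged sketch of these tuning directions using the native xgboost training API with early stopping; the concrete values are illustrative, not recommendations:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=1000)
dtrain = xgb.DMatrix(X[:800], label=y[:800])
dval = xgb.DMatrix(X[800:], label=y[800:])

params = {
    "max_depth": 4,            # lower depth -> simpler trees
    "min_child_weight": 5,     # require more weight per leaf
    "gamma": 1.0,              # larger minimum gain to accept a split
    "subsample": 0.8,          # row subsampling adds randomness
    "colsample_bytree": 0.8,   # column subsampling adds randomness
    "eta": 0.05,               # lower learning rate ...
}
booster = xgb.train(
    params, dtrain,
    num_boost_round=2000,           # ... compensated by more boosting rounds
    evals=[(dval, "val")],
    early_stopping_rounds=50,       # stop if the validation score stops improving
)
```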
-
What are XGBoost's regularization terms?
A penalty on the number of leaf nodes plus L2 regularization of the leaf weights.
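Written out explicitly, as in the XGBoost paper, for a tree f with T leaves and leaf weights w_j the regularization term is:

```latex
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2
```

where gamma penalizes each additional leaf and lambda is the L2 coefficient on the leaf weights.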
-
Why is XGBoost insensitive to missing values?
Models that rely on measuring distances between samples, such as SVM and KNN, can end up with poor predictions if missing values are not handled properly.
Tree models, however, are not very sensitive to missing values and can be used in most cases where data is missing. The reason is that when a node splits, it searches for the optimal split point (a feature value) of some feature, and samples whose value of that feature is missing can simply be skipped. In other words, missing feature values in some samples have little influence on the search for the optimal split point.
In addition, XGBoost has some methods for handling missing values.
-
How does XGBoost handle missing values?
This is one of XGBoost's advantages. The specific handling is as follows:
- When searching for split points on a feature, samples with missing values for that feature are not traversed; only the feature values of the non-missing samples are scanned, which reduces the time cost of finding split points for sparse features.
- In addition, for completeness, the samples with missing values are tentatively assigned to the left and to the right child respectively, and the direction that yields the larger gain after splitting is chosen as the default branch direction for samples whose value of that feature is missing at prediction time.
- If there are no missing values in the training set but there are in the test set, the missing values are assigned to the right child by default. (See the sketch below.)
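A small sketch showing that the standard xgboost package accepts NaN feature values directly, so no manual imputation is needed (the data here is synthetic):

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan          # inject ~10% missing values
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)                                  # default split directions are learned per node
print(clf.predict(X[:5]))
```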
-
Conditions under which a tree in XGBoost stops growing
- When the tree reaches its maximum depth it stops growing, since an overly deep tree tends to overfit; this requires the hyperparameter max_depth.
- When the gain of a newly introduced split is less than 0, the split is abandoned. This is the trade-off between training loss and model structural complexity (see the formula after this list).
- When a split is introduced, the sums of sample weights in the newly generated left and right leaf nodes are recalculated. If either leaf's sum of sample weights falls below a certain threshold, the split is also abandoned. This involves the hyperparameter min_child_weight (the minimum sum of sample weights): if a leaf node would contain too few samples, the split is given up to keep the tree from becoming too fine-grained.
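For reference, the split gain used in the second condition can be written as in the XGBoost paper, where G and H are the sums of first- and second-order gradients in the left and right children and lambda, gamma are the regularization coefficients:

```latex
\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
```

The third condition corresponds to H_L or H_R (the "sums of sample weights") falling below min_child_weight.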
-
XGBoost can do feature selection. How does it rate the importance of features?
There are three importance measures in XGBoost that can be used to assess feature importance (illustrated in the sketch after this list):
- Weight: The total number of times this feature was used to split samples across all trees.
- Gain: The average gain generated by this feature in all trees in which it occurs.
- Cover: The average coverage of this feature across all trees in which it appears. Coverage here refers to the number of samples affected by the feature once it is used as a split point, that is, how many samples are routed into the two child nodes through that feature.
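A hedged sketch of reading the three importance types from a trained model through the xgboost Booster API (get_score and the importance_type values are from that library; the data is synthetic):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)
booster = xgb.train({"max_depth": 3}, xgb.DMatrix(X, label=y), num_boost_round=50)

for imp_type in ("weight", "gain", "cover"):
    print(imp_type, booster.get_score(importance_type=imp_type))
```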
-
Similarities and differences between random forest and GBDT
Similarities:
- Both are made up of multiple trees, and the final result is determined by the trees together.
Differences:
- Ensemble learning: RF follows the bagging idea, while GBDT follows the boosting idea.
- Bias-variance trade-off: RF keeps reducing the variance of the model, while GBDT keeps reducing the bias of the model.
- Training samples: in RF, each tree is trained on a bootstrap sample drawn with replacement from the full training set, while GBDT uses all samples in every iteration.
- Parallelism: RF trees can be built in parallel, while GBDT trees can only be built sequentially (each tree must wait for the previous one to be fully built).
- Final result: RF takes a majority vote over the trees (an average for regression), while GBDT takes a weighted combination of the trees.
- Data sensitivity: RF is not sensitive to outliers, while GBDT is sensitive to outliers.
- Generalization ability: RF is not easy to overfit, while GBDT is easy to overfit.
-
Difference between logistic regression (LR) and GBDT
- LR is a linear model with strong interpretability and easy parallelization, but its learning capacity is limited and it requires a great deal of manual feature engineering.
- GBDT is a nonlinear model with a natural advantage in feature combination and strong expressive power. However, the trees cannot be trained in parallel, and tree models overfit easily.
- For high-dimensional sparse data, GBDT is not as effective as LR.
-
How do you prune trees in XGBoost?
- A regularization term is added to the objective function: the number of leaf nodes and the squared L2 norm of the leaf weights are used to control the complexity of the tree.
- When a node is split, a threshold is defined; if the gain of the objective function after splitting is less than this threshold, the node is not split.
- When a split is introduced, the sums of sample weights in the newly generated left and right leaf nodes are recalculated. If either leaf's sum of sample weights falls below a certain threshold (the minimum sum of sample weights), the split is also abandoned.
- XGBoost first builds the tree from top to bottom until it reaches the maximum depth, and then checks from bottom to top whether any nodes fail to meet the splitting conditions, pruning them if so.
-
How does XGBoost select the best split point?
Before training, XGBoost pre-sorts each feature by its values and stores the result as a block structure, which can be reused later when nodes are split.
Feature-level parallelism can therefore be used: multiple threads compute the optimal split point of each feature in parallel. Based on the gain produced by each candidate split, the value of the feature with the maximum gain is finally selected as the optimal split point. However, if every sample is traversed when computing the optimal split point of each feature, the computational cost is very high; this global scan is not suitable for big-data scenarios. XGBoost therefore also provides an approximate histogram algorithm which, after sorting the features, keeps only a constant number of candidate split positions as candidate split points, greatly improving the efficiency of node splitting.
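In the xgboost package, the approximate/histogram behaviour is exposed through the tree_method and max_bin parameters (a hedged illustration; the values are placeholders):

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    tree_method="hist",   # histogram-based candidate splits instead of an exact scan
    max_bin=256,          # at most this many bins (candidate split points) per feature
    n_estimators=200,
)
```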
-
XGBoost is highly extensible.
Its extensibility shows in three aspects:
- Base classifier: besides CART decision trees, weak learners such as LR and linear models are supported.
- Objective function: custom loss functions are supported, as long as they are twice differentiable, because a second-order Taylor expansion is needed to obtain the general form of the objective function (see the sketch after this list).
- Learning method: the block structure supports parallelization and out-of-core computation.
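A minimal custom-objective sketch for the native xgboost training API: the objective returns the per-sample first- and second-order derivatives (gradient and hessian), which is exactly why the loss must be twice differentiable. The example reimplements squared error purely for illustration:

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: gradient and hessian of 1/2 * (pred - y)^2."""
    y = dtrain.get_label()
    grad = preds - y               # first-order derivative
    hess = np.ones_like(preds)     # second-order derivative
    return grad, hess

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=300)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=50, obj=squared_error_obj)
```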