1/ What is GBDT
GBDT stands for Gradient Boosting Decision Tree. GB (Gradient Boosting) embodies the Boosting ensemble learning idea; other algorithms in this family include AdaBoost and XGBoost. DT is the decision tree, and in GBDT it is specifically a regression decision tree. So GBDT means decision trees trained under the Gradient Boosting ensemble learning framework.

An intuitive picture of Boosting: suppose we have 30 questions. On the first pass we answer 20 correctly and get 10 wrong; we take the 10 wrong ones and do them again, getting 6 right and 4 wrong; we then take those 4 wrong ones and do them again. After several such rounds we no longer make mistakes, so every round reduces the error, and the questions answered correctly across all rounds add up to the full 30.

The algorithm consists of many decision trees, and the conclusions of all the trees add up to the final answer; this is the additive model. GBDT uses CART regression decision trees because, under the Boosting framework, each new tree fits the error left by the previous trees, and that error is a continuous value, so a regression tree is required. The final model is a weighted combination of the base learners.
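To make the additive idea concrete, here is a toy numeric sketch (the stage outputs are made-up numbers chosen only to mirror the 30-questions analogy, not real tree outputs): each stage only has to correct what the previous stages got wrong, and the stage outputs are summed.

```python
# A toy illustration of the additive model: each stage fixes part of what
# the previous stages got wrong, and the per-stage outputs are added up.
target = 30
stage_outputs = [20, 6, 4]          # what each successive "tree" contributes
prediction = 0
for out in stage_outputs:
    prediction += out               # additive model: stage outputs are summed
    print("residual left:", target - prediction)
# residuals shrink 10 -> 4 -> 0, and the final prediction reaches the target
```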
2/ Why GBDT uses the CART regression decision tree instead of the CART classification decision tree
The decision tree used by GBDT is always the CART regression decision tree, whether the task is regression, binary classification, or multi-class classification. Why not the CART classification decision tree? Because GBDT uses the Boosting ensemble learning framework: each iteration fits the error left by the previous iteration, and that error is a continuous value, so a regression decision tree is needed.
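To see why the fitted target is continuous even for classification, here is a minimal sketch assuming scikit-learn is available (the learning rate, tree depth, and number of rounds are arbitrary illustration values): each round fits a regression tree to the residual y − p, which is real-valued rather than a 0/1 label.

```python
# A minimal sketch (not GBDT's full algorithm) showing why regression trees
# are needed even for a binary classification task: the quantity each new
# tree fits is a continuous residual, not a class label.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # binary labels 0/1

F = np.zeros(len(y))                # current additive score (log-odds), start at 0
for _ in range(3):                  # a few boosting rounds
    p = 1.0 / (1.0 + np.exp(-F))    # predicted probability
    residual = y - p                # negative gradient of log-loss: continuous values
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += 0.1 * tree.predict(X)      # small learning rate

print(residual[:5])                 # continuous values such as 0.47 or -0.46, not 0/1
```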
For the regression decision tree algorithm, the most important step is finding the best splitting point (that is, a particular value of a particular feature). The candidate splitting points of a regression tree include all values of all features. In a classification decision tree, the criterion for the best split is entropy (information gain, or the information gain ratio Gain_rate) or the Gini coefficient, both of which measure purity.
Feature selection in the ID3 algorithm is based on information gain, and in the C4.5 algorithm on the information gain ratio Gain_rate. CART is a binary tree and can be used for either classification or regression; when used for regression, feature selection is based on the sum of squared errors (the squared difference between the fitted value and the true value, i.e., MSE). In GBDT the loss function is the squared error, which measures the degree of fit well.
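As a small sketch of the contrast, here is how the two kinds of CART trees are configured in scikit-learn (the criterion names are scikit-learn's own; older versions call the regression criterion "mse" instead of "squared_error"):

```python
# Classification CART splits on purity (Gini coefficient or entropy);
# regression CART, the kind GBDT uses, splits on squared error.
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(Xc, yc)          # purity-based
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(Xr, yr)  # MSE-based
```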
3/ Advantages of GBDT
<1> The performance is genuinely good; <2> it can be used for both classification and regression; <3> it can rank and filter features. These strengths are a big part of why interviewers like to ask about this algorithm.
4/ Key points of GBDT
The key points of GBDT are roughly the following: <1> What is the GBDT algorithm flow? <2> How does GBDT select features? <3> How does GBDT construct features? <4> How is GBDT used for classification? <5> In what way does GBDT reduce error? <6> Why does GBDT work better than traditional LR? <7> How can GBDT training be sped up? <8> What are the parameters of GBDT and how are they tuned? <9> What problems come up when using GBDT in practice? <10> What are the advantages and disadvantages of GBDT?
5/ GBDT algorithm flow
Firstly, GBDT performs classification or regression by using an additive model and continuously reducing the error generated during training. Picture the GBDT training process like this: the leftmost part is the full training data set; the data then passes through one layer of base classifiers after another, each layer reducing the error left by the previous layer; each base classifier is given a weight, and together they form the complete model (the final result is a strong classifier).
GBDT iterates through multiple rounds; each round produces one base classifier, and each classifier is trained on the residual of the previous round. The base classifiers are generally required to be simple, with low variance and high bias, because the training process improves the accuracy of the final classifier by continuously reducing the bias. The base classifier is typically a CART decision tree (a binary tree), and because of the high-bias and simplicity requirements, each regression tree is kept shallow. The final overall classifier is the weighted sum of the base classifiers obtained in each round of training, which is the additive model mentioned above.
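Here is a minimal sketch of that training loop for the regression case, assuming scikit-learn's DecisionTreeRegressor as the base learner (the number of rounds, learning rate, and depth are arbitrary illustrative choices, and the squared-error residual stands in for the general negative gradient):

```python
# A minimal sketch of the GBDT training loop for regression: each round fits
# a shallow regression tree to the residual left by all previous rounds, and
# the final model is the initial value plus the weighted sum of all trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=50, learning_rate=0.1, max_depth=3):
    """Each round fits a shallow regression tree to the current residual."""
    init = y.mean()
    pred = np.full(len(y), init)              # initial constant prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                   # error left by previous rounds
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return init, trees

def gbdt_predict(init, trees, X, learning_rate=0.1):
    """The additive model: initial value plus the weighted sum of all trees."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```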
6/ How to select features to construct the decision tree
The base classifier of GBDT is by default the CART regression decision tree. Other weak classifiers can also be used, provided they have low variance and high bias, which is all the Boosting ensemble learning framework requires. Feature selection in GBDT is therefore really feature selection in the CART decision tree. So let's look at how a CART decision tree (a binary tree) is generated, i.e., how features are selected. Generating a CART tree is a process of feature selection: suppose we have M features in total; the first step is to pick a feature j as the first node of the binary tree (the root node), and then pick a split point s among the values of feature j. If a sample's value of feature j is smaller than s, it goes into one branch; if it is larger, it goes into the other branch. This builds one node of the CART tree, and the other nodes are built the same way. The key question is how to choose the feature j and the best split point s for it.
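A minimal sketch of one such node split, using the squared-error criterion from section 2 (a brute-force search over every value of every feature; real implementations are far more careful about efficiency):

```python
# Picking the feature j and split point s for one node of a CART regression
# tree: try every observed value of every feature and keep the (j, s) pair
# with the lowest total squared error of the two resulting halves.
import numpy as np

def best_split(X, y):
    best_j, best_s, best_err = None, None, np.inf
    n_samples, n_features = X.shape
    for j in range(n_features):                   # every feature
        for s in np.unique(X[:, j]):              # every observed value as a candidate
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # squared error of each half around its own mean
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s

# Example: feature 0 separates y cleanly, so it should be chosen.
X = np.array([[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]])
y = np.array([1.0, 1.2, 9.8, 10.0])
print(best_split(X, y))   # picks feature 0 with a split around 2.0
```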
7/ How to construct features
We can use GBDT to generate combined features: the path from a leaf up to the root is one combined feature. Industry practice is generally to feed these into logistic regression. As mentioned in my last blog post, logistic regression is itself suited to linearly separable data; if we want logistic regression to handle nonlinear data, one approach is to combine different features, which gives logistic regression the ability to fit nonlinear distributions. See the figure below:
The model consists of two trees. Tree1 uses age and "is male" as split nodes, and its leaf nodes output scores; Tree2 uses "uses a computer daily" as its split node, and its leaf nodes also output scores. A test sample falls into exactly one leaf of each tree, and the identity of those leaves (one per tree) is the combined feature extracted for that sample.
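A minimal sketch of this GBDT-plus-LR feature construction, assuming scikit-learn (apply() returns the index of the leaf each sample lands in for every tree; in practice the logistic regression would be fit on data held out from the GBDT training set):

```python
# GBDT leaves as combined features for logistic regression: each sample is
# encoded by which leaf it falls into in every tree, one-hot encoded, and
# the resulting sparse features are fed to LR.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=20, max_depth=3).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]            # shape (n_samples, n_trees): leaf index per tree

enc = OneHotEncoder()                      # one column per (tree, leaf) combination
X_leaf = enc.fit_transform(leaves)

lr = LogisticRegression(max_iter=1000).fit(X_leaf, y)   # LR on the combined features
print(lr.score(X_leaf, y))
```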