XGBoost: a Kaggle competition killer
Note: If some pictures do not display properly and affect reading, please refer to the article here: XGBoost question-bank version. Date: March 25, 2019.
0 Foreword
XGBoost has long been touted as a magic weapon in the competition world: every now and then in a Kaggle/Tianchi contest, someone uses XGBoost to beat a thousand rivals and take the top spot.
XGBoost is also covered in our machine learning courses. As Mr. Han puts it: “RF and GBDT are the industry’s favorite models, and XGBoost is a big killer package. Kaggle’s various top rankings were once dominated by XGBoost, and the first-place improvement in another Didi competition owed no small credit to XGBoost.”
In addition, starting in the first half of 2016, the July online company began organizing students to participate in various competitions, so they could grow through real contest projects (after all, it is impossible to do AI without hands-on practice, and working through data processing, feature selection, model tuning, and code tuning in a competition is an excellent opportunity for real combat). This is very helpful both for improving one’s ability and for finding or changing jobs.
Under the AI trend this year, many friends have moved from traditional IT posts into the AI industry, and many have asked how to make the switch. I usually emphasize four pillars for learning AI or finding/changing into an AI job: courses + question bank + OJ + Kaggle/Tianchi. Graduation exams, including those of the intensive training camps, are increasingly integrated with Kaggle or Tianchi competitions.
Considering the importance of the Kaggle/Tianchi contests for data science, this article introduces XGBoost to help you get a quick start and excel in competition.
Finally, I (July) didn’t invent XGBoost, but I’ll make sure this article is the most accessible introduction to it (it has been reviewed by Dr. Chen, head of the July online AI Lab). Thanks also to all the references listed at the end of this article; if you have any questions, please feel free to leave a comment.
1 Decision Trees
For example, there are more than 100 trainees in a training camp. Suppose you were given a task and asked to count the number of boys and girls. When one by one trainees came on stage and stood in front of you, how would you tell who was male and who was female?
Soon, you consider that boys’ hair is generally very short and girls’ hair is generally long, so you divide all the students in this class into two groups according to the length of hair. The students with long hair are “women” and the students with short hair are “men”.
It’s as if you’ve divided the class by one metric, “hair length,” and built a simple decision tree based on hair length. At this point, some people may have different opinions: why divide by “hair length”? I could divide by “whether the shoes are high heels,” “whether there is an Adam’s apple,” and so on. The answer is: certainly you can.
But which indicator is better? Intuitively we want the division that separates the classes best, and to use it first. Therefore, an evaluation criterion is needed to quantify the effect of a classification.
How do we judge whether “hair length” or “Adam’s apple” divides best, and how do we quantify the effect? Intuitively, the higher the purity of each group, the better the effect: if the “women” group is all women and the “men” group is all men, the effect is best. In practice the split is rarely that clean, so the closer we get to that ideal, the better we consider the division to be.
There are many ways to quantify the effect of classification, such as information gain (ID3), information gain rate (C4.5), Gini coefficient (CART), and so on.
The measure of information gain: entropy
The core idea of ID3 algorithm is to use information gain to measure attribute selection, and select the attribute with the largest information gain after splitting.
What is information gain? To define it precisely, we first define a metric widely used in information theory called entropy, which characterizes the purity of a sample set. Given a sample set S containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is:

Entropy(S) = −p+ log2(p+) − p− log2(p−)

In the formula, p+ is the proportion of positive examples and p− the proportion of negative examples; as in the second example at the beginning of this article, the positive examples are “play badminton” and the negative examples “do not play badminton” (in all entropy calculations we define 0·log2(0) as 0).
For example, if S is a set of 14 samples about Boolean concepts, including 9 positive examples and 5 negative examples (we use the notation [9+, 5-] to summarize such data samples), then the entropy of S relative to this Boolean example is:
Entropy([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
So, according to the above formula, we can get:
- If all members of S belong to the same class, Entropy(S)=0;
- If the number of positive and negative samples of S is equal, Entropy(S)=1;
- If the number of positive and negative samples of S is not equal, the entropy is between 0 and 1
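The three bullet points above are easy to verify with a few lines of code. Below is a minimal sketch (the function name `entropy` is my own, not from any library) that computes the entropy of a set from its positive/negative counts:

```python
import math

def entropy(pos, neg):
    """Entropy of a sample set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:  # by convention, 0 * log2(0) is defined as 0
            h -= p * math.log2(p)
    return h

h_mixed = entropy(9, 5)    # the [9+, 5-] set from the text, about 0.940
h_pure = entropy(14, 0)    # all members in one class -> entropy 0
h_even = entropy(7, 7)     # equal positives and negatives -> entropy 1
```

Running it reproduces the 0.940 figure for the [9+, 5−] set and confirms the two boundary cases.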
As shown below:
See, from the value of Entropy you can evaluate how pure, and therefore how good, the current split of the classification tree is.
For more details on pruning, overfitting, and the pros and cons of decision trees, see the post Decision Tree Learning.
So now the soul of the decision tree is in place: split the tree on some index for classification/regression purposes, always aiming for purity as high as possible.
2 Regression Trees and Ensemble Learning
If you define XGBoost in one sentence, it’s easy: XGBoost is an ensemble of many CART trees. But what is a CART tree?
There are two main types of decision trees used in data mining or machine learning:
- Classification tree analysis is where the predicted result is the class the data belongs to (for example, whether to watch a movie or not)
- Regression tree analysis is where the predicted result is a real value (such as the price of a house, or a patient’s length of stay in hospital).
The term CART (Classification And Regression Tree) is the general term for the above two kinds of trees, first proposed by Breiman et al.
2.1 Regression Trees
In fact, classification and regression are two closely related problems. The goal of classification is to determine, from some characteristics of known samples, which known class a new sample belongs to, and the result is a discrete value; the result of regression is a continuous value. Of course, the essence is the same: a mapping between features and results/labels.
Once you understand what classification and regression are, it is not difficult to understand classification trees and regression trees.
The sample output (i.e., response value) of a classification tree is a class: for example, whether a life-saving medicine is real or fake, or whether to go to see the movie “Wind Curse” at the weekend. The sample output of a regression tree is a numerical value: for example, the amount of a housing loan granted to someone is a specific value, anywhere between 0 and 3 million yuan.
Therefore, for regression trees you can no longer use the classification tree’s information gain, information gain rate, or Gini coefficient to decide node splitting; you need a new way to evaluate the split, such as prediction error (commonly mean squared error, logarithmic error, etc.). And a node is no longer a category but a numerical (predicted) value. So how is that value determined? Sometimes it is the mean of the samples in the node; sometimes it is obtained by optimization, as in XGBoost.
A CART regression tree is a binary tree built by repeatedly splitting on features. For example, suppose the current node is split on the j-th feature at value s: samples whose feature value is less than s go to the left subtree, and those greater than s go to the right subtree.
A CART regression tree essentially partitions the sample space along the feature dimensions, and optimizing such a partition is an NP-hard problem, so decision tree models solve it heuristically. The objective function a typical CART regression tree minimizes is the squared error within each region:

Σ_{xi∈Rm} (yi − f(xi))²

Therefore, solving for the optimal splitting feature j and the optimal splitting point s is transformed into solving this objective function:

min_{j,s} [ min_{c1} Σ_{xi∈R1(j,s)} (yi − c1)² + min_{c2} Σ_{xi∈R2(j,s)} (yi − c2)² ]

where R1(j,s) and R2(j,s) are the left and right regions produced by the split, and c1, c2 are their output values (the in-region sample means).
So as long as we traverse all the split points of all features, we can find the optimal splitting feature and split point, and end up with a regression tree.
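As a concrete illustration of this traversal, here is a minimal sketch of the (j, s) search (all names are my own; the node values c1, c2 are in-node means, as described above):

```python
def best_split(X, y):
    """Exhaustive search for the feature j and threshold s minimizing the total
    squared error, with node predictions c1, c2 taken as in-node means."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None  # (error, feature index j, threshold s)
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] < s]
            right = [y[i] for i, row in enumerate(X) if row[j] >= s]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, j, s)
    return best

# toy data: feature 0 separates the targets cleanly at threshold 8
X = [[1], [2], [8], [9]]
y = [1.0, 1.2, 9.0, 9.2]
err, j, s = best_split(X, y)
```

The cost of this exhaustive scan is O(features × candidate split points × samples), which is exactly why the later sections care about doing it with a single sorted left-to-right sweep.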
2.2 Boosting ensemble learning
So-called ensemble learning refers to constructing multiple (weak) classifiers to predict on the data set, and then using some strategy to integrate the multiple predicted results into the final prediction. A popular metaphor is “three cobblers with their wits combined equal one Zhuge Liang,” or the voting of directors on a company’s board. It requires each weak classifier to have a certain “accuracy,” and there must be “diversity” among the classifiers.
Ensemble learning can be divided into Boosting and Bagging according to whether or not each weak classifier has dependencies.
- Boosting, which has a dependency among each classifier, must be sequential, such as Adaboost, GBDT(Gradient Boosting Decision Tree), and Xgboost
- Bagging is a genre in which each classifier is independent of each other and can run in parallel, such as Random Forest.
The famous Adaboost is the most representative method in Boosting, and this blog has introduced it in detail.
AdaBoost, an abbreviation of Adaptive Boosting, was proposed by Yoav Freund and Robert Schapire in 1995. Its adaptivity lies in the fact that samples misclassified by the previous basic classifier are given more weight, and all the re-weighted samples are used to train the next basic classifier. A new weak classifier is added in each round until a predetermined small error rate or a predetermined maximum number of iterations is reached.
Specifically, the entire Adaboost iterative algorithm consists of three steps:
- Initialize the weight distribution of training data. If there are N samples, each training sample is initially given the same weight: 1/N.
- Train weak classifiers. In the specific training process, if a sample point has been accurately classified, its weight will be reduced in the construction of the next training set. Conversely, if a sample point is not accurately classified, its weight is increased. Then, the sample set with updated weights is used to train the next classifier, and the whole training process goes on iteratively.
- The trained weak classifiers are combined into strong classifiers. After the training process of each weak classifier, the weight of the weak classifier with small classification error rate is increased to make it play a larger decisive role in the final classification function, while the weight of the weak classifier with large classification error rate is reduced to make it play a smaller decisive role in the final classification function. In other words, the weak classifier with low error rate has a larger weight in the final classifier, otherwise it has a smaller weight.
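The three steps above can be sketched numerically. The helper below is my own simplified formulation (not a full AdaBoost implementation): it performs one round of the weight update, using the standard classifier weight α = ½·ln((1−ε)/ε):

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost round: given sample weights and a boolean list saying whether
    the weak classifier got each sample right, return the classifier weight alpha
    and the re-normalized sample weights."""
    eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error rate
    alpha = 0.5 * math.log((1 - eps) / eps)                  # classifier weight
    # misclassified samples are up-weighted, correct ones down-weighted
    new_w = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(new_w)
    return alpha, [w / z for w in new_w]

# 5 samples, equal initial weights 1/N; the weak classifier misclassifies sample 4
w0 = [0.2] * 5
alpha, w1 = adaboost_round(w0, [True, True, True, True, False])
```

Running this, the misclassified sample’s weight grows from 0.2 to 0.5 while each correctly classified one shrinks to 0.125, so the next classifier focuses on the hard sample.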
The other boosting method, GBDT (Gradient Boosting Decision Tree), differs from AdaBoost in that each round of GBDT aims to reduce the previous round’s residual, building a new model in the direction in which the residual decreases (the negative gradient).
A boosting ensemble is made up of several correlated decision trees. For example:
- A sample [data -> label] is: [(2,4,5)-> 4]
- The first decision tree trained with this sample had a prediction of 3.3
- Then the sample used to train the second decision tree becomes: [(2,4,5) -> 0.7]
- In other words, the input sample of the next decision tree will be relevant to the training and prediction of the previous decision tree
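The residual chain in the list above can be traced with a few lines (a toy sketch of my own; each “tree” is reduced to just its prediction for this one sample):

```python
def boosting_targets(y_true, tree_preds):
    """Trace how the training target shrinks toward zero after each tree:
    the next tree always fits the residual left by the trees so far."""
    targets = [y_true]
    for pred in tree_preds:
        targets.append(targets[-1] - pred)  # next tree's target is the residual
    return targets

# the sample [(2,4,5) -> 4]: tree 1 predicts 3.3, so tree 2's target is 0.7
t = boosting_targets(4.0, [3.3, 0.7])
# t is approximately [4.0, 0.7, 0.0]
```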
You’ll soon realize why Xgboost is also a Boosting ensemble.
The key point for the formation of a regression tree is:
- What splitting points are divided according to (e.g., minimum mean square error (LOSS) mentioned above);
- What is the predicted value of a node after splitting? (As mentioned above, one method is to take the mean of the samples under the leaf node as the leaf’s predicted value; another is a computed, optimized value, as in XGBoost.)
As for the other type of ensemble learning, such as the Random Forest algorithm: each decision tree is independent, and each randomly selects a subset of samples and a subset of features for independent training, with no relationship between the trees. This article will not cover it further.
3 GBDT
To talk about XGBoost, we have to start with GBDT (Gradient Boosting Decision Tree), because XGBoost is essentially a GBDT that strives for maximum speed and efficiency, hence the X (Extreme) in its name: eXtreme Gradient Boosting. Both are boosting methods, as described above.
The principle of GBDT is very simple: the results of all the weak classifiers are summed to form the predicted value, and the next weak classifier then fits the gradient/residual of the error function at the current prediction (this gradient/residual is the gap between the predicted value and the true value). And, of course, the weak classifiers in GBDT are individual trees. As shown in the figure: Y = Y1 + Y2 + Y3.
Take a very simple example, for example, I am 30 years old this year, but the computer or model GBDT does not know how old I am, then GBDT what to do?
- It fits a rough age with the first weak classifier (the first tree), say 20 years, and finds the error is 10 years;
- Then the second tree fits the remaining gap with 6 years, leaving a gap of 4 years;
- Then the third tree fits the remaining gap with 3 years, leaving a gap of only 1 year;
- Finally, the fourth tree fits the remaining residual with 1 year. Perfect.
In the end, the four trees sum to the real age of 30. In practical engineering, GBDT computes the negative gradient and uses the negative gradient to approximate the residual.
Notice: why can GBDT approximate the residual with the negative gradient?
In a regression task, GBDT has a predicted value F(x_i) for each sample in each round of iteration; take the loss function to be the squared-error loss:

l(y_i, F(x_i)) = (1/2) (y_i − F(x_i))²

Then the negative gradient is computed like this:

−∂l(y_i, F(x_i)) / ∂F(x_i) = y_i − F(x_i)

Therefore, when the squared loss is chosen, the value fitted in each round is (true value − value predicted by the current model), i.e., the residual. The variable here is F(x_i), the current model’s predicted value, and the negative gradient with respect to it is exactly the residual.
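A quick numerical check of this claim, using the age example above (true age 30, current prediction 20): the negative gradient of the squared loss, computed by finite differences, comes out equal to the residual y − F(x) = 10. The function names below are mine:

```python
def squared_loss(y, f):
    """Squared-error loss l(y, F) = 0.5 * (y - F)^2."""
    return 0.5 * (y - f) ** 2

def negative_gradient(y, f, eps=1e-6):
    """Numerical -dl/dF via central differences."""
    return -(squared_loss(y, f + eps) - squared_loss(y, f - eps)) / (2 * eps)

y_true, f_pred = 30.0, 20.0
ng = negative_gradient(y_true, f_pred)
residual = y_true - f_pred  # the negative gradient matches this residual
```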
In addition, a long-winded aside: the “rough” value in the first step of the age prediction above only looks casual. Think about it and you will find that most prediction models follow this regular routine: first predict with a rough value, then compare the gap between the predicted and true values, and keep adjusting to close the gap. So you have an objective function that identifies the target, and a loss function that reduces the error.
Further thinking, you will find that this is completely consistent with the common sense and common practice of human prediction. When we do not know much about a thing, we will try and make preliminary exploration according to experience until we approach or completely coincide with it in a certain sense.
Again, age prediction.
For simplicity, suppose there are only four people in the training set: A,B,C, and D, whose ages are 14,16,24, and 26 respectively. A and B are students of senior one and senior three respectively. C and D are fresh graduates and employees who have worked for two years.
So, the question now is how do we predict the ages of these four people? It’s easy to fit them with a random age, say 20, and then adjust them accordingly.
If a traditional regression decision tree is used for training, the following results will be obtained:
Now we use GBDT to do this. Due to the lack of data, we limit the number of leaf nodes to two, that is, each tree has only one branch and only two trees to learn.
We get something like this:
In the first tree, as shown in Figure 1, since A and B are close in age, and C and D are close in age, they are split into a left and a right group, and the average age of each group is used as the predicted value.
- The residual is computed (residual means: A’s actual value − A’s predicted value = A’s residual), so A’s residual is its actual value 14 − its predicted value 15 = −1.
- Notice: A’s predicted value is the sum of all the trees before it. Here there is only one tree, so it is simply 15; if there were other trees, they would all be added up as A’s predicted value.
Residual in mathematical statistics is the difference between the actual observed value and the estimated value (fit value). Residuals contain important information about the underlying assumptions of the model. If the regression model is correct, we can treat the residual as an observation of the error.
Then the residuals of A,B,C and D are -1,1, -1,1 respectively.
Then replace the original values of A, B, C, and D with their residuals −1, 1, −1, 1, and go to the second tree to learn. The second tree has only the two values 1 and −1, split directly into two nodes: A and C on the left, B and D on the right. After calculation (for A: actual −1 − predicted −1 = residual 0; likewise for C: actual −1 − predicted −1 = 0), everyone’s residual is now 0.
The residuals are all 0, which is equivalent to the predicted value of the second tree being equal to their actual value. Then the real age can be obtained by adding the conclusions of the second tree to the first tree, that is, everyone gets the real predicted value.
In other words, the predicted values of A, B, C, and D are now consistent with their real ages. Perfect!
A: 14-year-old senior-one student who often asks upperclassmen questions; predicted age A = 15 − 1 = 14.
B: 16-year-old senior-three student who often gets asked questions by underclassmen; predicted age B = 15 + 1 = 16.
C: 24-year-old fresh graduate who shops a lot and often asks the senior colleague questions; predicted age C = 25 − 1 = 24.
D: 26-year-old employee with two years of work who shops a lot and often gets asked questions by the junior colleague; predicted age D = 25 + 1 = 26.
Therefore, GBDT accumulates the scores of multiple trees to obtain the final predicted score; in each iteration, a new tree is added on top of the existing trees to fit the residual between the preceding trees’ prediction and the true value.
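The four-person age example can be reproduced end to end in a few lines (a sketch of my own, with the tree structures hard-coded exactly as described above rather than learned):

```python
# true ages of A, B, C, D
ages = [14, 16, 24, 26]

# tree 1: A and B fall in the left leaf, C and D in the right; leaf value = group mean
tree1 = [15, 15, 25, 25]
residuals = [a - p for a, p in zip(ages, tree1)]  # targets for tree 2

# tree 2 fits the residuals: A and C in the left leaf (-1), B and D in the right (+1)
tree2 = [-1, 1, -1, 1]

# final prediction = sum of both trees' leaf scores, recovering the true ages
final = [p1 + p2 for p1, p2 in zip(tree1, tree2)]
```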
4 XGBoost
4.1 Definition of XgBoost tree
The schematic diagrams in this section are mostly taken from the lecture PPT of XGBoost’s original author, Tianqi Chen.
For example, suppose we want to predict how much each member of a family likes computer games. Considering that, compared with the old, the young are more likely to love video games, and that, compared with women, men like video games more, we first split by age to separate children from adults, then split by gender into male and female, assigning each person a score for how much they like video games, as shown in the figure below.
In this way, two trees, tree1 and tree2, are trained. Similar to the GBDT principle before, the sum of the two trees’ conclusions is the final conclusion, so the child’s predicted score is the sum of the scores of the leaf nodes the child falls into in the two trees: 2 + 0.9 = 2.9. Grandpa’s predicted score likewise: −1 + (−0.9) = −1.9. The details are shown in the following figure.
Well, you may want to stand up and exclaim: isn’t this exactly the same as the GBDT introduced above?
In fact, leaving aside some differences in engineering implementation and problem solving, one of the biggest differences between XGBoost and GBDT is the definition of the objective function. XGBoost’s objective function is shown below:
Among them
- The red arrow points to the loss function L (for example, the squared loss l(y_i, ŷ_i) = (y_i − ŷ_i)², or the logistic loss l(y_i, ŷ_i) = y_i ln(1 + e^{−ŷ_i}) + (1 − y_i) ln(1 + e^{ŷ_i}))
- The red boxes are the regularization terms (L1 regularization, L2 regularization)
- The red circle is the constant term
- For the loss, XGBoost uses a Taylor expansion to three terms as an approximation
We can clearly see that the final objective function only depends on the first and second derivatives of each data point on the error function.
Well, many people might be blindsided by such a big formula appearing suddenly. OK, now let’s take the objective function apart and break down the meaning of each term, each symbol, and each subscript.
4.2 Xgboost objective function
The core algorithmic idea of XGBoost is not hard. Basically:

- Keep adding trees, and keep doing feature splits to grow each tree. Adding one tree at a time amounts to learning a new function f_t(x) to fit the residual of the last prediction. The model’s prediction is the sum of K regression trees: ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F. (Note: w_{q(x)} is the score of leaf node q; F is the space of all regression trees, and f(x) is one of them.)
- When training is complete and we have K trees, to predict the score of a sample we use its features: in each tree the sample falls into one leaf node, and each leaf node corresponds to a score.
- Finally, we just add up each tree’s score to get the sample’s predicted value.
Obviously, our goal is to make the tree ensemble’s predicted value ŷ_i as close as possible to the true value y_i, while having the greatest possible generalization ability.
Therefore, from a mathematical point of view this is a functional optimization problem, and the objective function can be written as:

Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)
As you can see, the objective function is divided into two parts: the loss function and the regularization term. The loss function reveals the training error (i.e., the gap between predicted score and true score), and the regularization term defines model complexity.
For the equation above, ŷ_i is the output of the whole additive model, and the regularization term Σ_k Ω(f_k) is a function representing the complexity of the trees: the smaller its value, the lower the complexity and the stronger the generalization ability. Its expression is:

Ω(f) = γT + (1/2) λ Σ_{j=1}^{T} w_j²
Here T is the number of leaf nodes and w is the vector of leaf scores. Intuitively, the objective asks for a prediction error as small as possible, as few leaf nodes T as possible (γ controls the number of leaves), and leaf values w that are not extreme (λ keeps the leaf scores from getting too large), to prevent overfitting.
By the way, a general objective function contains the following two terms:

Obj(Θ) = L(Θ) + Ω(Θ)
Among them, the error/loss function encourages our model to fit the training data as much as possible, so that the final model will have less bias. Regularization terms encourage simpler models. When the model is simple, the randomness of the results obtained by fitting finite data is relatively small and it is not easy to overfit, which makes the prediction of the final model more stable.
4.2.1 Model learning and training error
Specifically, the first part of the objective function, Σ_{i=1}^{n} l(y_i, ŷ_i), sums the loss over the samples, where l(y_i, ŷ_i) denotes the error of the i-th sample; of course, our goal is to make it as small as possible.
Similar to the previous GBDT routine, XGBoost also needs to accumulate the scores of multiple trees to get the final predicted score (in each iteration, a tree is added on the basis of the existing tree to fit the residual between the predicted result of the previous tree and the real value).
But how do we choose which f to add in each round? The answer is pretty straightforward: pick the f_t that makes our objective function as low as possible.
And again, given that the model’s predicted value at round t = the model’s prediction over the previous t−1 rounds + the new tree, i.e. ŷ_i^(t) = ŷ_i^(t−1) + f_t(x_i), the objective function is written as:

Obj^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t) + constant

where the trailing term is the regularization term (plus a constant).
In this error function, at step t, y_i is the true value and hence known; ŷ_i^(t−1) can be computed from step t−1 by adding up the previous trees, so in some sense it is also a known value. What the model is learning is therefore f_t(x_i).
The expression of Obj may be too abstract, so consider the case where l is the squared error. Our goal can then be written as a quadratic function (the circled part in the figure represents the residual between the predicted value and the true value):

Obj^(t) = Σ_{i=1}^{n} ( y_i − (ŷ_i^(t−1) + f_t(x_i)) )² + Ω(f_t) + constant
        = Σ_{i=1}^{n} [ 2(ŷ_i^(t−1) − y_i) f_t(x_i) + f_t(x_i)² ] + Ω(f_t) + constant
More generally, what if the loss function is not quadratic? Use a second-order Taylor expansion as the approximation (as you will see, we define the first derivative g_i = ∂_{ŷ^(t−1)} l(y_i, ŷ^(t−1)) and the second derivative h_i = ∂²_{ŷ^(t−1)} l(y_i, ŷ^(t−1))).
Yes, look out! Many people get stuck here, and few articles online explain it. After discussion with Dr. Chen of the July online AI Lab, we found the key is to make clear the correspondence between the second-order Taylor expansion and XGBoost’s objective function, i.e., exactly how the second-order Taylor expansion is used to approximate the objective.
First of all, this is the second-order Taylor expansion:

f(x + Δx) ≈ f(x) + f′(x)Δx + (1/2) f″(x)Δx²

It corresponds to XGBoost’s objective function

Obj^(t) = Σ_{i=1}^{n} l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t) + constant

Ignoring the fact that the loss function l also takes y_i as an argument (don’t forget: “at step t, y_i is the true value, i.e., known,” so it does not affect derivatives of the objective with respect to ŷ^(t−1)), make the following one-to-one correspondence:

- The x in the Taylor expansion of f corresponds to ŷ_i^(t−1) in the objective function.
- The Δx in f corresponds to f_t(x_i) in the objective function.
- So the derivatives of f with respect to x correspond to the partial derivatives of l with respect to ŷ_i^(t−1).

We get:

Obj^(t) ≈ Σ_{i=1}^{n} [ l(y_i, ŷ_i^(t−1)) + g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t) + constant

where g_i = ∂_{ŷ^(t−1)} l(y_i, ŷ^(t−1)) and h_i = ∂²_{ŷ^(t−1)} l(y_i, ŷ^(t−1)).
Oh my god! But where did this key second-order Taylor expansion come from in the first place?
In mathematics, Taylor’s formula uses information about a function at a single point to describe its values nearby. It comes from Taylor’s theorem in calculus, which states that if a function is differentiable and smooth enough, then given the derivative values of the function at some point, a polynomial with those derivative values as coefficients can be constructed to approximate the function’s values in a neighborhood of that point. This polynomial is called the Taylor polynomial.
It tells us that we can approximate the original function by using some of the subterms of the Taylor polynomial.
Taylor’s theorem: let n be a positive integer. If a function f defined on an interval containing a is differentiable n+1 times at a, then for any x on the interval:

f(x) = f(a) + f′(a)(x − a) + f″(a)/2! (x − a)² + … + f⁽ⁿ⁾(a)/n! (x − a)ⁿ + Rₙ(x)

The polynomial part is called the Taylor expansion of the function at a, and the remainder Rₙ(x) of Taylor’s formula is a higher-order infinitesimal of (x − a)ⁿ.
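A quick sanity check of the theorem for n = 2, the case used above: the sketch below (names are my own) compares f(x+Δx) against the quadratic Taylor polynomial for the smooth function f(v) = log(1 + eᵛ), whose first derivative is the sigmoid and whose second derivative is sigmoid·(1−sigmoid):

```python
import math

def taylor2(f, df, d2f, x, dx):
    """Second-order Taylor approximation: f(x+dx) ~ f(x) + f'(x)*dx + 0.5*f''(x)*dx^2."""
    return f(x) + df(x) * dx + 0.5 * d2f(x) * dx ** 2

# smooth test function: f(v) = log(1 + e^v); f' = sigmoid; f'' = sig * (1 - sig)
f = lambda v: math.log(1.0 + math.exp(v))
sig = lambda v: 1.0 / (1.0 + math.exp(-v))
d2f = lambda v: sig(v) * (1.0 - sig(v))

x, dx = 0.5, 0.1
approx = taylor2(f, sig, d2f, x, dx)
exact = f(x + dx)
# for a small step dx, the quadratic approximation is accurate to roughly 1e-5
```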
Next, consider that our t-th regression tree is derived from the residuals of the previous t−1 trees, which means the value ŷ_i^(t−1) is known. In other words, l(y_i, ŷ_i^(t−1)) is a constant that does not affect the optimization of the objective function and can be removed directly, along with the other constant term, giving the following relatively unified objective function:

Obj^(t) ≈ Σ_{i=1}^{n} [ g_i f_t(x_i) + (1/2) h_i f_t²(x_i) ] + Ω(f_t)
In this case, the objective function depends only on each data point’s first derivative g_i and second derivative h_i of the error function. (You can already see a difference between XGBoost and GBDT here: the objective function keeps the second-order term of the Taylor expansion.)
The general idea is as reader Crab6789 put it:

In essence, assigning samples to leaves corresponds to an Obj value, and the optimization process is Obj optimization: splitting nodes into different leaf combinations, with different combinations corresponding to different Obj. All optimization revolves around this idea.
So far we have discussed the first part of the objective function: the training error. Next we discuss the second part: the regularization term, which defines the tree’s complexity.
4.2.2 Regular terms: Tree complexity
First, a few notations:

- The tree is represented by its leaf node set and leaf node scores: f_t(x) = w_{q(x)}
- Each sample falls on exactly one leaf node
- q(x) indicates which leaf node sample x falls on, and w_{q(x)} is that node’s score, i.e., the model’s predicted value for the sample
So when we split the tree into a structure part q and a leaf-weight part w, the structure function q maps an input to a leaf index, and w gives the leaf score for each index.
In addition, as shown in the figure below, XgBoost’s tree complexity consists of two parts:
- One is the number of leaves in the tree, T
- The other is the squared L2 norm of the leaf scores w (L2 regularization on w is equivalent to adding L2 smoothing to each leaf’s score, to avoid overfitting)
Under this new definition, we can transform the previous objective function as follows (don’t forget: f_t(x_i) = w_{q(x_i)} and Ω(f_t) = γT + (1/2)λ Σ_j w_j²):

Obj^(t) ≈ Σ_{i=1}^{n} [ g_i w_{q(x_i)} + (1/2) h_i w²_{q(x_i)} ] + γT + (1/2) λ Σ_{j=1}^{T} w_j²
        = Σ_{j=1}^{T} [ (Σ_{i∈I_j} g_i) w_j + (1/2)(Σ_{i∈I_j} h_i + λ) w_j² ] + γT
where I_j = { i | q(x_i) = j } is defined as the set of indices of the samples on leaf node j, g is the first derivative, and h is the second derivative. This step is due to the regularization in the second part of XGBoost’s objective: one term for the number of leaf nodes (T), and one for the leaf scores (w).
Thus, the objective with the regularization term added contains two kinds of summation:

- one over samples: Σ_{i=1}^{n} … (n is the sample size)
- one over leaves: Σ_{j=1}^{T} … (T is the number of leaf nodes)
This objective contains T independent univariate quadratic functions.
What’s the key to understanding this derivation? After discussion with Dr. Chen of the AI Lab, it really comes down to understanding this definition: I_j = { i | q(x_i) = j }, the set of indices of the samples on leaf node j. In this definition, q(x_i) maps each sample x_i to a leaf node of the tree, and it is through this that the two kinds of summation are unified.
We can then define:

G_j = Σ_{i∈I_j} g_i,  H_j = Σ_{i∈I_j} h_i

so the final formula simplifies to:

Obj^(t) = Σ_{j=1}^{T} [ G_j w_j + (1/2)(H_j + λ) w_j² ] + γT

Treating w_j as the variable and setting the derivative to zero gives:

w_j* = − G_j / (H_j + λ)

Substituting this optimal solution back in, we obtain:

Obj* = − (1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT
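These two closed forms are one-liners in code. The sketch below (names are illustrative; the per-leaf sums G_j, H_j are made-up numbers, not from a real training run) computes the optimal leaf weights and the structure score:

```python
def leaf_weight(G, H, lam):
    """Optimal leaf score: w* = -G / (H + lambda)."""
    return -G / (H + lam)

def structure_score(leaves, lam, gamma):
    """Obj* = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T,
    where `leaves` is a list of (G_j, H_j) pairs, one per leaf."""
    return -0.5 * sum(G * G / (H + lam) for G, H in leaves) + gamma * len(leaves)

# made-up per-leaf gradient/hessian sums for a two-leaf tree
leaves = [(-2.0, 4.0), (3.0, 6.0)]
lam, gamma = 1.0, 0.1
weights = [leaf_weight(G, H, lam) for G, H in leaves]  # [0.4, -3/7]
score = structure_score(leaves, lam, gamma)
```

The smaller the structure score, the better the tree structure, which is exactly what the next section uses to compare candidate splits.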
4.3 Scoring function calculation
Obj represents how much we can reduce the objective, at best, once we fix the structure of a tree: the smaller it is, the better the structure. We call it the structure score.
4.3.1 Splitting a Node
It was interesting to see how XGBoost is optimized and calculated from start to finish, but we never saw what the tree actually looks like. Obviously, a tree is created by splitting a node into two parts, then splitting recursively into a whole tree. So how the tree splits is the key to what we discuss next.
As for how a leaf node splits, the XGBoost authors gave two node-splitting methods in their original paper.
(1) The greedy method of enumerating all different tree structures
Now the situation is that if you know the structure of the tree, you can get the best score for that structure, so how do you determine the structure of the tree?
A logical approach would be to enumerate the structures of different trees, then use a scoring function to find the tree with the best structure, then add it to the model and repeat the process. On second thought, you realize that there are so many states to enumerate that they are almost infinite. So what do you do?
We try the greedy method: start from tree depth 0, and at each node traverse all the features, such as age, gender, etc. For each feature, first sort the samples by the values of that feature, then do a linear scan over the feature to determine the best split point. Finally, across all features, choose the one whose split yields the highest Gain. So how is Gain calculated?
Remember what we got at the end of 4.2?
In other words, each term $\frac{G_j^2}{H_j + \lambda}$ in the objective function measures how much the corresponding leaf node contributes to reducing the current model’s loss. Combining the scores before and after a split gives the expression for Gain:

$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
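As a sanity check, the Gain formula is simple to compute directly. Here is a small Python sketch (the names are mine, chosen for readability):

```python
def gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Split gain: half of (score of left child + score of right child
    - score of the unsplit node), minus the complexity cost gamma of
    adding one more leaf."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# A split that separates opposite-signed gradients is rewarded:
# 0.5 * (9/3 + 9/3 - 0/5) - 0 = 3.0
g = gain(G_L=-3.0, H_L=2.0, G_R=3.0, H_R=2.0, lam=1.0, gamma=0.0)
```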
The first thing worth noting is the step “for a feature, sort the samples by that feature’s values”. For example, fix a candidate split value $a$ on some feature $x$ (say, age). With the samples sorted by age in increasing order from left to right, every sample with $x < a$ falls on the left and every sample with $x \ge a$ on the right. For this particular split $a$, we need to compute the sums of the first- and second-order derivatives on the left side and on the right side.
For example, there are five people in total, and they are sorted according to their age. At the beginning, we have the following four methods:
- Separate the first person from the next four
- Separate the first two from the last three
- Separate the first three from the last two
- Divide the front four from the back one
Then the Gain of each of the four splits above is calculated, and the split with the largest Gain is selected. After calculation, the second split, “the first two people versus the last three”, turns out to obtain the largest Gain, which means that split works best at this first level.
In other words, for every feature x, we can enumerate the gradient sums $G_L$ and $G_R$ of all split points with a single left-to-right scan, and then use the Gain formula to score each candidate split.
The second, third, fourth, ..., Nth layers are then split in the same way.
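The left-to-right scan can be sketched in a few lines of Python. This is a toy illustration with made-up gradient values, chosen so that the second split wins as in the five-person example; it is not the paper’s implementation:

```python
def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search on one feature: sort the samples by
    x, sweep left to right accumulating G_L and H_L, and keep the
    split with the highest Gain (value is None if nothing beats 0)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_value = 0.0, None
    for i in order[:-1]:  # the last position would leave the right side empty
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_value = gain, x[i]
    return best_value, best_gain

# Five people sorted by age, with toy gradients (h_i = 1 as for squared loss):
ages = [10, 20, 30, 40, 50]
grads = [-2.0, -2.0, 1.0, 1.5, 1.5]
hess = [1.0] * 5
split_age, split_gain = best_split(ages, grads, hess)
# The best split separates the first two people (ages 10, 20) from the rest.
```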
The second thing to note is that introducing a split does not necessarily make things better, which is why there is a penalty term for introducing a new leaf. This part of the objective corresponds to pruning: when the gain of a candidate split is less than the threshold γ, the split is ignored.
In other words, when a split is introduced but the score after splitting minus the score without splitting is too small (for example, below our minimum expected threshold γ), while the added complexity is too high, the split is not worth the cost and it is better not to split. When the benefit of an action does not clearly outweigh its downside, it is better not to act, so as to avoid complexity for little return.
Equivalently, when we find that “splitting” is no better than “not splitting” (the gain obtained is too small, smaller than the threshold γ), the candidate children are discarded and the node remains a leaf.
Here is the algorithm in the paper
(2) Approximation algorithm
When the data is too large to enumerate every split point exactly, the approximate algorithm proposes candidate split points according to percentiles of each feature’s distribution and evaluates Gain only at those candidates.
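To give the flavor of the idea, here is a deliberately simplified Python sketch. Real XGBoost uses weighted quantiles (with the second-order gradients $h_i$ as weights) plus a quantile sketch data structure, but plain quantiles already show the “propose candidates, then score only those” pattern:

```python
import statistics

def candidate_splits(values, n_buckets=4):
    """Propose candidate split points at the quantiles of a feature's
    distribution, instead of trying every distinct value. Gain is then
    evaluated only at these candidates. (Simplified: XGBoost proper
    uses h_i-weighted quantiles.)"""
    return statistics.quantiles(values, n=n_buckets)

ages = [10, 20, 30, 40, 50, 60, 70, 80]
cands = candidate_splits(ages)  # three interior cut points, four buckets
```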
Comment by Reader Crab6789 at Google:
Assigning samples from the root down to the leaves is a combinatorial arrangement: different assignments correspond to different costs, and finding the best one by trying every combination is impossible. That is why the greedy method exists: without looking from beginning to end, just make the best allocation at the current node. Hence the exact greedy method; later, to accelerate it, came the histogram-based method.
4.4 Summary: The Boosted Tree Algorithm
To sum up, see the diagram
Let’s go over the whole process again.
Suppose a sample’s label is 4. One pair of regression trees predicts 3 and then 1; another pair predicts 2 and then 2. We prefer the latter. Why? In the former case, the first tree has learned too much on its own and gets too close to 4, which carries a high risk of overfitting.
OK, that sounds nice, but how do you do that? How does this target function relate to the actual parameters? Remember we talked about the parameters of the regression tree:
- Which feature is selected to split the node
- The predicted value of the node (it shouldn’t just be a crude average of the samples; we can do something a bit more principled)
The final strategy is greed + optimization (yes, quadratic optimization).
The popular explanation of greedy strategy: it is the optimal decision in accordance with the current goal at the moment of decision, to put it bluntly, it is the decision to maximize immediate interests, “short-sighted” strategy.
At the beginning, you have a group of samples, all placed on the first node. At this point T = 1. What is w? We don’t know yet.
- If the error l(w, y_i) here is the squared error, then the objective above is a quadratic function of w; the w that minimizes it is the predicted value of this node, and the minimum of the function is the minimum loss.
- Essentially, this is a quadratic optimization problem! But what if the loss function isn’t quadratic? Taylor expansion: approximate a non-quadratic loss with a quadratic one.
Next, a feature should be selected to split the node into two, growing the stump into a small sapling, so we need to:
- Determine the splitting feature. How? The simplest way is brute-force enumeration (crude, but it works), then choose the feature that gives the best loss function;
- Determine each node’s w and the minimum loss. How? Say it out loud: yes, minimize the quadratic function (finding the extremum of a quadratic has a fixed routine, namely setting the derivative to 0). So: select a feature, split, and compute the resulting minimum of the loss function; select another feature, split, and compute its minimum; after enumerating them all, take the split with the best result, and the tree grows by one level.
- During splitting, notice that each split affects the loss function only through the samples on the node being split. So for each split, we only need to look at the samples of that node to compute the splitting gain (the reduction of the loss function).
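The “set the derivative to 0” routine in the bullets above is easiest to see with squared error. The sketch below (my own toy code, not XGBoost’s) solves the one-node case:

```python
def node_weight_squared_loss(y, lam=0.0):
    """For squared loss l(w, y_i) = (w - y_i)^2 on a single node, the
    objective sum_i (w - y_i)^2 + lam * w^2 is quadratic in w.
    Setting its derivative to zero gives w* = sum(y) / (n + lam):
    the plain mean when lam = 0, shrunk toward zero otherwise."""
    return sum(y) / (len(y) + lam)

w0 = node_weight_squared_loss([1.0, 2.0, 3.0])           # the mean, 2.0
w1 = node_weight_squared_loss([1.0, 2.0, 3.0], lam=3.0)  # shrunk to 1.0
```

This also makes concrete why averaging a node’s samples is only the special case of squared loss with no regularization.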
In summary, XGBoost follows the same idea as the CART regression tree, using a greedy algorithm to traverse all split points of all features, but with a different objective function. Concretely, the gain of a split is the objective value of the single leaf node minus the objective value after splitting. At the same time, to keep the tree from growing too deep, a threshold is added: splitting is carried out only when the gain exceeds this threshold.
That threshold is the γ appearing in the Gain formula.
Continuing to split like this, forming one tree and then another, each time building further on the best prediction so far: this is the greedy strategy.
This iterative loop needs a stopping condition. When should it stop? In short: limit the maximum depth of the tree and stop growing when the sample weight sum falls below a set threshold, to prevent overfitting. Specifically:
- When the gain brought by a candidate split is less than the set threshold, we ignore the split, so not every candidate split actually happens; this amounts to pre-pruning. The threshold parameter is γ, the coefficient of the number of leaf nodes T in the regularization term;
- When the tree reaches the maximum depth, the establishment of the decision tree is stopped, and a hyperparameter max_depth is set to avoid the over-fitting caused by learning local samples because the tree is too deep.
- When the sample weight sum of a node falls below the set threshold, tree construction stops there. This minimum sample weight, min_child_weight, is similar but not identical to GBM’s min_child_leaf parameter. The general idea is that a leaf with too few samples is also terminated, to prevent overfitting;
- Apparently a maximum number of trees can also be set…
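The stopping criteria above map directly onto XGBoost hyperparameters. The names below are real XGBoost parameters, but the values are arbitrary examples for illustration, not tuning recommendations:

```python
# Illustrative values only; tune per dataset.
params = {
    "gamma": 0.1,           # minimum split gain, the threshold discussed above
    "max_depth": 6,         # stop growing when the tree reaches this depth
    "min_child_weight": 1,  # stop when a node's sum of h_i falls below this
    "n_estimators": 100,    # maximum number of trees in the ensemble
}
```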
6 References and recommended reading
- Xgboost Original paper: arxiv.org/pdf/1603.02…
- Xgboost author handout PPT: homes.cs.washington.edu/~tqchen/pdf…
- XGBoost and Boosted Trees
- Xgboost principle: blog.csdn.net/a819825294/…
- Xgboost introduction and practice (principles): blog.csdn.net/sb19931201/…
- The CART Wikipedia
- Ensemble Learning
- Boosting and Random Forest: Ensemble learning
- From decision tree learning to Bayesian classification algorithm
- Decision tree (3) – complete summary (ID3, C4.5, CART, pruning, substitution) and source code analysis
- Read the machine learning killer XGBoost principle
- Why is GBDT and random Forest so good in the actual Kaggle competition?
- Popular, logical write an article about the principle of Xgboost for discussion
- Decision Trees, Random Forests, GBDT, XgBoost
- Some of xgBoost’s core parameters
- What is a popular way to understand Taylor series? : www.zhihu.com/question/21…
- Why Xgboost needs the second-order Taylor expansion: www.julyedu.com/question/bi…
- The most popular gradient ascending tree (GBDT) principle
- ID3, C4.5, CART, Random forest, Bagging, Boosting, Adaboost, GBDT, XGBoost algorithm Summary
- Derivation of XGBoost second order Taylor expansion formula
- Taylor’s formula, Wikipedia
- Xgboost notes by Super Mario, a student in July’s 6th session, explaining how XGBoost works
- Online editing LaTeX formula: private.codecogs.com/latex/eqned…
Afterword.
When you don’t understand something, chances are it’s not because you’re not smart enough. Chances are it’s because the information you read is not popular enough. (If it still doesn’t work, ask someone).
I hope you, reading this, feel the same way.
The improvement process of this paper is as follows for reference:
- 8.4 morning, first edition: GBDT is introduced with an easy-to-understand age-prediction example, since GBDT is the basis for understanding XGBoost;
- 8.4 afternoon, second edition: XGBoost’s derivation involves many formulas that beginners easily get lost in, so after pinning down XGBoost’s core, the objective function, the overall framework of XGBoost is combed out clearly;
- 8.5 morning, third edition: the introduction to decision trees is improved, for example by adding an introduction to information gain;
- 8.5 afternoon, fourth edition: the display of most formulas is improved, for example by changing them from plain text to LaTeX images;
- 8.6 morning, fifth edition: the introduction to boosting and ensemble learning is improved, making the full text more gradual;
- 8.6 evening, sixth edition: the formula notation of the XGBoost objective function is standardized, and many details and formulas throughout the text are combed through;
- 9.1 morning, seventh edition: the description of XGBoost tree splitting in section 4.3.1 is improved to make it clearer;
- 1.9 (2019), eighth edition: the description of split nodes in section 4.3.1 is improved, making the logic clearer and the writing more accessible;
- 1.10 (2019), ninth edition: Part 3 adds a predicted-age example to explain GBDT in a more accessible way;
- The correspondence between the second-order Taylor expansion and the XGBoost objective function is written out clearly, so the whole derivation reads more smoothly.
After reading the whole article, you will find that there are three keys to understanding XGBoost: (1) the one-to-one correspondence in section 4.2.1 between the terms of the XGBoost objective function and the Taylor expansion, (2) the mathematical trick in section 4.2.2 used to compute the score w of a tree, and (3) the tree-splitting algorithm shown in section 4.3.1. With these three points sorted out, there are no more obstacles to understanding XGBoost.
July, August 6, 2018 to January 14, 2019.