When I first looked into Xgboost, I read several blog posts and found their descriptions of how it works hard to follow and poorly organized, so I wrote this article for discussion.

Here's the approach of this primer: look at the big picture first, then go into the details. Diving straight into the formulas at the very beginning is, I think, inefficient and easily discourages people.

Let’s start with the decision tree

  • What is a decision tree? Suppose there is a group of people and I ask you to separate the men from the women. You divide the crowd into two groups by hair length: long hair is "female", short hair is "male". By splitting the crowd on a single indicator, "hair length", you have built a simple decision tree. For the formal details, search Baidu or Google.

  • What is the basis for a split? At this point you will surely ask: why split on "hair length"? Couldn't I split on "wears high heels", "has an Adam's apple", and so on? Of course you could! So there has to be a way to judge which split is better, and then you pick that one.

  • How do you evaluate and quantify how good a split is? How do you decide whether "hair length" or "Adam's apple"... is the best way to split, and how do you quantify its effect? Intuitively, if you split the population by some criterion, the higher the purity of the resulting groups, the better the split. For example, if you split the population into two groups and the "female" group is all women while the "male" group is all men, the effect is ideal; reality is rarely that tidy, so the closer a split gets to this situation, the better we consider it. There are accordingly many ways to quantify purity, such as information gain (ID3), information gain ratio (C4.5), the Gini index (CART), and so on (a small numeric sketch follows this list).

  • Other details, such as pruning, overfitting, pros and cons, parallelization, etc., can wait. The soul of the decision tree is already there: split the tree by some metric for classification/regression purposes (the example above is classification), always trying to make the purity as high as possible.
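To make the purity idea concrete, here is a tiny sketch of how the Gini index and information gain score the hair-length split above. The crowd and the split are made-up toy data of my own; real libraries compute these measures internally.

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_k^2 (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits (0 means perfectly pure)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(parent, left, right, impurity):
    """How much the chosen impurity measure drops after splitting parent into left/right."""
    n = len(parent)
    children = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - children

# Toy crowd: true labels, split by the "hair length" indicator.
crowd      = ["F"] * 4 + ["M"] * 4
long_hair  = ["F", "F", "F", "M"]   # one long-haired man ends up here
short_hair = ["M", "M", "M", "F"]   # one short-haired woman ends up here

print("Gini gain       :", round(split_gain(crowd, long_hair, short_hair, gini), 3))
print("Information gain:", round(split_gain(crowd, long_hair, short_hair, entropy), 3))
```

The higher the gain, the purer the two groups, which is exactly the "choose the best split" criterion described above.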

Talk about Xgboost’s tree building process

Xgboost is an ensemble of many CART regression trees.

  • Concept 1: regression trees vs. decision trees. Classification and regression are really the same kind of thing: the result of classification is a discrete value while regression produces a continuous one, but in essence both are mappings from features to a result/label. Having covered classification decision trees above, I trust that part is clear.

    What is a regression tree?

    The output (response value) of a classification tree is a class: whether a mushroom is poisonous or not, whether to go to the movies this weekend or not. The output of a regression tree is a numeric value; for example, the amount of a housing loan granted to someone is a specific figure, which could be any value between 0 and 1.2 million yuan.

    So you can no longer use information gain, information gain ratio, or the Gini index to decide how a tree node splits; instead you use a measure of prediction error, usually mean squared error, logarithmic error, and so on. And a node no longer represents a class but a number, so how is that number determined? Either as the mean of the samples in the node, or by optimization, as Xgboost does (see the sketch right after this list). For details, blog.csdn.net/app_1206201… covers it well.

  • Idea 2: boosting ensemble learning, in which a group of related decision trees make a joint decision. What does "related" mean? Take a sample [data -> label] of [(2, 4, 5) -> 4]. The first decision tree is trained on this sample and predicts, say, 3.3; when the second decision tree is trained, the same sample becomes [(2, 4, 5) -> 0.7]. In other words, the input samples of the next decision tree depend on the training and prediction of the previous trees.

    By contrast, in the random forest algorithm each decision tree is independent: each tree randomly draws a set of samples and a set of features from the data and is trained on its own, with no relationship whatsoever between the trees.

    So Xgboost is boosting-style ensemble learning; that much should now be easy to understand.

  • At this point we can feel the two key ingredients in building a regression tree: (1) on what basis the split points are chosen (for example, the minimum squared error / loss mentioned above); (2) what the predicted values of the nodes are after splitting (as mentioned above, one method is to take the mean of the actual values of the samples under a leaf node as its prediction; another is to use a computed, optimized value).
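Putting Concept 1 and Idea 2 together, here is a minimal sketch of boosting with depth-1 regression trees, in the spirit of the [(2, 4, 5) -> 4] then [(2, 4, 5) -> 0.7] example: each stump picks the split that minimizes squared error, its leaves predict the mean residual, and every later stump trains on what the previous ones missed. This is my own illustration, not Xgboost's actual algorithm (which optimizes the leaf values instead of taking means).

```python
import numpy as np

def fit_stump(x, target):
    """Depth-1 regression tree: choose the split with the smallest squared error;
    each leaf predicts the mean of its samples (the simple option from key point (2))."""
    best = None
    for s in np.unique(x)[:-1]:                       # candidate split points (key point (1))
        left, right = target[x <= s], target[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, w_left, w_right = best
    return lambda q: np.where(q <= s, w_left, w_right)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # one feature
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 5.8])          # continuous labels

prediction = np.zeros_like(y)
for _ in range(3):                                     # boost three stumps
    residual = y - prediction                          # what the earlier trees missed
    stump = fit_stump(x, residual)
    prediction += stump(x)                             # running ensemble prediction

print("ensemble prediction:", prediction.round(2))
print("true labels        :", y)
```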

It’s time to take a look at Xgboost

First of all, let's clarify our goal. We want to build K regression trees such that the prediction of the tree ensemble is as close to the true values as possible (accuracy) while generalizing as well as possible (the more essential part). From a mathematical point of view this is a functional optimization problem with multiple objectives. Look at the objective function:

$$\text{Obj} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i\big) \;+\; \sum_{k=1}^{K} \Omega(f_k), \qquad \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Here $l$ is the training loss, $T$ is the number of leaves of a tree, and the $w_j$ are the leaf values (the predicted values of the leaf nodes).

Intuitively, the objective asks that the prediction error be as small as possible, the number of leaf nodes be as small as possible, and the leaf values not be too extreme. (What does the last part mean? If a sample's label is 4, one tree ensemble might have the first regression tree predict 3 and the second predict 1, while another ensemble has one tree predict 2 and the other predict 2; we prefer the latter. Why? In the former case the first tree has learned too much, getting too close to 4 on its own, which carries a large risk of overfitting.)
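As a back-of-the-envelope check of that example, plug the two candidate pairs of leaf values into the $\frac{1}{2}\lambda\sum_j w_j^2$ part of the objective above (ignoring the other terms, which are equal since both ensembles predict exactly 4):

$$\frac{1}{2}\lambda\,(3^2 + 1^2) = 5\lambda \;>\; \frac{1}{2}\lambda\,(2^2 + 2^2) = 4\lambda$$

The (3, 1) ensemble pays a larger penalty for its more extreme leaf values, which is exactly the overfitting risk being penalized.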

OK, that sounds nice, but how do we achieve it? How does the objective function above connect to the actual parameters? Remember what we said the parameters of a regression tree are: (1) which feature to split each node on; (2) the predicted value of each node (which should not be computed in the crude take-the-mean way; we want something at least a bit more refined). The somewhat abstract formula above does not address these two things directly, so how does it address them indirectly?

Answer: a greedy strategy plus optimization (quadratic optimization, to be precise).

A plain-language explanation of the greedy strategy: at each decision point, optimize for the current objective only; bluntly, maximize the immediate benefit, a "short-sighted" strategy. For its pros and cons, look up the classic knapsack problem and so on.

How is the greedy strategy used here? You start with all the samples sitting on the first node. At this point, what is the node value w? We don't know; it has to be computed. For now the predicted value of every sample is w (a decision-tree node represents a class; a regression-tree node represents a predicted value). Plug in the samples' label values, and the loss function becomes

$$L(w) = \sum_{i=1}^{n} l\big(y_i, w\big) + \gamma \cdot 1 + \frac{1}{2}\lambda w^2$$

Pause. Do you see it? This is quadratic optimization! And what if the loss function is not quadratic in w? Then the second-order Taylor expansion comes into play: find a way to approximate it by a quadratic.
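Spelling that out (this is the standard second-order expansion from the Xgboost paper, with $g_i$ and $h_i$ the first and second derivatives of the loss with respect to the prediction, evaluated at the current prediction): for a single node holding all the samples,

$$L(w) \approx \sum_i \left( g_i w + \tfrac{1}{2} h_i w^2 \right) + \gamma + \tfrac{1}{2}\lambda w^2 + \text{const}
\quad\Longrightarrow\quad
w^{*} = -\frac{\sum_i g_i}{\sum_i h_i + \lambda},
\qquad
L(w^{*}) = -\,\frac{1}{2}\,\frac{\left(\sum_i g_i\right)^2}{\sum_i h_i + \lambda} + \gamma .$$

Setting the derivative with respect to $w$ to zero gives the optimal node value and the corresponding minimal loss; written per leaf, these are the expressions Xgboost actually uses.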

Next, select a feature to split the node into two, growing a weak little sapling. Then: (1) how is the feature used for splitting determined? The simplest way is brute-force enumeration: try them and keep the one that gives the best loss function value (we will come back later to how Xgboost parallelizes this brute-force enumeration). (2) How are the node values w established so that the loss function is minimized? Say it out loud: yes, by taking the optimum of a quadratic function (we derived this just above; finding the extremum of a quadratic is a fixed routine: set the derivative to 0 and read off the optimum).

So the rhythm is: pick one feature to split on and compute the minimum of the loss function, then pick another feature and compute its minimum loss... Enumerate them all, find the one that works best, split the tree, and you have a sapling.

While splitting, note that each node split only affects the loss contributed by the samples in that node, so to evaluate a split you only need to look at the samples of the node being split and compute the split's gain (the reduction in the loss function). An elegant formula for this is derived in the original paper; I don't want to type out the LaTeX here (a rough sketch follows below).

To study that formula, go here: matafight.github.io/2017/03/14/… .
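For the impatient, here is a rough sketch of that brute-force enumeration for a single feature, scoring each candidate split by the drop in the optimal loss derived above, i.e. $\text{gain} = \tfrac{1}{2}\big[\tfrac{G_L^2}{H_L+\lambda} + \tfrac{G_R^2}{H_R+\lambda} - \tfrac{(G_L+G_R)^2}{H_L+H_R+\lambda}\big] - \gamma$. This is my own paraphrase for illustration, not the library's code.

```python
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Enumerate the split points of one feature and return the threshold with the
    largest gain (reduction of the regularized loss) for the samples in this node."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()

    def leaf_score(G_part, H_part):
        return G_part ** 2 / (H_part + lam)

    best_gain, best_threshold = -np.inf, None
    G_left = H_left = 0.0
    for i in range(len(x) - 1):                        # split between x[i] and x[i + 1]
        G_left += g[i]
        H_left += h[i]
        if x[i] == x[i + 1]:
            continue                                   # equal values cannot be separated
        gain = 0.5 * (leaf_score(G_left, H_left)
                      + leaf_score(G - G_left, H - H_left)
                      - leaf_score(G, H)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i + 1]) / 2
    return best_threshold, best_gain

# With squared loss, g_i = prediction - label and h_i = 1 for every sample.
x = np.array([2.0, 4.0, 5.0, 8.0])
y = np.array([4.0, 4.0, 0.7, 0.7])
pred = np.zeros_like(y)
print(best_split(x, g=pred - y, h=np.ones_like(y)))
```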

Then keep splitting in the same way, forming one tree, then another, each time taking the best further split/tree on top of the previous prediction. Isn't that exactly a greedy strategy?!

Of course, this iterative loop must have stopping conditions. When do we stop?

(1) When the gain brought by a split is less than some threshold, we can cut that split. So it is not the case that every split improves the overall loss function (each new leaf has a cost); this is a bit of the pre-pruning idea. (Actually I have some doubt here: post-pruning is generally expected to work better than pre-pruning, at least for some complex problems; experts, please advise why the pre-pruning idea is used here. Of course, Xgboost also supports post-pruning.) The threshold is the coefficient γ of the number of leaf nodes T in the regularization term (please double-check this);

(2) When the tree reaches a maximum depth, stop growing it; this corresponds to the hyperparameter max_depth. Easy to understand: limiting depth guards against overfitting;

(3) When the sum of sample weights in a node falls below a set threshold, stop growing the tree there. This involves another hyperparameter, the minimum child weight sum min_child_weight, which is similar to, but not exactly the same as, GBM's min_child_leaf parameter. The rough idea is that if a leaf node would have too few samples, splitting is terminated; this, too, guards against overfitting;

(4) I think I have also seen a maximum number of trees as a condition... not certain.
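As a hedged illustration, the stopping and pruning conditions above correspond to parameters in the standard xgboost Python package roughly as follows (the parameter names are real, the values are arbitrary, and defaults may differ between versions):

```python
import xgboost as xgb

params = {
    "objective": "reg:squarederror",
    "gamma": 0.1,            # (1) minimum gain (loss reduction) required to make a split
    "max_depth": 6,          # (2) maximum depth of each tree
    "min_child_weight": 1,   # (3) minimum sum of sample weights (sum of h) in a child
    "eta": 0.3,              # shrinkage / learning rate, discussed among the highlights below
    "lambda": 1.0,           # L2 penalty on leaf values
}

# Assuming X and y are your training matrix and labels (not defined here):
# dtrain  = xgb.DMatrix(X, label=y)
# booster = xgb.train(params, dtrain, num_boost_round=100)   # (4) cap on the number of trees
```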

For the specific mathematical derivation (at least the node-split gain formula), refer to the introduction by the author (the paper's author, no less), which is very detailed: www.52cs.org/?p=429 . The link given earlier works just as well.

Question 1: in what order are nodes split? For example, after the first split, which leaf node is split first?

Answer: er, nodes at the same level are handled in parallel; for each leaf the algorithm works out whether and how to split it. Source:

Wenku.baidu.com/view/44778c…

Take a look at some of the highlights of Xgboost

  • The leaf values w are solved for as an optimum. This is a fresh idea: not an average, not a hand-made rule.

  • Regularization to prevent overfitting is baked directly into the loss function, as shown above.

  • Custom loss functions are supported; I hardly need to say more. As long as the loss can be Taylor-expanded (i.e., you can compute its first and second derivatives), you're in business (a minimal example of the API appears at the end of this list);

  • Boosting supports parallelization. This is the shining point of XGBoost, and its direct effect is fast training. The next tree in boosting depends on the training and prediction of the previous trees, so across trees things should be strictly serial. So think about it: where can parallelism come in?! Yes, in selecting the best split point, the enumeration is parallelized! (This is also said to be the most time-consuming stage of building a tree.)

    Attention: nodes of the same level can be processed in parallel. Specifically, within a node, the optimal split point is chosen by computing the gains of the candidate split points in parallel with multiple threads.

    For example, a feature like "single or not" is easy to split on and compute the gain for. But a feature like "monthly income" has many values, ranging from, say, 5K to 50K; you can't possibly try every value as a split point and compute its gain, right? (Say the monthly-income feature takes 1000 distinct values: do you use all 1000 as split candidates? Drawback 1: computation cost; drawback 2: leaf nodes with too few samples, i.e., overfitting.) Our usual habit is to divide the range into intervals. Then the question becomes: how do you choose the interval boundaries (split them evenly?) (a crude toy version appears at the end of this list).

    Method name: Weighted Quantile Sketch

    Remember the first derivative g_i and second derivative h_i of the loss function at each sample of the node to be split; they measure how the loss changes as the predicted value changes. For example, sort the samples by "monthly income" in ascending order, 5K, 5.2K, 5.3K, ..., 52K, and place dividing lines "income 1", "income 2", ..., "income J" such that, for each interval, (the sum of h over the samples in the interval) / (the sum of h over all samples) equals a fixed percentage ε (this is an approximation). The feature is then divided into roughly 1/ε split candidates.

    The mathematical form (I'll be lazy again; typing it in LaTeX is a real pain):

$$r_k(z) = \frac{\sum_{(x, h) \in D_k,\; x < z} h}{\sum_{(x, h) \in D_k} h}, \qquad \big| r_k(s_{k, j}) - r_k(s_{k, j+1}) \big| < \varepsilon$$

    where $D_k$ is the set of (feature value, h) pairs for feature k, $r_k$ is the h-weighted rank, and the $s_{k, j}$ are the candidate split points, giving roughly $1/\varepsilon$ candidates.

    Moreover, the paper also gives a version of this algorithm suited to distributed settings;

  • XGBoost also has a specially designed algorithm for sparse data. Suppose the i-th feature of a sample is missing; then it cannot be used to split that sample, so the sample is assigned to a default child node, and an algorithm is needed to work out which child that should be.

    The main idea of the algorithm: tentatively assign the feature-missing samples to the right subtree and then to the left subtree, iterating over split points using only the non-missing samples; compute the gain of each assignment and take the direction with the larger gain as the default direction for missing data. (At first glance, if three samples are missing the feature, wouldn't there be 8 assignment combinations, an exponential number? On closer reading, it seems the split itself is found using only the non-missing samples (please confirm or correct me if you know better); then the first missing sample is put on the left to compute the loss and on the right for comparison, the second and third samples likewise... so perhaps this part can even run in parallel??);

  • Post-pruning is also possible;
  • Cross-validation makes it easy to choose the best parameters, with early stopping: for example, if you find that 30 trees already predict very well, there is no need to keep learning residuals, so you stop adding trees;

  • Row sampling and column sampling, the random-forest routine (to prevent overfitting);

  • Shrinkage: you can simply add up the leaf values of the regression trees to get the prediction, or you can weight them. For example, the first tree predicts 3.3 with label 4.0, so the second tree would normally train on 0.7; with a shrinkage factor of, say, 0.3, the residual the second tree trains on becomes 4.0 - 3.3*0.3 = 3.01, and so on. What is the point? It prevents overfitting: learning on this kind of "dampened residual" behaves rather like the learning rate in gradient descent;

  • Xgboost also supports setting sample weights. Since the weights act through the gradients g and h, this is rather like Adaboost: it focuses attention on certain samples.
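To close this list, two small sketches. First, the custom loss point: the standard xgboost Python API lets you pass an objective callback that returns the per-sample first and second derivatives (g and h). Below is a minimal version for squared error, purely to show the hook (in practice you would just use the built-in "reg:squarederror"):

```python
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    """Custom objective: gradient and hessian of 1/2 * (pred - label)^2."""
    labels = dtrain.get_label()
    grad = preds - labels           # first derivative g_i
    hess = np.ones_like(preds)      # second derivative h_i
    return grad, hess

# Assuming dtrain is an xgb.DMatrix built from your data:
# booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=50, obj=squared_error_obj)
```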
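Second, a crude toy version of the Weighted Quantile Sketch idea: place candidate split points so that each bucket carries roughly an ε share of the total second-derivative weight h. The real algorithm is an approximate, mergeable sketch designed for huge and distributed data; this in-memory imitation is only meant to show what the buckets look like.

```python
import numpy as np

def weighted_candidates(values, h, eps=0.25):
    """Pick split candidates so each bucket holds about an eps fraction of the total h."""
    order = np.argsort(values)
    values, h = values[order], h[order]
    rank = np.cumsum(h) / h.sum()            # the weighted rank r_k(z) from the formula above
    candidates, last_rank = [], 0.0
    for value, r in zip(values, rank):
        if r - last_rank >= eps:             # this bucket has accumulated enough h-weight
            candidates.append(value)
            last_rank = r
    return candidates

incomes = np.array([5.0, 5.2, 5.3, 7.0, 9.5, 12.0, 20.0, 35.0, 52.0])   # "monthly income", in K
h = np.ones_like(incomes)                    # with squared loss every h_i is 1
print(weighted_candidates(incomes, h, eps=0.25))   # roughly 1/eps buckets
```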

Take a look at Xgboost’s engineering optimizations

Since I have no hands-on experience with this part, what follows is based on the paper and on blog interpretations, so I am not entirely sure of it; take it as reference only.

  • Column Block for Parallel Learning. In short: the data is split by column and stored pre-sorted in ascending order. This makes parallelism convenient and also avoids blowing up memory by reading all samples in at once (a toy illustration follows this list).

    Because the data is stored by column and all columns can be accessed at the same time, the split-finding algorithm can be run for all features simultaneously, parallelizing split finding. Multiple blocks can also store different subsets of samples and be evaluated in parallel, giving parallelism within a feature as well.

  • Blocks for Out-of-core Computation

    When the data is large, it is divided into multiple blocks stored on disk, and a separate thread reads the data while the main computation runs. However, because disk I/O is slow, reading usually cannot keep up with the computation.

    For row indexes, only the first index value is stored in full; each subsequent entry stores just the offset from that first index, using 16 bits per offset, so a block holds at most 2^16 samples.
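A toy illustration of the column-block idea (my own simplification; the real layout is a compressed sparse column format): each feature column is argsorted once up front, and every later split search just walks the stored order instead of re-sorting.

```python
import numpy as np

# X: n samples x m features. Sorting each column is done once, ahead of training.
X = np.array([[25.0,  5.0],
              [40.0,  9.5],
              [31.0,  7.0],
              [52.0, 20.0]])

sorted_index = [np.argsort(X[:, j]) for j in range(X.shape[1])]   # one "block" per feature

# Later, scanning feature j in ascending order is just an index lookup,
# and different features (or different blocks) can be scanned by different threads.
j = 1
for i in sorted_index[j]:
    print("sample", i, "feature", j, "value", X[i, j])
```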

Comparison with GBDT and deep learning

My first impression of Xgboost is: it guards against overfitting and supports all kinds of distributed/parallel training, which is why it is widely reputed to be a killer tool with both good accuracy (the ensemble-learning pedigree) and high training efficiency (the distributed support). Compared with deep learning, it is less demanding about sample size and feature/data types, and so has a wide range of applications.

Now a word about GBDT. There are two versions of the description. One says GBDT is an iterative residual tree, where each tree learns the residual of the previous N-1 trees. The other says GBDT is a gradient-iteration tree, solved by gradient descent, where each tree learns the gradient-descent value of the previous N-1 trees' prediction. It is said that the former is a special case of the latter when the loss is the squared error. To give my understanding with the same example: after the first tree is formed there is a predicted value ŷ_i, while the true value (label) is y_i; the first version says the next regression tree learns from the sample (x_i, y_i - ŷ_i), while the second says you compute the negative gradient of the loss function at the first tree's prediction and use it as the new label, which corresponds to -g_i in xgboost.
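The "special case" claim is easy to check: with squared loss the negative gradient is exactly the residual, while for a general loss the two differ (and Xgboost, as derived above, keeps the second derivative as well):

$$l(y, \hat{y}) = \tfrac{1}{2}\,(y - \hat{y})^2 \quad\Longrightarrow\quad -\,\frac{\partial l}{\partial \hat{y}} = y - \hat{y}.$$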

Here’s a real question:

Does the next tree in Xgboost fit the residual, the negative gradient, or something built from the first plus second derivatives? In other words, the residual version of GBDT fits trees to input samples (x_i, y_i - ŷ_i), the gradient version fits (x_i, -g_i); which way does Xgboost go?

Different machine-learning models suit different types of tasks. Deep neural networks, by modelling spatial and temporal structure, capture high-dimensional data such as images, speech, and text very well. Tree-based XGBoost, on the other hand, handles tabular data very well and has properties deep neural networks lack (e.g. interpretability, insensitivity to the scaling of input data, ease of parameter tuning). Both kinds of model are important and widely used in data-science competitions and in industry. Tree boosting is used by nearly every company that does machine learning, and XGBoost has already made a big impact on industry.

References / blog posts

(Some may need a VPN)
Original paper - arxiv.org/pdf/1603.02…
Blog 1 - blog.csdn.net/a358463121/…
Blog 2 - blog.csdn.net/sb19931201/…
Blog 3 - blog.csdn.net/xiaocong199…
Blog 4 - blog.csdn.net/totoro1745/…
Blog 5 - matafight.github.io/2017/03/14/…
Reference 6 - wenku.baidu.com/view/44778c…