This article covers the following topics: an introduction to XGBoost, its theoretical derivation, its engineering implementation, its advantages and disadvantages, a hands-on example, and common interview questions.
1. Introduction to XGBoost
XGBoost, which stands for eXtreme Gradient Boosting, is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. It is a massively parallel boosting-tree tool and currently one of the fastest and most widely used open-source boosting toolkits, commonly reported to be more than ten times faster than other popular implementations. In data science, a large number of Kaggle competitors use XGBoost for data-mining contests, where it is a must-have weapon. For industrial-scale data, the distributed version of XGBoost is widely portable and can run on Kubernetes, Hadoop, SGE, MPI, Dask, and other distributed environments, making it a good solution for large-scale industrial data. This article introduces XGBoost from its mathematical principles to its engineering implementation, then discusses the advantages and disadvantages of XGBoost, and finally covers common interview questions about it.
2. Theoretical derivation of XGBoost
2.1 Generate a tree starting from the objective function
Both XGBoost and GBDT are boosting methods. Apart from some differences in engineering implementation and problem handling, the biggest difference is the definition of the objective function. Therefore, this article explores the fundamentals of XGBoost starting from the objective function.
2.1.1 Learning the t-th tree
XGBoost is an additive model composed of $K$ base models:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
Assuming that the tree to be trained in the $t$-th iteration is $f_t(x)$, then:
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
2.1.2 The objective function of XGBoost
The loss function can be expressed in terms of the predicted value $\hat{y}_i$ and the true value $y_i$:
$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$
where $n$ is the number of samples.
We know that the prediction error of a model is jointly determined by its bias and its variance. The loss function above captures the bias; to keep the variance small as well, a regularization term is added to the objective to prevent overfitting. The objective function is therefore composed of the model's loss function plus a regularization term that suppresses model complexity, and is defined as follows:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where the sum of the complexities of all $K$ trees is added to the objective as the regularization term to prevent overfitting of the model.
Since XGBoost is a boosting algorithm, it follows forward stagewise additive modeling. Taking step $t$ as an example, the model's prediction for the $i$-th sample is:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$
where $\hat{y}_i^{(t-1)}$ is the prediction given by the model at step $t-1$, a known constant, and $f_t(x_i)$ is the prediction of the new tree to be added in this round. The objective function can then be written as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant}$$
Notice that the only variable in this expression is the $t$-th tree $f_t$; everything else is either known or computable from known quantities. Careful readers might ask where the constant term comes from. It comes from splitting the regularization term: since the structures of the first $t-1$ trees are already determined, the sum of their complexities can be treated as a constant, as shown below:
$$\sum_{k=1}^{t} \Omega(f_k) = \Omega(f_t) + \sum_{k=1}^{t-1} \Omega(f_k) = \Omega(f_t) + \text{constant}$$
2.1.3 Taylor formula expansion
Taylor's formula approximates a function $f(x)$ that is $n$-times differentiable at a point $x_0$ by an $n$-th degree polynomial in $(x - x_0)$. If $f(x)$ has continuous derivatives up to order $n$ on a closed interval containing $x_0$ and an $(n+1)$-th order derivative on the open interval, then for any $x$ in that interval:
$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(x_0)}{k!}(x - x_0)^k + R_n(x)$$
The polynomial is called the Taylor expansion of $f$ at $x_0$, and $R_n(x)$ is the remainder of Taylor's formula, a higher-order infinitesimal of $(x - x_0)^n$.
According to Taylor's formula, the second-order Taylor expansion of $f(x + \Delta x)$ at the point $x$ is:
$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$$
Returning to XGBoost's objective function, $f$ corresponds to the loss function $l$, $x$ corresponds to the prediction of the previous $t-1$ trees $\hat{y}_i^{(t-1)}$, and $\Delta x$ corresponds to the tree we are currently training, $f_t(x_i)$. The loss function can therefore be written as:
$$l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)$$
where $g_i$ is the first derivative and $h_i$ the second derivative of the loss function:
$$g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}^{(t-1)}\big)$$
Note that the derivatives here are taken with respect to $\hat{y}^{(t-1)}$.
Let's take the squared loss function as an example:
$$l\big(y_i, \hat{y}^{(t-1)}\big) = \big(y_i - \hat{y}^{(t-1)}\big)^2$$
Then:
$$g_i = \partial_{\hat{y}^{(t-1)}} \big(y_i - \hat{y}^{(t-1)}\big)^2 = -2\big(y_i - \hat{y}^{(t-1)}\big), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} \big(y_i - \hat{y}^{(t-1)}\big)^2 = 2$$
Substituting the second-order expansion above into XGBoost's objective function, we obtain an approximation of the objective:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t) + \text{constant}$$
Since $\hat{y}_i^{(t-1)}$ is already known at step $t$, the term $l\big(y_i, \hat{y}_i^{(t-1)}\big)$ is a constant and does not affect the optimization. Removing all constant terms, the objective becomes:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$
So we only need to compute the first derivative $g_i$ and the second derivative $h_i$ of the loss at each step (since $\hat{y}^{(t-1)}$ is known, these two values are constants for every sample), then optimize the objective to obtain the tree $f_t(x)$ of each step, and finally combine the trees according to the additive model to get the overall model.
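As a concrete illustration (my own sketch, not part of the original article), the snippet below shows how the per-sample gradients $g_i$ and Hessians $h_i$ drive training in practice: XGBoost's native API accepts a custom objective that returns exactly these two vectors. The squared-error objective mirrors the derivation above; the function name and the toy dataset are my own choices.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

# Custom objective: return per-sample first and second derivatives (g_i, h_i)
# of the squared loss l(y, y_hat) = (y - y_hat)^2 with respect to y_hat.
def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = 2.0 * (preds - labels)      # g_i = -2 (y_i - y_hat)
    hess = 2.0 * np.ones_like(preds)   # h_i = 2
    return grad, hess

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
# Passing `obj=` makes xgb.train use our g/h instead of a built-in loss.
booster = xgb.train({'max_depth': 3, 'eta': 0.1}, dtrain,
                    num_boost_round=50, obj=squared_error_obj)
```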
2.1.4 Defining a Tree
We know that XGBoost's base learner supports not only decision trees but also linear models; here we focus on the objective based on decision trees. A decision tree can be redefined by two parts:
- the leaf weight vector $w \in \mathbb{R}^T$, where $T$ is the number of leaves;
- the mapping $q(x)$ from an instance (sample) to a leaf index, which is essentially the branching structure of the tree.

So a tree can be written as $f_t(x) = w_{q(x)}$.
2.1.5 Defining the tree complexity
The complexity $\Omega$ of a decision tree is determined by two things: the number of leaves (fewer leaves means a simpler model) and the leaf weights (no leaf should carry too large a weight, analogous to the coefficient weights in logistic regression). The regularization term in the objective is therefore defined from the number of leaf nodes of the tree and the norm of its leaf weight vector:
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
2.1.6 Grouping samples by leaf node
We group all samples that fall into the $j$-th leaf node into the sample set of that leaf, written mathematically as $I_j = \{\, i \mid q(x_i) = j \,\}$. The objective of XGBoost can then be written as:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \Big] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 = \sum_{j=1}^{T} \Big[ \Big(\sum_{i \in I_j} g_i\Big) w_j + \frac{1}{2} \Big(\sum_{i \in I_j} h_i + \lambda\Big) w_j^2 \Big] + \gamma T$$
The step from the sum over samples to the sum over leaves may not look obvious at first, so here is the idea: the first form iterates over all samples and adds up each sample's loss, but every sample eventually falls into exactly one leaf node. We can therefore iterate over leaf nodes instead, collect the set of samples in each leaf, and sum the loss within each leaf. In other words, what used to be a sum over individual samples is rewritten as a sum over leaf-node groups. Since a leaf contains multiple samples, the sums $\sum_{i \in I_j} g_i$ and $\sum_{i \in I_j} h_i$ appear, and $w_j$ is the weight of the $j$-th leaf node.
To simplify the expression, we define $G_j$ and $H_j$ with the following meanings:
- $G_j = \sum_{i \in I_j} g_i$: the sum of the first derivatives of the samples contained in leaf node $j$, a constant;
- $H_j = \sum_{i \in I_j} h_i$: the sum of the second derivatives of the samples contained in leaf node $j$, a constant;

Substituting $G_j$ and $H_j$ into XGBoost's objective, the final objective function is:
$$Obj^{(t)} = \sum_{j=1}^{T} \Big[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \Big] + \gamma T$$
Note that $G_j$ and $H_j$ are obtained from the results of the previous step, so their values are known and can be regarded as constants; only the leaf weights $w_j$ of the last tree are still undetermined.
2.1.7 Tree structure scoring
Recall some junior high school mathematics: suppose we have a univariate quadratic function of the form
$$y = ax^2 + bx + c, \quad a > 0$$
Its minimum point can easily be found with the vertex formula:
$$x^* = -\frac{b}{2a}, \qquad y_{\min} = \frac{4ac - b^2}{4a}$$
So, going back to XGBoost's final objective function, how do we find its minimum?
Let's start with a quick analysis of this formula:
- For each leaf node $j$, the following sub-expression can be pulled out of the objective:
$$G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2$$
As defined in 2.1.6, $G_j$ and $H_j$ can be computed from the previous $t-1$ trees. This sub-expression is therefore a univariate quadratic function whose only variable is the leaf weight $w_j$, and its minimum point can be found with the vertex formula.
- Looking at the objective again, the sub-expressions of different leaf nodes are independent of each other; in other words, the whole objective reaches its minimum exactly when every leaf's sub-expression reaches its own minimum.
Then, assuming the structure of the tree is fixed, applying the vertex formula of the univariate quadratic function, i.e. taking the derivative of each leaf's sub-expression with respect to $w_j$ and setting it to zero, yields the optimal weight of each leaf node:
$$w_j^* = -\frac{G_j}{H_j + \lambda}$$
Substituting back, the objective function simplifies to:
$$Obj = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
The figure above shows an example of the objective calculation: compute the first and second derivatives $g_i$ and $h_i$ for each sample, sum them over the samples in each leaf node to obtain $G_j$ and $H_j$, and finally traverse the leaf nodes of the decision tree to evaluate the objective.
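To make the scoring concrete, here is a small illustrative sketch (my own, not from the original article) that evaluates the structure score $Obj = -\frac{1}{2}\sum_j G_j^2/(H_j+\lambda) + \gamma T$ and the optimal leaf weights, given per-sample gradients and a leaf assignment.

```python
import numpy as np

def structure_score(g, h, leaf_index, lam=1.0, gamma=0.0):
    """Score a fixed tree structure.

    g, h       : per-sample first/second derivatives of the loss
    leaf_index : leaf id q(x_i) for each sample
    Returns (objective value, optimal weight per leaf).
    """
    leaves = np.unique(leaf_index)
    G = np.array([g[leaf_index == j].sum() for j in leaves])  # G_j
    H = np.array([h[leaf_index == j].sum() for j in leaves])  # H_j
    w_star = -G / (H + lam)                                   # w_j* = -G_j / (H_j + lambda)
    obj = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(leaves)
    return obj, w_star

# Toy example: 6 samples with squared-loss gradients, assigned to 2 leaves
g = np.array([-2.0, -1.0, 0.5, 1.5, 2.0, -0.5])
h = np.full(6, 2.0)
leaf_index = np.array([0, 0, 0, 1, 1, 1])
print(structure_score(g, h, leaf_index))
```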
2.2 Generation details of a tree
2.2.1 Split point finding algorithms
During actual training, when building the $t$-th tree, a key problem is how to find the best point at which to split a leaf node. XGBoost supports two split-finding methods: the exact greedy algorithm and the approximate algorithm.
(1) Greedy algorithm
Start with a tree depth of 0:
- For each leaf node, enumerate all available features;
- For each feature, sort the training samples belonging to that node in ascending order of the feature value, determine the best split point of the feature by a linear scan, and record the split gain of the feature;
- Select the feature with the largest gain as the splitting feature and its best split point as the split position; split the node into two new leaf nodes and associate the corresponding sample sets with each new node;
- Return to step 1 and recurse until a stopping condition is met.
So how is the split gain of each feature computed? Suppose we split a node on some feature into a left child $L$ and a right child $R$. Considering only this node, the objective before the split can be written as:
$$Obj_{before} = -\frac{1}{2}\,\frac{(G_L + G_R)^2}{H_L + H_R + \lambda} + \gamma$$
The objective after the split is:
$$Obj_{after} = -\frac{1}{2}\Big[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}\Big] + 2\gamma$$
The gain of the split for the objective is therefore:
$$Gain = \frac{1}{2}\Big[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\Big] - \gamma$$
Note: this split gain can also be used as an important basis for reporting feature importance.
For each split, we need to enumerate all possible split schemes over the features. How can we enumerate them efficiently? Suppose we split on some feature: after sorting the samples by that feature's value, for a particular split point we need the sums of the derivatives on the left and on the right, i.e. $G_L, H_L$ and $G_R, H_R$.
We can see that, for all split points of a feature, a single left-to-right scan over the sorted samples is enough to accumulate $G_L$ and $H_L$ (with $G_R = G - G_L$ and $H_R = H - H_L$), and the formula above then gives the gain of every candidate split.
Looking at the gain after a split, we can see that splitting a node does not always improve the result, because introducing a new leaf incurs the penalty $\gamma$. In other words, if the gain brought by a split is smaller than a threshold, we can simply decline to make that split (pruning).
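The following sketch (my own illustration, with simplified bookkeeping) implements this left-to-right scan over one pre-sorted feature and returns the best split according to the gain formula above.

```python
import numpy as np

def best_split_exact(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split finding for a single feature.

    x    : feature values of the samples in the node
    g, h : first/second derivatives of the loss for those samples
    Returns (best_gain, best_threshold) or (0.0, None) if no split helps.
    """
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()                 # totals for the node
    G_L = H_L = 0.0
    best_gain, best_thr = 0.0, None
    for i in range(len(x) - 1):
        G_L += g[i]; H_L += h[i]            # accumulate left statistics
        if x[i] == x[i + 1]:                # cannot split between equal values
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:                # keep the split only if gain > 0
            best_gain, best_thr = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_thr
```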
(2) Approximation algorithm
The greedy algorithm finds the exact optimal split, but when the data are too large to fit into memory it becomes infeasible; the approximate algorithm addresses this shortcoming of the greedy algorithm by producing an approximately optimal solution.
For each feature, considering only a set of quantile points rather than every distinct value greatly reduces the computational cost.
The algorithm first proposes candidate split points according to quantiles of each feature's distribution, then maps continuous feature values into buckets delimited by these candidate points, aggregates the gradient statistics per bucket, and finds the best split point among the candidates.
There are two strategies for proposing the candidate split points:
- Global: propose the candidate split points once before learning each tree, and reuse the same candidates for every split in that tree;
- Local: re-propose the candidate split points before each split. Intuitively, the Local strategy requires more proposal computations, while the Global strategy needs more candidate points, because its candidates are not refined after nodes have been split.
The following figure shows the AUC curves of the different split strategies. The x-axis is the number of iterations, the y-axis is the AUC on the test set, and eps is the accuracy parameter of the approximate algorithm; its reciprocal is roughly the number of buckets.
As the figure shows, the Global strategy with many candidate points (small eps) and the Local strategy with fewer candidate points (large eps) achieve similar accuracy. It also shows that, with a reasonable eps, the quantile strategy can reach the same accuracy as the exact greedy algorithm.
In simple terms, the approximate algorithm determines candidate split points according to the distribution of each feature, places the samples into the buckets delimited by these candidate points, accumulates the gradient statistics of each bucket, and finally performs a greedy search over the set of candidate split points. The algorithm is described as follows:
Algorithm explanation:
- First for loop: find a candidate set of split points $S_k$ for each feature $k$ according to quantiles of the feature's distribution. The idea is to propose a small set of split points rather than enumerating every possible one. The method used to obtain a feature's candidate cut points is called the proposal strategy; XGBoost supports both the Global and the Local policy.
- Second for loop: map each feature value into the bucket interval delimited by that feature's candidate points, accumulate the gradient statistics ($G$ and $H$) of the samples in each bucket, and finally search over these accumulated statistics for the best split point. The purpose is to obtain the statistics needed to evaluate every candidate split of every feature.
The following figure shows a concrete example of the approximate algorithm, using the 1/3 quantiles (three buckets) as an example:
The samples are sorted by feature value and divided into buckets at the quantile points, the $G$ and $H$ statistics of the three buckets are accumulated, and finally the gain of each candidate split is computed.
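A minimal sketch of this idea (my own simplification: one feature, global proposal, exact quantiles via NumPy) using quantile candidates and bucket-accumulated statistics:

```python
import numpy as np

def best_split_approx(x, g, h, n_buckets=3, lam=1.0, gamma=0.0):
    """Approximate split finding: evaluate gains only at quantile candidates."""
    # Propose candidate split points from the feature's quantiles (global proposal)
    qs = np.quantile(x, np.linspace(0, 1, n_buckets + 1)[1:-1])
    candidates = np.unique(qs)
    # Accumulate G and H per bucket delimited by the candidates
    bucket = np.searchsorted(candidates, x, side='right')
    G = np.bincount(bucket, weights=g, minlength=len(candidates) + 1)
    H = np.bincount(bucket, weights=h, minlength=len(candidates) + 1)
    G_tot, H_tot = G.sum(), H.sum()
    # Greedy search over the (few) candidate points using cumulative statistics
    G_L, H_L = np.cumsum(G)[:-1], np.cumsum(H)[:-1]
    best_gain, best_thr = 0.0, None
    for i, thr in enumerate(candidates):
        G_R, H_R = G_tot - G_L[i], H_tot - H_L[i]
        gain = 0.5 * (G_L[i]**2 / (H_L[i] + lam) + G_R**2 / (H_R + lam)
                      - G_tot**2 / (H_tot + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_gain, best_thr
```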
2.2.2 Weighted quantile sketch
In fact, XGBoost does not take quantiles simply by sample count; it uses the second derivative $h_i$ as the weight of each sample. To handle the selection of candidate split points with weights, the paper proposes the Weighted Quantile Sketch algorithm, a data structure that supports merge and prune operations. The proof link the author originally gave is no longer valid, but the algorithm is described in the appendix of the latest version of the XGBoost paper on arXiv at arxiv.org/abs/1603.02… . Briefly, for feature $k$, over the set $D_k = \{(x_{1k}, h_1), (x_{2k}, h_2), \dots, (x_{nk}, h_n)\}$ define the rank function
$$r_k(z) = \frac{\sum_{(x,h)\in D_k,\; x < z} h}{\sum_{(x,h)\in D_k} h}$$
and choose the candidate split points $\{s_{k1}, s_{k2}, \dots, s_{kl}\}$ so that
$$\big|r_k(s_{k,j}) - r_k(s_{k,j+1})\big| < \epsilon$$
i.e. adjacent candidates differ by less than $\epsilon$ in $h$-weighted rank, which yields roughly $1/\epsilon$ candidate points.
So why use the second derivative $h_i$ as the sample weight?
We know that the (approximated) objective function of the model is:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$
By completing the square, the objective can be rearranged into the following form, which makes the weighting of the loss visible:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \frac{1}{2}\, h_i \Big( f_t(x_i) - \big(-\tfrac{g_i}{h_i}\big) \Big)^2 + \Omega(f_t) + C$$
Here $C$ absorbs the term $\frac{1}{2}\sum_i \frac{g_i^2}{h_i}$, which can be added because $g_i$ and $h_i$ are derivatives from the previous round and therefore constants. The result is a weighted squared loss with label $-g_i/h_i$ and weight $h_i$; in other words, $h_i$ is exactly the weight of sample $i$ in this squared loss.
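A simplified illustration of the weighted-quantile idea (my own sketch, computed exactly rather than with the paper's merge/prune sketch structure): pick candidates whose $h$-weighted ranks are roughly eps apart.

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.3):
    """Pick candidate split points so that adjacent candidates differ by
    roughly eps in h-weighted rank (simplified, exact computation)."""
    order = np.argsort(x)
    x, h = x[order], h[order]
    ranks = np.cumsum(h) / h.sum()          # h-weighted rank of each sorted value
    candidates, last_rank = [], 0.0
    for xi, ri in zip(x, ranks):
        if ri - last_rank >= eps:           # rank gap reached eps: emit a candidate
            candidates.append(xi)
            last_rank = ri
    return np.array(candidates)

# Samples with larger h pull the cut points toward themselves
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
h = np.array([0.1, 0.1, 0.1, 2.0, 2.0, 2.0])
print(weighted_quantile_candidates(x, h, eps=0.3))
```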
2.2.3 Sparsity-aware algorithm
In real-world engineering, the input is often sparse: missing data or one-hot encoding, for example, both produce sparse inputs. When building the nodes of a tree, XGBoost traverses only the non-missing values of a feature, and adds a default direction to each node; when a sample's value for that feature is missing, the sample is sent in the default direction. Learning the optimal default direction from the data is actually very simple: try sending all the feature-missing samples to the left branch and to the right branch respectively, compute the gain of each choice, and take the direction with the larger gain as the optimal default direction.
Handling the missing-value samples during tree construction may at first appear to double the amount of computation, but it does not: the algorithm iterates only over the non-missing values, while the missing-value samples are assigned wholesale to the left or right child, which greatly reduces the number of samples to traverse. The author's experiments on the Allstate-10K dataset show that the sparsity-aware algorithm processes data more than 50 times faster than the naive implementation.
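As a usage-level illustration (my own toy example), XGBoost learns the default directions automatically when the input simply contains NaN; nothing special is required beyond leaving missing entries as NaN (or setting the `missing` argument of `DMatrix`):

```python
import numpy as np
import xgboost as xgb

# A toy dataset where some feature values are missing (encoded as NaN)
X = np.array([[1.0, 3.0],
              [np.nan, 2.0],
              [2.5, np.nan],
              [4.0, 1.0],
              [np.nan, 5.0]])
y = np.array([0, 0, 1, 1, 1])

# NaN entries are treated as missing; during training each split learns a
# default direction, and at prediction time missing values follow it.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({'objective': 'binary:logistic', 'max_depth': 2},
                    dtrain, num_boost_round=10)
print(booster.predict(xgb.DMatrix(X, missing=np.nan)))
```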
3. Engineering implementation of XGBoost
3.1 Column block parallel learning
One of the most time-consuming steps of tree generation is sorting the feature values every time the best split point is sought. XGBoost therefore sorts the data by feature before training and stores it in a block structure using the compressed sparse column (CSC) format; this block structure is reused repeatedly during training, which greatly reduces the amount of computation.
To be able to look up the first and second derivatives of the samples, the author stores in each block the feature values sorted per feature together with references (indices) to the corresponding samples, as illustrated in the figure below.
Traversing the sorted feature values in a block makes it convenient to search for split points. Moreover, once the data are stored in blocks, different features do not interfere with one another, so multiple threads can search different features at the same time, i.e. feature-level parallelism. When splitting a node, the feature with the largest gain must be chosen, and the gain computation of each feature can be carried out simultaneously; this is what allows XGBoost to do distributed or multi-threaded computation.
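At the API level (my own usage sketch, not part of the original article), XGBoost accepts a SciPy sparse matrix directly, and the number of threads used for the feature-parallel split search can be controlled with `nthread`:

```python
import numpy as np
import scipy.sparse as sp
import xgboost as xgb

# Sparse input: XGBoost accepts SciPy CSR/CSC matrices directly
X = sp.random(1000, 50, density=0.1, format='csr', random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'binary:logistic',
    'max_depth': 6,
    'nthread': 4,   # threads used to scan different features in parallel
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```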
3.2 Cache-aware access
The column-block design reduces the computation needed to split nodes: the feature values themselves are read from a contiguous block of memory, but the first and second derivatives of a sample are fetched via the sample index stored alongside each feature value, and those accesses are not contiguous in memory. This can lead to a low CPU cache hit rate and hurt the efficiency of the algorithm.
To solve the low cache hit rate, XGBoost uses a cache-aware access algorithm: a contiguous buffer is allocated for each thread, and the gradient statistics that will be needed are prefetched into it, converting non-contiguous memory accesses into contiguous ones and improving efficiency. Choosing an appropriate block size also helps cache optimization.
3.3 "Out-of-core" block computation
When the data are very large, they cannot all be loaded into memory; part of the data must stay on disk and be loaded when needed. The obvious bottleneck is that disk I/O is far slower than in-memory processing, so the computation would spend a lot of time waiting on disk reads. To address this, the author proposes "out-of-core" computation: the dataset is divided into multiple blocks stored on disk, and an independent thread reads blocks from disk into memory, so that the algorithm can process data in memory and read data from disk at the same time. In addition, XGBoost uses two techniques to reduce the overhead of disk reads and writes, listed below; a brief usage sketch follows the list.
- Block compression: blocks are compressed by column and decompressed by an independent thread while being read. For row indices, only the first index of the block is stored, and the offsets from it are stored as 16-bit integers, which works out when each block holds about 2^16 samples; the author reports a good compression ratio with this setting.
- Block sharding: blocks are sharded across multiple disks, with a prefetch thread assigned to each disk, which increases the overall disk I/O throughput.
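If memory is tight, XGBoost's external-memory mode can be enabled from Python by appending a cache-file suffix to the data URI. Treat the snippet below as a sketch only: the exact URI syntax (`?format=libsvm`, `#cache` suffix) and the assumed file `train.libsvm` depend on the XGBoost version and are not taken from the original article.

```python
import xgboost as xgb

# Sketch: assumes a LibSVM-format file `train.libsvm` on disk.
# The '#dtrain.cache' suffix asks XGBoost to page the data through an
# on-disk cache instead of holding everything in memory.
dtrain = xgb.DMatrix('train.libsvm?format=libsvm#dtrain.cache')
booster = xgb.train({'objective': 'binary:logistic', 'tree_method': 'hist'},
                    dtrain, num_boost_round=100)
```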
4. XGBoost’s pros and cons
4.1 Advantages
- Higher accuracy: GBDT uses only a first-order Taylor expansion, while XGBoost performs a second-order Taylor expansion of the loss function. XGBoost introduces the second derivative both to improve accuracy and to make custom loss functions possible, since the second-order Taylor expansion can approximate a wide range of loss functions;
- More flexibility: GBDT uses CART as its base learner, while XGBoost supports not only CART but also linear learners. With a linear learner, XGBoost is equivalent to logistic regression with L1 and L2 regularization (for classification) or to linear regression (for regression). In addition, XGBoost supports custom loss functions, as long as the loss has first- and second-order derivatives;
- Regularization: XGBoost adds a regularization term to the objective to control model complexity. The term contains the number of leaf nodes and the norm of the leaf weights. Regularization reduces the variance of the model, making the learned model simpler and helping to prevent overfitting, which is one way XGBoost improves on traditional GBDT;
- Shrinkage: equivalent to a learning rate. After each iteration, XGBoost multiplies the leaf weights by this coefficient, mainly to reduce the influence of each individual tree and leave more room for later trees to learn. Traditional GBDT implementations also have a learning rate;
- Column subsampling: XGBoost borrows from random forests and supports column (feature) subsampling, which both reduces overfitting and reduces computation. This is another feature that distinguishes XGBoost from traditional GBDT;
- Missing value handling: for samples with missing feature values, the sparsity-aware algorithm used by XGBoost automatically learns the best splitting direction;
- Parallelism: boosting is a sequential procedure, so how can it be parallel? Note that XGBoost's parallelism is not at the level of trees: each iteration can only start after the previous one finishes, since the objective of iteration $t$ contains the predictions of iteration $t-1$. XGBoost's parallelism is at the level of features. One of the most time-consuming steps of decision tree learning is sorting feature values to determine the best split point. XGBoost sorts the data before training and stores it in a block structure that is reused in later iterations, greatly reducing computation. The block structure also makes parallelism possible: when splitting a node, the gain of every feature must be computed and the feature with the largest gain chosen, and these per-feature gain computations can be done in parallel across threads;
- Parallelizable approximate algorithm: when splitting tree nodes, we need to compute the gain at every candidate split of every feature, i.e. greedily enumerate all possible split points. The greedy algorithm becomes inefficient when the data cannot fit into memory at once or in distributed settings, so XGBoost also proposes a parallelizable approximate algorithm to generate candidate split points efficiently.
4.2 Disadvantages
- Although pre-sorting and the approximate algorithm reduce the cost of finding the best split point, the dataset still has to be traversed during node splitting.
- The space complexity of pre-sorting is high: it stores not only the feature values but also the indices from feature values to the gradient statistics of the corresponding samples, which roughly doubles the memory consumption.
5. XGBoost examples
All data sets and code for this article are on my GitHub at github.com/Microstrong…
5.1 Installing XGBoost Dependency Packages
pip install xgboost
5.2 XGBoost classification and regression
XGBoost provides two major categories of interfaces: the XGBoost native interface and the scikit-learn interface, and it can perform both classification and regression tasks.
(1) Classification based on XGBoost native interface
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
# read in the iris data
iris = load_iris()
X = iris.data
y = iris.target
# split train data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)
# set XGBoost's parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # for regression use e.g. 'objective': 'reg:gamma'
    'num_class': 3,  # regression tasks do not have this parameter
    'gamma': 0.1, 'max_depth': 6, 'lambda': 2, 'subsample': 0.7,
    'colsample_bytree': 0.7, 'min_child_weight': 3, 'silent': 1,
    'eta': 0.1, 'seed': 1000, 'nthread': 4,
}
plst = list(params.items())
dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 500
model = xgb.train(plst, dtrain, num_rounds)
# Make predictions for the test set
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)
# Calculation accuracy
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1
print("Accuracy: %.2f %% " % (100 * cnt1 / (cnt1 + cnt2)))
# Show important features
plot_importance(model)
plt.show()
(2) Regression based on the scikit-learn interface
The Kaggle competition House Prices: Advanced Regression Techniques (www.kaggle.com/c/house-pri… ) is used as the example.
The training set of this house price prediction task has 81 columns in total: the first column is Id, the last column is the label, and the 79 columns in between are features. Of these 79 feature columns, 43 are categorical variables, 33 are integer variables, and 3 are floating-point variables. The training set also contains missing values.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
# 1. Read the file
data = pd.read_csv('./dataset/train.csv')
data.dropna(axis=0, subset=['SalePrice'], inplace=True)
# 2. Syncopated data input: feature output: predicted target variables
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# 3. Split into training and test sets with a 75:25 ratio
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
# 4. Null value processing, default method: fill with the average value of the feature column
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
# 5. Call the XGBoost model and train with the training set data (fit)
# Add verbosity=2 to print messages while running boosting
my_model = xgb.XGBRegressor(objective='reg:squarederror', verbosity=2)  # use xgb.XGBClassifier() for classification
my_model.fit(train_X, train_y, verbose=False)
# 6. Use the model to predict the test set data
predictions = my_model.predict(test_X)
# 7. Evaluate the predicted results of the model (mean absolute error)
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))
5.3 XGBoost adjustable parameters
In the previous section, the XGBoost models used the default parameters, but the defaults are not necessarily the best. To get better performance out of XGBoost, its model parameters need to be tuned. The figure below lists the parameters that the classification model typically needs to tune; the parameters for the regression model are similar.
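As a supplementary sketch (my own, not from the original article), the snippet below shows one common way to search over a few typical XGBoost hyperparameters with scikit-learn's GridSearchCV; the grid values are arbitrary examples and would be tuned per problem.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

X, y = load_iris(return_X_y=True)

# Hypothetical small grid over commonly tuned parameters
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [100, 300],
    'subsample': [0.7, 1.0],
    'colsample_bytree': [0.7, 1.0],
}
model = xgb.XGBClassifier()
search = GridSearchCV(model, param_grid, scoring='accuracy', cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```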
6. Some thoughts on XGBoost
6.1 What are the connections and differences between XGBoost and GBDT?
(1) GBDT is a machine learning algorithm, while XGBoost is an efficient engineering implementation of that algorithm.
(2) Regularization: when CART is used as the base learner, XGBoost explicitly adds a regularization term to control model complexity, which helps prevent overfitting and improves generalization.
(3) Derivative information: GBDT uses only the first-order derivative of the cost function during training, while XGBoost performs a second-order Taylor expansion of the cost function and can use first- and second-order derivatives simultaneously.
(4) Base learners: traditional GBDT uses CART as the base learner, while XGBoost supports several kinds of base learners, such as linear learners.
(5) Subsampling: traditional GBDT uses all the data in every iteration, while XGBoost supports sampling the data, similar to random forests.
(6) Missing values: traditional GBDT has no built-in treatment of missing values, while XGBoost can automatically learn a strategy for handling them.
(7) Parallelization: traditional GBDT has no parallel design. Note that XGBoost's parallelism is not across trees but across features: each feature is pre-sorted by value and stored as a block structure, and when splitting a node, multiple threads can search for the best split point of each feature in parallel, which greatly speeds up training.
6.2 Why does XGBoost work better with a second-order Taylor expansion?
(1) From the perspective of why the second-order expansion was introduced (extensibility): according to the XGBoost documentation, when the objective is MSE the expansion naturally has the form of a first-order term (the residual) plus a second-order term, while the expansions of other objectives, such as logistic loss, do not have such a clean form. To obtain a uniform form, a Taylor expansion is used to get the second-order terms, so that the derivation done for MSE can be reused directly for other, custom loss functions. In short, the second-order expansion unifies the form of the loss-function derivatives in order to support custom losses. Why unify toward the MSE form? Because MSE is the most common and widely used loss, its derivative is the easiest to compute, and the resulting form is very simple; once a loss is put into the MSE-like form, the same derivation applies.
(2) From the nature of the second derivative itself, i.e. why the second-order expansion helps (accuracy): second-order information makes gradient-based optimization converge faster and more accurately, as Newton's method demonstrates in optimization. Loosely speaking, the first derivative describes the direction of the gradient, and the second derivative describes how that direction changes. Compared with GBDT's first-order Taylor expansion, XGBoost's second-order expansion approximates the true loss function more accurately.
6.3 How does XGBoost handle missing values?
A common strategy in plain GBDT is to fill missing values manually first and then treat them like ordinary feature values, but such manual filling is not necessarily accurate and has little theoretical basis. XGBoost instead does not impute: when enumerating split points it uses only the samples with non-missing values of the feature. It then tries assigning the samples with missing values to the left subtree and to the right subtree respectively, and chooses the direction with the better loss as the default direction for missing values.
6.4 Why can XGBoost be trained in parallel?
(1) XGBoost's parallelism does not mean that the trees are trained in parallel. In essence, XGBoost still follows the boosting idea: each tree can only start training after the previous tree has finished.
(2) Parallelism of XGBoost refers to parallelism of feature dimensions: Before training, each feature is pre-sorted according to the feature value of the sample and stored as a Block structure, which can be used repeatedly in the later search for feature segmentation points. Moreover, features have been stored as a Block structure, so when searching for the optimal segmentation point of each feature, multi-threading can be used for parallel calculation of each Block.
7. Reference
Because many sources were consulted, the key references for each part are listed in detail below.
XGBoost paper and principles:
[1] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016: 785-794.
[3] Understanding XGBoost, by Jin Guitao, Zhihu: https://zhuanlan.zhihu.com/p/75217528
[4] CTR prediction papers in depth (1): XGBoost, blog.csdn.net/Dby_freedom…
[5] XGBoost paper reading and principles, by Salon.sai, Zhihu: https://zhuanlan.zhihu.com/p/36794802
[6] XGBoost paper translation with personal annotations, blog.csdn.net/qdbszsj/art…
XGBoost algorithm explanations:
[7] A super-detailed derivation of XGBoost, finally explained clearly! mp.weixin.qq.com/s/wLE9yb7Mt…
[8] Someone finally explained XGBoost and LightGBM, the most mainstream ensemble algorithms in industry: https://mp.weixin.qq.com/s/LoX987dypDg8jbeTJMpEPQ
[9] What are the differences between GBDT and XGBoost in machine learning? Answer by wepon, Zhihu: www.zhihu.com/question/41…
[10] A brief introduction to GBDT algorithm principles and system design, by wepon: http://wepon.me/files/gbdt.pdf
XGBoost examples:
[11] Kaggle's secret weapon: XGBoost, www.jianshu.com/p/7e0e2d66b…
[12] Applications of XGBoost in Ctrip search ranking: https://mp.weixin.qq.com/s/X4K6UFZPxL05v2uolId7Lw
[13] The most detailed XGBoost hands-on guide ever, by Zhang Huayan, Zhihu: zhuanlan.zhihu.com/p/31182879
[14] XGBoost model building process and parameter tuning (house price prediction, with code), by AI Academic Frontier, Zhihu: https://zhuanlan.zhihu.com/p/61150141
XGBoost interview questions:
[15] 20 XGBoost interview questions: how many can you answer? (Part 1), mp.weixin.qq.com/s/_QgnYoW82…
[16] 20 XGBoost interview questions: how many can you answer? (Part 2), https://mp.weixin.qq.com/s/BbelOsYgsiOvwfwYs5QfpQ
[17] 10 XGBoost interview questions worth collecting, mp.weixin.qq.com/s/RSQWx4fH3…
[18] Interview question: how does XGBoost score features? https://mp.weixin.qq.com/s/vjLPVhg_UavZIJrOzu_u1w
[19] Campus recruitment basics: GBDT/XGBoost FAQ, by Jack Stark, Zhihu: zhuanlan.zhihu.com/p/81368182
[20] Zhuge Yue (chief editor), Hulu, Baimian Machine Learning, pp. 295-297.
[21] Soul-searching: have you really read the XGBoost paper? Zhihu: zhuanlan.zhihu.com/p/86816771
[22] Why does XGBoost's second-order Taylor expansion work better? Answer by Zsank, Zhihu: https://www.zhihu.com/question/277638585/answer/522272201