Foreword
GBM (Gradient Boosting Machine) models are indispensable in data competitions, and the most common implementations are XGBoost and LightGBM.
Models matter a great deal in data competitions, but in practice most competitors spend relatively little time on the model itself, preferring to reserve their precious time for feature engineering and model ensembling. The usual workflow is to build a baseline demo first and explore the model's potential as quickly as possible, so that later effort can go into features and ensembling. That is where some tuning skill comes in handy.
This article selects more than ten important parameters from the two models and shows you a quick, lightweight way to tune them. If you have higher requirements, you will still have to consult the official documentation for LightGBM and XGBoost (links below).
For easier experimentation, I have put the pre-processed data on Baidu Cloud; click the link to download it.
Some important parameters of XGBoost
XGBoost’s parameters fall into three categories:
- General parameters: macro-level control of the model.
- Booster parameters: control the booster (tree or linear model) at each step. These largely determine the model's effectiveness and computational cost; "parameter tuning" is largely about adjusting the booster parameters.
- Learning objective parameters: control the training objective. The type of problem is mainly expressed through these parameters — for example, whether we are doing classification or regression, and whether it is binary or multi-class.
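These three groups show up directly in code. A minimal sketch of a params dict for `xgboost.train`, annotated by category (the specific values here are illustrative only, not recommendations):

```python
# Illustrative parameter dict, grouped by the three categories above
params = {
    # general parameters: macro-level control
    'booster': 'gbtree',
    'nthread': -1,
    # booster parameters: control each tree
    'max_depth': 6,
    'learning_rate': 0.1,
    # learning objective parameters: define the task
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
}
print(sorted(params))
```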
Note: the parameters below are the ones I consider important. For the complete list, please see the official documentation.
General parameters
- `booster`: two options, `gbtree` and `gblinear`. `gbtree` runs the data through tree models, while `gblinear` is based on linear models.
- `silent`: silent mode; when set to 1, the model runs without printing output.
- `nthread`: the number of threads to use. Normally we set it to -1 to use all threads; if needed, set it to however many threads you want.
Booster parameters
- `n_estimators` (also `num_boost_round`): the maximum number of trees to generate, i.e. the maximum number of iterations.
- `learning_rate` (also called `eta`): default 0.3. The step size of each iteration matters: too large and the model is inaccurate, too small and training is slow. We generally use a value smaller than the default; around 0.1 works well.
- `gamma`: default 0, and we often leave it at 0. A node is split only if the split reduces the loss function, and gamma specifies the minimum loss reduction required to make a split. The larger gamma is, the more conservative the algorithm: trees are less likely to split nodes during growth. Range: [0, ∞].
- `subsample`: default 1. Controls the fraction of rows randomly sampled for each tree. Decreasing it makes the algorithm more conservative and helps avoid overfitting, but setting it too small may cause underfitting. Typical values: 0.5-1; 0.5 means sampling half the data to prevent overfitting. Range: (0, 1] — note that 0 is not allowed.
- `colsample_bytree`: default 1; we typically set it to about 0.8. Controls the fraction of columns (each column is a feature) randomly sampled for each tree. Typical values: 0.5-1. Range: (0, 1].
- `colsample_bylevel`: default 1, and we usually leave it at 1. Finer-grained than the previous parameter: it is the fraction of columns sampled at each depth level within each tree.
- `max_depth`: default 6. We often use values between 3 and 10. This is the maximum depth of a tree and is used to control overfitting: the larger max_depth, the more specific the patterns the model learns. Set to 0 for no limit. Range: [0, ∞].
- `max_delta_step`: default 0, which we usually keep. Limits the maximum step size of each tree's weight update; 0 means no constraint, and a positive value makes the algorithm more conservative. Normally this parameter does not need to be set, but it can help the logistic-regression optimizer when classes are extremely imbalanced.
- `lambda` (also `reg_lambda`): default 1. The L2 regularization term on weights (analogous to Ridge regression). This parameter controls the regularization part of XGBoost and is very helpful in reducing overfitting.
- `alpha` (also `reg_alpha`): default 0. The L1 regularization term on weights (analogous to Lasso regression). Useful in very high-dimensional settings, where it can also make the algorithm faster.
- `scale_pos_weight`: default 1. When the classes are very imbalanced, setting this to a positive value can make the algorithm converge faster. It is usually set to the ratio of the number of negative samples to the number of positive samples.
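As a quick sanity check, the negative-to-positive ratio used for `scale_pos_weight` can be computed directly from the label vector. A minimal sketch (the label array `y` here is made up):

```python
import numpy as np

# Hypothetical binary label vector: 1 = positive, 0 = negative
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

n_pos = int((y == 1).sum())
n_neg = int((y == 0).sum())

# Common starting point: number of negatives / number of positives
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)  # 4.0
```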
Learning objective parameters
`objective` [default = `reg:linear`]:
- `reg:linear` — linear regression.
- `reg:logistic` — logistic regression.
- `binary:logistic` — binary logistic regression; the output is a probability.
- `binary:logitraw` — binary logistic regression whose output is the raw score wTx before the logistic transformation.
- `count:poisson` — Poisson regression for count data; the output is a Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
- `multi:softmax` — multi-class classification using the softmax objective; you must also set num_class (the number of classes).
- `multi:softprob` — like softmax, but outputs a vector of ndata * nclass values, giving the probability of each data point belonging to each class.
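To make the difference between the last two objectives concrete, here is a small numpy sketch (the raw scores are made up): `multi:softprob` returns the full row-wise softmax matrix, while `multi:softmax` effectively returns the highest-scoring class per row.

```python
import numpy as np

# Hypothetical raw scores for 2 samples and 3 classes
raw = np.array([[2.0, 1.0, 0.1],
                [0.5, 2.5, 0.2]])

# multi:softprob — one probability per (sample, class); each row sums to 1
exp = np.exp(raw - raw.max(axis=1, keepdims=True))  # numerically stabilized softmax
softprob = exp / exp.sum(axis=1, keepdims=True)

# multi:softmax — a single predicted class label per sample
softmax_labels = softprob.argmax(axis=1)

print(softprob.shape)  # (2, 3)
print(softmax_labels)  # [0 1]
```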
`eval_metric` [default chosen according to the objective]:
- `rmse` — root mean square error.
- `mae` — mean absolute error.
- `logloss` — negative log-likelihood.
- `error` — binary classification error rate: the number of misclassified cases divided by the total number of cases. Predictions greater than 0.5 are treated as positive, the rest as negative. `error@t` lets you set a different threshold t.
- `merror` — multi-class error rate: wrong cases / all cases.
- `mlogloss` — multi-class log loss.
- `auc` — area under the ROC curve.
- `ndcg` — normalized discounted cumulative gain.
- `map` — mean average precision.
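The `error` metric (and its thresholded variant `error@t`) is easy to reproduce by hand. A minimal numpy sketch with made-up probabilities and labels:

```python
import numpy as np

# Hypothetical predicted probabilities and true binary labels
prob = np.array([0.9, 0.4, 0.6, 0.2, 0.7])
y    = np.array([1,   0,   0,   0,   1])

def error_at(prob, y, t=0.5):
    """Binary error rate with threshold t, as in XGBoost's error@t."""
    pred = (prob > t).astype(int)
    return float((pred != y).mean())

print(error_at(prob, y))        # 0.2  (one mistake out of five at the default 0.5)
print(error_at(prob, y, 0.65))  # 0.0  (a stricter threshold fixes it here)
```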
In general, we train the model with the `xgboost.train(params, dtrain)` function, where `params` holds the booster parameters.
Two basic examples
Note that when we do binary classification in XGBoost, simply setting a binary objective still yields a sequence of continuous values: the model outputs the predicted probability of the positive class. We can threshold these probabilities afterwards to get the final class, or call the XGBClassifier() class instead; the two APIs are written differently. Let me give you an example of each.
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
train_data = pd.read_csv('train.csv')  # fetch data
y = train_data.pop('30').values  # pop the label column out of the training data as the target; '30' is the label's column name
col = train_data.columns
x = train_data[col].values  # the remaining columns are the training features
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.333, random_state=0)  # split into training and validation sets
train = xgb.DMatrix(train_x, train_y)
valid = xgb.DMatrix(valid_x, valid_y)  # the train function requires DMatrix objects, as shown here
params = {
    'max_depth': 15,
    'learning_rate': 0.1,
    'n_estimators': 2000,
    'min_child_weight': 5,
    'max_delta_step': 0,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'reg_alpha': 0,
    'reg_lambda': 0.4,
    'scale_pos_weight': 0.8,
    'silent': True,
    'objective': 'binary:logistic',
    'missing': None,
    'eval_metric': 'auc',
    'seed': 1440,
    'gamma': 0}  # params holds the booster parameters; note that eval_metric is the evaluation function
xlf = xgb.train(params, train, evals=[(valid, 'eval')],
                num_boost_round=2000, early_stopping_rounds=30, verbose_eval=True)
# train with a validation set and early stopping: training stops if the metric does not improve for 30 rounds
y_pred = xlf.predict(valid, ntree_limit=xlf.best_ntree_limit)
# xgboost does not keep only the best model directly; instead, limiting prediction to best_ntree_limit trees reproduces the early-stopping optimum
auc_score = roc_auc_score(valid_y, y_pred)  # compute the ROC AUC of the predictions
This is xgboost.train(), the native XGBoost interface. The predictions it produces are a sequence of probabilities, one per sample. If we want actual class labels, we need an extra step.
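That extra step is just thresholding. A minimal sketch with made-up probabilities (0.5 is the conventional cutoff):

```python
import numpy as np

# Hypothetical probabilities as returned by a binary:logistic model
y_pred = np.array([0.91, 0.08, 0.47, 0.66])

# Convert probabilities to hard 0/1 class labels
y_class = (y_pred > 0.5).astype(int)
print(y_class)  # [1 0 0 1]
```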
Fortunately, xgboost also provides two sklearn-style classes, XGBClassifier() and XGBRegressor(), which work with sklearn utilities such as GridSearch and make classification and regression simpler and faster. Note that the xgboost sklearn wrapper does not expose the feature_importance metric, but get_fscore() serves the same purpose. To stay consistent with sklearn, the calling convention also changes, as shown in the following code:
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
train_data = pd.read_csv('train.csv')  # fetch data
y = train_data.pop('30').values  # pop the label column out of the training data as the target; '30' is the label's column name
col = train_data.columns
x = train_data[col].values  # the remaining columns are the training features
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.333, random_state=0)  # split into training and validation sets
# no DMatrix is needed here
xlf = xgb.XGBClassifier(max_depth=10, learning_rate=0.01, n_estimators=2000,
                        silent=True, objective='binary:logistic', nthread=-1,
                        gamma=0, min_child_weight=1, max_delta_step=0,
                        subsample=0.85, colsample_bytree=0.7, colsample_bylevel=1,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                        seed=1440, missing=None)
xlf.fit(train_x, train_y, eval_metric='error', verbose=True,
        eval_set=[(valid_x, valid_y)], early_stopping_rounds=30)
# verbose=True prints every round; if set to 10, it prints every 10 rounds
# eval_metric='error': there is no metric named "accuracy"; most examples online use auc, so I deliberately use error here
y_pred = xlf.predict(valid_x, ntree_limit=xlf.best_ntree_limit)
# xgboost does not keep only the best model directly; limiting prediction to best_ntree_limit trees reproduces the early-stopping optimum
auc_score = roc_auc_score(valid_y, y_pred)  # compute the ROC AUC of the predictions
Having introduced all of this, we come to the point: how do we tune parameters quickly and well? First we need to understand how Grid Search works.
An introduction to GridSearch
Grid search is a tuning method based on exhaustive search: out of all candidate parameter values, try every combination, and the best-performing combination is the final result. It is like finding the maximum value in an array. Why is it called grid search? Take a model with two parameters: if parameter A has three possible values and parameter B has four, all combinations can be laid out as a 3×4 table, each cell of which is a grid point. The loop traverses and evaluates every cell, hence "grid search".
In effect this is the same traversal we often write by hand. We recommend using the GridSearchCV function inside sklearn, which is simple and fast.
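Under the hood, the enumeration is nothing more than a Cartesian product. A minimal sketch with a made-up 3×4 grid and a dummy scoring function standing in for "train and evaluate a model":

```python
from itertools import product

# Hypothetical candidate values: 3 options for A, 4 for B -> 12 combinations
grid = {'A': [1, 2, 3], 'B': [10, 20, 30, 40]}

def score(a, b):
    """Dummy stand-in for 'train a model with (a, b) and evaluate it'."""
    return -(a - 2) ** 2 - (b - 30) ** 2

# Traverse every cell of the grid and keep the best-scoring combination
combos = list(product(grid['A'], grid['B']))
best = max(combos, key=lambda ab: score(*ab))
print(len(combos))  # 12
print(best)         # (2, 30)
```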
import xgboost as xgb
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
train_data = pd.read_csv('train.csv')  # fetch data
y = train_data.pop('30').values  # pop the label column out of the training data as the target; '30' is the label's column name
col = train_data.columns
x = train_data[col].values  # the remaining columns are the training features
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.333, random_state=0)  # split into training and validation sets
# no DMatrix is needed here
parameters = {
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'n_estimators': [500, 1000, 2000, 3000, 5000],
    'min_child_weight': [0, 2, 5, 10, 20],
    'max_delta_step': [0, 0.2, 0.6, 1, 2],
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
    'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
    'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1]}
xlf = xgb.XGBClassifier(max_depth=10, learning_rate=0.01, n_estimators=2000,
                        silent=True, objective='binary:logistic', nthread=-1,
                        gamma=0, min_child_weight=1, max_delta_step=0,
                        subsample=0.85, colsample_bytree=0.7, colsample_bylevel=1,
                        reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                        seed=1440, missing=None)
# with GridSearchCV we do not call fit on the model directly
gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)
gsearch.fit(train_x, train_y)
print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
It is important to note that Grid Search requires cross-validation support. cv=3 is an int, meaning 3-fold cross-validation. In fact, cv can also be an object or other types, representing different validation strategies, as the following statement from the documentation explains.
Possible inputs for cv are:
- None, to use the default 3-fold cross-validation,
- integer, to specify the number of folds.
- An object to be used as a cross-validation generator.
- An iterable yielding train/test splits.
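For example, passing a cross-validation generator object instead of an int lets you control shuffling and stratification. A minimal, runnable sketch on sklearn's built-in iris data, using LogisticRegression as a stand-in model so it runs quickly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

x, y = load_iris(return_X_y=True)

# A cv object instead of an int: stratified 3-fold with shuffling
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

gsearch = GridSearchCV(LogisticRegression(max_iter=1000),
                       param_grid={'C': [0.1, 1, 10]},
                       scoring='accuracy', cv=cv)
gsearch.fit(x, y)
print(gsearch.best_params_)
```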