Use XGBoost in Python

This article is based on the official XGBoost documentation tutorial. I rewrote it because parts of the original were unclear and some of it contained errors. The data can be downloaded from GitHub.

Preliminary code

Import the libraries and define the required helper functions

import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
import pandas as pd
import matplotlib
%matplotlib inline

def GetNewDataByPandas():
    wine = pd.read_csv("/Data/UCI/wine/wine.csv")
    # Add two derived features
    wine['alcohol**2'] = pow(wine["alcohol"], 2)
    wine['volatileAcidity*alcohol'] = wine["alcohol"] * wine['volatile acidity']
    # The label is the quality score; the features are all other columns
    y = np.array(wine.quality)
    X = np.array(wine.drop("quality", axis=1))
    # X = np.array(wine)

    # Note: this list of column names still includes 'quality'
    columns = np.array(wine.columns)

    return X, y, columns

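You can also load the data directly from the CSV file with NumPy

file_path = "/home/fonttian/Data/UCI/wine/wine.csv"
data = np.genfromtxt(file_path, delimiter=',')
# Row 0 is the header (parsed as NaN), so the data rows start at index 1.
# Columns 0-10 are the features and column 11 is the quality label.
dtrain_2 = xgb.DMatrix(data[1:1119, 0:11], data[1:1119, 11])
dtest_2 = xgb.DMatrix(data[1119:1599, 0:11], data[1119:1599, 11])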
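The row and column slicing used when loading a CSV directly can be mirrored with the standard library, which makes the indexing easier to follow. This is only a minimal sketch, not the author's code: the `split_rows` helper and the tiny inline CSV are hypothetical stand-ins for wine.csv, and the label is assumed to be the last column.

```python
import csv
import io

def split_rows(csv_text, n_train):
    """Split CSV text (with a header row) into train/test features and labels.

    Hypothetical helper: the last column is assumed to be the label.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]           # row 0 is the header
    data = [[float(v) for v in row] for row in body]
    train, test = data[:n_train], data[n_train:]
    x_train = [r[:-1] for r in train]          # every column except the last
    y_train = [r[-1] for r in train]           # the last column is the label
    x_test = [r[:-1] for r in test]
    y_test = [r[-1] for r in test]
    return header, x_train, y_train, x_test, y_test

# Made-up sample data, for illustration only.
sample = "alcohol,pH,quality\n9.4,3.51,5\n9.8,3.20,5\n10.0,3.16,6\n"
header, x_train, y_train, x_test, y_test = split_rows(sample, n_train=2)
print(header, len(x_train), len(x_test))
```

The first `n_train` data rows become the training set and the rest the test set, just as the two row ranges do in the NumPy version.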

Load the data

Read the data and split it

# Read wine quality data from file
X, y, wineNames = GetNewDataByPandas()
# X, y, wineNames = GetDataByPandas()
# First hold out 10% of the rows as the final prediction set
x_train, x_predict, y_train, y_predict = train_test_split(X, y, test_size=0.10, random_state=100)


# Then hold out 20% of the remaining rows as the validation set
x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=0.2, random_state=100)

Display data

wineNames

array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol',
       'quality', 'alcohol**2', 'volatileAcidity*alcohol'], dtype=object)

print(len(x_train),len(y_train))
print(len(x_test))
print(len(x_predict))

1151 1151
288
160

Load the data into a DMatrix

The missing parameter sets the value that is treated as a missing value (here -999.0)
You can also set weights if necessary

dtrain = xgb.DMatrix(data=x_train,label=y_train,missing=-999.0)
dtest = xgb.DMatrix(data=x_test,label=y_test,missing=-999.0)

# Optional: one weight per training row
# w = np.random.rand(len(x_train))
# dtrain = xgb.DMatrix(x_train, label=y_train, missing=-999.0, weight=w)
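The weight argument expects one weight per training row. As an illustration only, the sketch below derives inverse-frequency weights from the labels so that rare quality scores count more. The `inverse_frequency_weights` helper and this weighting scheme are assumptions made for the example; the tutorial itself does not use them.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return one weight per label, proportional to 1 / class frequency.

    The weights are scaled so that they average 1.0 over the data set.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

# Made-up labels for illustration: quality 5 is common, 8 is rare.
labels = [5, 5, 5, 6, 6, 8]
weights = inverse_frequency_weights(labels)
print(weights)
```

The resulting list has the same length as the labels, so it could be passed as weight= when constructing the DMatrix.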

Set the parameters

Booster parameters

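# Note: recent XGBoost releases rename 'reg:linear' to 'reg:squarederror' and replace 'silent' with 'verbosity'
param = {'max_depth': 7, 'eta': 1, 'silent': 1, 'objective': 'reg:linear'}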
param['nthread'] = 4
param['seed'] = 100
param['eval_metric'] = 'auc'

You can also specify multiple eval metrics

param['eval_metric'] = ['auc', 'ams@0']
# This is a regression task, so we use RMSE only
param['eval_metric'] = ['rmse']
param['eval_metric']

['rmse']

Specify the validation sets used to monitor performance

evallist = [(dtest, 'eval'), (dtrain, 'train')]

Training

Train the model

In the previous code, I split the data roughly 72:18:10 into training data, performance monitoring (validation) data, and final prediction data: 10% of the rows are held out first, then 20% of the remainder. This ratio is only an example and is not a recommendation.
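The proportions produced by the two chained train_test_split calls follow from simple arithmetic. The sketch below reproduces the set sizes printed earlier; the `two_stage_split` helper is hypothetical, and it assumes the held-out size is rounded up, which is how scikit-learn treats a float test_size.

```python
import math

def two_stage_split(n_rows, predict_frac, test_frac):
    """Return (train, test, predict) sizes for two chained hold-out splits.

    Assumes each held-out size is rounded up to a whole number of rows.
    """
    n_predict = math.ceil(n_rows * predict_frac)
    remaining = n_rows - n_predict
    n_test = math.ceil(remaining * test_frac)
    n_train = remaining - n_test
    return n_train, n_test, n_predict

sizes = two_stage_split(1599, predict_frac=0.10, test_frac=0.2)
print(sizes)                                  # (1151, 288, 160)
print([round(s / 1599, 2) for s in sizes])    # [0.72, 0.18, 0.1]
```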

When the data set is small, you can keep only the prediction data separate and not split a validation set off from the training set. It is also feasible to use cross-validation instead of a dedicated performance monitoring (validation) set. Consider your own situation and choose accordingly.

num_round = 10
bst_without_evallist = xgb.train(param, dtrain, num_round)

num_round = 10
bst_with_evallist = xgb.train(param, dtrain, num_round, evallist)

[0] eval-rmse:0.793859 train-rmse:0.68806
[1] eval-rmse:0.796485 train-rmse:0.474253
[2] eval-rmse:0.796662 train-rmse:0.450195
...
[4] eval-rmse:0.789566 train-rmse:0.340342
...
[6] eval-rmse:0.804898 train-rmse:0.244093
[7] eval-rmse:0.813786 train-rmse:...
[8] eval-rmse:... train-rmse:0.190969
[9] eval-rmse:0.814219 train-rmse:0.159447

Model persistence

models_path = "/home/fonttian/Models/Sklearn_book/xgboost/"
bst_with_evallist.save_model(models_path+'bst_with_evallist_0001.model')

The model and its feature map can also be dumped to text files

# dump model
bst_with_evallist.dump_model(models_path+'dump.raw.txt')
# dump model with feature map
bst_with_evallist.dump_model(models_path+'dump.raw.txt', models_path+'featmap.txt')
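dump_model reads the feature names from a feature map file, which has to exist before the call. The sketch below writes one, assuming the commonly used layout of one feature per line with index, name, and type (q for quantitative) separated by tabs. The `write_featmap` helper is hypothetical, and spaces in names are replaced with underscores as a precaution, since the file is whitespace-delimited.

```python
import os
import tempfile

def write_featmap(feature_names, path):
    """Write a feature map: one tab-separated 'index, name, type' line per feature."""
    with open(path, "w") as f:
        for i, name in enumerate(feature_names):
            # 'q' marks a quantitative (numeric) feature.
            f.write("{0}\t{1}\tq\n".format(i, name.replace(" ", "_")))

# A few of the wine columns, for illustration.
names = ["fixed acidity", "volatile acidity", "alcohol"]
path = os.path.join(tempfile.mkdtemp(), "featmap.txt")
write_featmap(names, path)
with open(path) as f:
    featmap_lines = f.read().splitlines()
print(featmap_lines)
```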

A saved model can be loaded as follows:

bst_with_evallist = xgb.Booster({'nthread': 4})  # init model
bst_with_evallist.load_model(models_path+'bst_with_evallist_0001.model')  # load the saved model

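Early stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there is more than one, it uses the last one. With the evallist defined above, the last entry is the training set, so early stopping monitors train-rmse; put the validation set last if you want it to monitor the validation error instead.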

train(... , evals=evals, early_stopping_rounds=10)

The model trains until the validation score stops improving. The validation error needs to decrease at least once every early_stopping_rounds rounds for training to continue.

If early stopping occurs, the model will have three additional fields: bst.best_score, bst.best_iteration, and bst.best_ntree_limit. Note that train() returns the model from the last iteration, not the best one.

This works both with metrics to minimize (RMSE, log loss, etc.) and with metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric, the last one in param['eval_metric'] is used for early stopping.
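The stopping rule itself is easy to express in plain Python. The following is a simplified sketch of the idea, not XGBoost's implementation, whose bookkeeping may differ in detail; the `early_stopping` function and the score values are made up for illustration.

```python
def early_stopping(scores, patience, maximize=False):
    """Return (best_iteration, best_score, stopped_at) for a score sequence.

    Training stops at the first iteration that is `patience` rounds past
    the best one; stopped_at is None if that never happens.
    """
    best_i, best = 0, scores[0]
    for i, score in enumerate(scores):
        improved = score > best if maximize else score < best
        if improved:
            best_i, best = i, score
        elif i - best_i >= patience:
            return best_i, best, i
    return best_i, best, None

# Made-up validation RMSE values (lower is better).
rmse_history = [0.80, 0.78, 0.77, 0.79, 0.78, 0.775, 0.79]
print(early_stopping(rmse_history, patience=3))   # (2, 0.77, 5)
```

For a metric such as AUC, pass maximize=True so that larger scores count as improvements.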

bst_with_evallist_and_early_stopping_10 = xgb.train(param, dtrain, num_round*100, evallist,early_stopping_rounds=10)

[0] eval-rmse:0.793859 train-rmse:0.68806
Multiple eval metrics have been passed: 'train-rmse' will be used for early stopping.

Will train until train-rmse hasn't improved in 10 rounds.
[1] eval-rmse:0.796485 train-rmse:0.474253
[2] eval-rmse:0.796662 train-rmse:0.450195
[3] eval-rmse:0.778571 train-rmse:0.400886
...
[57] eval-rmse:0.822859 train-rmse:0.000586
Stopping. Best iteration:
[48] eval-rmse:0.822859 train-rmse:0.000586

bst_with_evallist_and_early_stopping_100 = xgb.train(param, dtrain, num_round*100, evallist,early_stopping_rounds=100)

[0] eval-rmse:0.793859 train-rmse:0.68806
Multiple eval metrics have been passed: 'train-rmse' will be used for early stopping.

Will train until train-rmse hasn't improved in 100 rounds.
[1] eval-rmse:0.796485 train-rmse:0.474253
[2] eval-rmse:0.796662 train-rmse:0.450195
...
[148] eval-rmse:0.822859 train-rmse:0.000586
Stopping. Best iteration:
[48] eval-rmse:0.822859 train-rmse:0.000586

Prediction

Predicted results

Once you have a trained or loaded model and the data is ready, you can make predictions.

dpredict = xgb.DMatrix(x_predict)
# Model trained without an evaluation list
ypred_without_evallist = bst_without_evallist.predict(dpredict)
# Model trained with an evaluation list but without early stopping
ypred_with_evallist = bst_with_evallist.predict(dpredict)
# Model trained with early stopping: predict with the best iteration
ypred_with_evallist_and_early_stopping_100 = bst_with_evallist_and_early_stopping_100.predict(dpredict, ntree_limit=bst_with_evallist_and_early_stopping_100.best_ntree_limit)

Test error

Now we can compare the performance of the three models trained above. The results are shown below. Note, however, that this code is mainly meant to demonstrate usage and is not representative. In fact, the final model performs rather poorly, and in a later blog post I'll show other usages and data mining techniques on the same data set to get a clearly better model.

from sklearn.metrics import mean_squared_error
print("RMSE of bst_without_evallist:", np.sqrt(mean_squared_error(y_true=y_predict,y_pred=ypred_without_evallist)))

print("RMSE of bst_with_evallist:", np.sqrt(mean_squared_error(y_true=y_predict,y_pred=ypred_with_evallist)))

print("RMSE of bst_with_evallist_and_early_stopping_100:", np.sqrt(mean_squared_error(y_true=y_predict,y_pred=ypred_with_evallist_and_early_stopping_100)))

RMSE of bst_without_evallist: 0.7115641528672897
RMSE of bst_with_evallist: 0.7115641528672897
RMSE of bst_with_evallist_and_early_stopping_100: 0.7051095825211103
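The quantity being compared, the square root of the mean squared error, can be written out by hand to make clear what is computed. A minimal standard-library sketch with made-up values (the `rmse` helper is hypothetical and is not part of scikit-learn or XGBoost):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up values for illustration.
print(rmse([5, 6, 7], [5, 6, 7]))         # 0.0
print(rmse([1, 1, 1, 1], [3, 3, 3, 3]))   # 2.0
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.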

Plotting

You can use the plotting module to plot feature importance and the output trees. If you want the importance ranking directly, you can use the get_score method or the get_fscore method; the difference is that get_score accepts an importance_type parameter that selects how importance is measured.
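When a DMatrix is built from a plain NumPy array, as in this article, features are referred to by position (f0, f1, ...), so an importance dictionary is easier to read once its keys are mapped back to column names. A minimal sketch, using a hypothetical score dictionary in the shape returned by get_score (the numbers are made up) and a hypothetical `rank_importance` helper:

```python
def rank_importance(scores, feature_names):
    """Map 'f<index>' keys to column names and sort by score, highest first."""
    named = {feature_names[int(key[1:])]: value for key, value in scores.items()}
    return sorted(named.items(), key=lambda item: item[1], reverse=True)

# Column order of X: the wine columns without 'quality' (first three shown).
feature_names = ["fixed acidity", "volatile acidity", "citric acid"]
# Made-up scores; features never used in a split are absent from the dict.
scores = {"f0": 12, "f2": 30}
ranking = rank_importance(scores, feature_names)
print(ranking)   # [('citric acid', 30), ('fixed acidity', 12)]
```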

To draw importance, use plot_importance. This function requires matplotlib installed.

xgb.plot_importance(bst_with_evallist_and_early_stopping_100)

<matplotlib.axes._subplots.AxesSubplot at 0x7f2707a6a080>

To draw a tree with matplotlib, use plot_tree and specify the index of the target tree. This function requires Graphviz and matplotlib. Note that Graphviz not only needs to be installed with pip install graphviz but also needs to be installed on your system, otherwise this part of XGBoost will not work. For details, you can refer to another article of mine.

We also need to pass a matplotlib ax to plot_tree to control the size of the output image.

# xgb.plot_tree(bst, num_trees=2)
xgb.to_graphviz(bst_with_evallist_and_early_stopping_100, num_trees=2)

import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([18.5, 18.5, 10.5, 10.5])  # [left, bottom, width, height]
xgb.plot_tree(bst_with_evallist_and_early_stopping_100, num_trees=2,ax=ax)
plt.show()