A benefit of using ensembles of decision tree methods such as gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.

In this article, you will discover how to estimate feature importance for a predictive modeling problem using the XGBoost library in Python. After reading this article, you will know:

  • How feature importance is calculated with the gradient boosting algorithm.
  • How to plot feature importance calculated by an XGBoost model in Python.
  • How to use feature importance calculated by XGBoost to perform feature selection.

Feature importance in gradient boosting

A benefit of using gradient boosting is that, after the boosted trees are constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other. Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points, or another more specific error function. The feature importances are then averaged across all of the decision trees within the model. For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1, “Relative Importance of Predictor Variables”, in The Elements of Statistical Learning: Data Mining, Inference, and Prediction (page 367). Also see Matthew Drury’s answer to the Stack Overflow question “Relative variable importance for Boosting”, where he gives a very detailed and practical answer.
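For reference, the calculation can be written down compactly. The sketch below follows the notation of ESL Section 10.13.1 (paraphrased from memory, so check the book for the exact statement): for a single tree $T$ with $J-1$ internal nodes, the squared relative importance of variable $\ell$ is

$$\mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}\big(v(t) = \ell\big),$$

where $\hat{i}_t^2$ is the improvement in the splitting criterion at internal node $t$ and $v(t)$ is the variable used to split at that node. For an ensemble of $M$ boosted trees, the scores are simply averaged:

$$\mathcal{I}_\ell^2 = \frac{1}{M} \sum_{m=1}^{M} \mathcal{I}_\ell^2(T_m).$$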

Plot feature importance manually

A trained XGBoost model automatically calculates feature importance for your predictive modeling problem. These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows:

print(model.feature_importances_)
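As a side note, feature_importances_ reflects only one of several importance measures XGBoost can compute. If you want the raw per-feature scores for a different measure, recent versions of the library expose them through the underlying Booster object. A minimal sketch, assuming a fitted XGBClassifier named model:

# retrieve the raw importance scores from the underlying Booster
# importance_type can be 'weight', 'gain', 'cover', 'total_gain' or 'total_cover'
scores = model.get_booster().get_score(importance_type='gain')
print(scores)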

We can plot the feature_importances_ scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example:

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances.

Download the data set and place it in the current working directory.

Data set file:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

Data set details:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example first outputs the importance scores.

[0.089701    0.17109634  0.08139535  0.04651163  0.10465116  0.2026578   0.1627907   0.14119601]

We also get a bar chart of relative importance.

A downside of this plot is that the features are ordered by their input index rather than by their importance. We could sort the features by importance before plotting.
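For example, here is a minimal sketch of sorting the importances manually before plotting, assuming the model from the previous listing is still in memory (it uses numpy.argsort to order the bars and labels each bar with its original feature index):

# plot feature importance sorted from least to most important
from numpy import argsort
from matplotlib import pyplot
importances = model.feature_importances_
indices = argsort(importances)
pyplot.bar(range(len(importances)), importances[indices])
pyplot.xticks(range(len(importances)), indices)
pyplot.show()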

Thankfully, there is a built-in plotting function that takes care of this for us.

The XGBoost library provides a built-in function to plot features ordered by their importance. The function is called plot_importance() and can be used as follows:

# plot feature importance
plot_importance(model)
pyplot.show()
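plot_importance() also accepts a few optional arguments. As a hedged sketch (the exact defaults may vary by library version), importance_type selects which measure is plotted and max_num_features limits the chart to the top-ranked features:

# plot the top 5 features ranked by gain instead of the default split count
plot_importance(model, importance_type='gain', max_num_features=5)
pyplot.show()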

For example, below is the complete code listing that plots the feature importances for the Pima Indians dataset using the built-in plot_importance() function.

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example will give us a more useful bar chart.

You can see that the features are automatically named according to their index in the input array (X), from F0 to F7. Mapping these indices back to the names in the problem description shows that F5 (body mass index) has the highest importance and F3 (skin fold thickness) has the lowest importance.
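If you prefer readable labels, this mapping can also be done in code. A minimal sketch, assuming the standard column order of the Pima Indians dataset (the short names below are illustrative, not taken from the data file itself):

# pair each importance score with a human-readable feature name
feature_names = ['pregnancies', 'glucose', 'blood_pressure', 'skin_thickness',
                 'insulin', 'bmi', 'pedigree', 'age']
for name, score in sorted(zip(feature_names, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
 print('%s: %.4f' % (name, score))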

Feature selection with XGBoost feature importance scores

Feature importance scores can be used for feature selection in scikit-learn. This is done using the SelectFromModel class, which takes a model and can transform a dataset into a subset containing the selected features. The class can accept a pre-trained model, such as one trained on the entire training dataset. It then uses a threshold to decide which features to select. This threshold is applied when you call the transform() method on a SelectFromModel instance, so that the same features are consistently selected on the training and test datasets.

In the example below, we first train and evaluate an XGBoost model on the entire training and test datasets. The model is then wrapped in a SelectFromModel instance, using the feature importances calculated on the training dataset. We use this to select features on the training dataset, train a model on the selected subset of features, and then evaluate that model on the test set, subjected to the same feature selection scheme.

For example:

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

Out of interest, we can test multiple thresholds for selecting features by their importance. Specifically, the feature importance of each input variable essentially lets us test every subset of features by importance, starting with all of the features and ending with the subset containing only the most important feature.

The complete code listing is provided below:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
 # select features using threshold
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X_train = selection.transform(X_train)
 # train model
 selection_model = XGBClassifier()
 selection_model.fit(select_X_train, y_train)
 # eval model
 select_X_test = selection.transform(X_test)
 y_pred = selection_model.predict(select_X_test)
 predictions = [round(value) for value in y_pred]
 accuracy = accuracy_score(y_test, predictions)
 print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Note that if you are using XGBoost 1.0.2 (and perhaps other versions), there is a bug in the XGBClassifier class that results in the error:

KeyError: 'weight'

This can be fixed by using a custom XGBClassifier class that returns None for the coef_ property. The complete example is listed below.

# use feature importance for feature selection, with fix for xgboost 1.0.2
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
 
# define custom class to fix bug in xgboost 1.0.2
class MyXGBClassifier(XGBClassifier):
 @property
 def coef_(self):
  return None
 
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = MyXGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
 # select features using threshold
 selection = SelectFromModel(model, threshold=thresh, prefit=True)
 select_X_train = selection.transform(X_train)
 # train model
 selection_model = XGBClassifier()
 selection_model.fit(select_X_train, y_train)
 # eval model
 select_X_test = selection.transform(X_test)
 predictions = selection_model.predict(select_X_test)
 accuracy = accuracy_score(y_test, predictions)
 print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example prints the following output.

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

We can see that the performance of the model generally decreases as the number of selected features decreases.

In this case, trading off model complexity against test-set accuracy, we might decide to use a less complex model (with fewer attributes, such as n=4) and accept a modest decrease in estimated accuracy, from 77.95% down to 76.38%.
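As a sketch of how you might lock in that choice (the threshold value 0.128 is taken from the run above and will differ on your machine), you could refit a final model on only the selected features:

# refit a final model using the threshold chosen above (example value)
final_selection = SelectFromModel(model, threshold=0.128, prefit=True)
final_X_train = final_selection.transform(X_train)
final_model = XGBClassifier()
final_model.fit(final_X_train, y_train)
# apply the same transform to the test data before predicting
final_predictions = final_model.predict(final_selection.transform(X_test))
print("Final accuracy: %.2f%%" % (accuracy_score(y_test, final_predictions) * 100.0))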

Such a trade-off is likely to be a wash on a dataset this small, but it may be a more useful strategy on a larger dataset, using cross-validation as the model evaluation scheme.
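A hedged sketch of that idea, using a scikit-learn Pipeline so the feature selection is refit inside each cross-validation fold (the threshold value here is arbitrary and purely illustrative):

# evaluate threshold-based feature selection with cross-validation
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
pipeline = Pipeline([
 ('select', SelectFromModel(XGBClassifier(), threshold=0.1)),
 ('model', XGBClassifier())
])
scores = cross_val_score(pipeline, X, Y, cv=5, scoring='accuracy')
print("Mean accuracy: %.2f%% (std %.2f%%)" % (scores.mean() * 100.0, scores.std() * 100.0))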

Author: Yishui Hancheng, CSDN blogger expert; research interests: machine learning, deep learning, NLP, CV

Blog: yishuihancheng.blog.csdn.net


Read more

Time series prediction with XGBoost

5 minutes to master Python random hill-climbing algorithm

5 minutes to fully understand association rule mining algorithm
