The decision tree
1 Basic Concepts
Amount of information (information content): measures the uncertainty of a single event. The more uncertain the event, the more information its occurrence carries. Uncertainty is defined by the probability of the event, and the information content is the negative logarithm of that probability.
Information entropy: measures the uncertainty of a set of events; it is the expectation of the information content over all events in the set.
Relative entropy (KL divergence): an asymmetric measure of the difference between two probability distributions. From the viewpoint of information theory it describes the difference in information content between the two distributions, which is why it is also called relative entropy.
Cross entropy: the information entropy of the true distribution plus the KL divergence between the true distribution and the model distribution. The entropy of the true distribution is fixed and does not depend on the model parameters θ, so when deriving gradient descent, minimizing cross entropy is equivalent to minimizing KL divergence (relative entropy).
Joint entropy: the information entropy of the larger event set formed by combining two event sets.
Conditional entropy: the conditional entropy of event set Y equals the joint entropy minus the information entropy of event set X; it measures how much uncertainty about Y remains once X is known. The standard formulas for all of these quantities are collected just below.
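In standard notation, with p and q denoting probability distributions and X, Y denoting event sets (random variables), these quantities are:

$$I(x) = -\log p(x), \qquad H(X) = -\sum_{x} p(x)\log p(x)$$

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}, \qquad H(p, q) = -\sum_{x} p(x)\log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

$$H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y), \qquad H(Y \mid X) = H(X, Y) - H(X)$$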
2 Principle of decision tree
2.1 the principle
Decision Tree is a non-parametric supervised learning method. It summarizes decision rules from data with features and labels and presents these rules as a tree structure; it can solve both classification and regression problems. The goal of a decision tree is to build a tree that can be used for decision making, consisting of one root node, several internal (non-leaf) nodes, and several leaf nodes. Each leaf node corresponds to a class, and each non-leaf node corresponds to a test on a feature value.
2.2 process
- Starting from the root node, compute the impurity of each candidate feature and split value, and choose the split that yields the lowest impurity (highest purity) for the current node
- Using the chosen split value, divide the data into two parts and build the left and right subtrees recursively
- If the information gain before and after a split does not exceed the threshold, or all samples at the current node belong to the same class, splitting stops (a minimal recursive sketch is given after this list)
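A minimal sketch of this recursive procedure, assuming numeric features, non-negative integer class labels, and entropy as the impurity measure; the names entropy, best_split and build_tree are illustrative and not part of any library:

import numpy as np

def entropy(y):
    # empirical entropy of an integer label vector
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    # return (feature index, threshold, information gain) of the best binary split
    best_f, best_t, best_gain = None, None, 0.0
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = entropy(y) - cond
            if gain > best_gain:
                best_f, best_t, best_gain = f, t, gain
    return best_f, best_t, best_gain

def build_tree(X, y, depth=0, max_depth=5, min_gain=1e-3):
    # stop if the node is pure or the depth limit is reached
    if len(np.unique(y)) == 1 or depth >= max_depth:
        return {"leaf": int(np.bincount(y).argmax())}
    f, t, gain = best_split(X, y)
    # stop if even the best split gains too little
    if f is None or gain < min_gain:
        return {"leaf": int(np.bincount(y).argmax())}
    mask = X[:, f] <= t
    return {"feature": f, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_gain),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_gain)}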
2.3 How to determine the optimal split point
ID3 – Information gain
The information gain of each candidate feature is computed, and the feature with the maximum information gain is selected as the best split feature for the current node.
For sample set D, the number of categories is K, and the empirical entropy of data set D is expressed as
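$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$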
Why take the logarithm in this formula, and why the negative sign? The logarithm is largely a matter of definition and convenience: it turns products of probabilities into sums, which makes computation easier. The negative sign is needed because the logarithm of a probability (which is at most 1) is never positive; without it, entropy would be negative, which contradicts the intuition that higher uncertainty should mean higher entropy. For example, a fair coin has entropy 1 bit, while a coin that lands heads with probability 0.9 has entropy of only about 0.47 bits.
where C_k is the subset of samples in D that belong to class k, |C_k| is the number of samples in that subset, and |D| is the number of samples in D.
Next, compute the empirical conditional entropy H(D|A) of feature A for dataset D:
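$$H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$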
where D_i is the subset of samples in D for which feature A takes its i-th value, and D_ik is the subset of samples in D_i that belong to class k.
The information gain g(D, A) is then the difference between the two:
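$$g(D, A) = H(D) - H(D \mid A)$$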
Conclusion: ID3 prefers features with many distinct values, can only handle discrete features, and is sensitive to missing values.
C4.5 – Information gain rate
The information gain ratio of feature A to dataset D is defined as
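$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}$$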
where H_A(D) is the entropy of dataset D with respect to the values of feature A:
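$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$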
Conclusion: C4.5 uses the information gain ratio to overcome ID3's preference for features with many values: the entropy of the feature is introduced as the denominator to penalize high-cardinality features. As a result, C4.5 in turn tends to prefer features with fewer distinct values. It can handle both discrete and continuous data, and it can build multi-way trees.
CART – Gini index
The Gini index describes the impurity of the data, playing a role similar to information entropy:
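$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left(\frac{|C_k|}{|D|}\right)^2$$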
In each iteration, CART selects the feature with the smallest Gini index, together with its corresponding split point, for classification. Unlike ID3 and C4.5, CART is a binary tree: at each step the data is cut into two parts according to the value of feature A, which go to the left and right subtrees respectively. The Gini index of feature A is defined as
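$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where D_1 and D_2 are the two parts into which D is split according to the value of feature A.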
Conclusion: CART supports both classification and regression, and can handle both discrete and continuous data. It only builds binary trees, but a feature that has already been split on can be split on again.
2.4 When does splitting stop?
- The information gain before and after splitting is less than the given threshold
- The number of samples of the current node is less than the given threshold
- The samples of the current node belong to the same category
- When the tree reaches a certain depth
2.5 How do I prune?
- Pre-pruning: during training, if a split does not improve the prediction accuracy, the split is abandoned
- Post-pruning: first train a complete tree, then work upward from the leaf nodes, computing the accuracy before and after removing the current node; if accuracy improves after removing the node, prune that branch (see the sketch after this list)
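For illustration, sklearn implements post-pruning as cost-complexity pruning rather than the accuracy-based procedure described above. A minimal sketch, assuming Xtrain, Ytrain, Xtest, Ytest come from an earlier train_test_split:

from sklearn.tree import DecisionTreeClassifier

def prune_by_ccp(Xtrain, Ytrain, Xtest, Ytest):
    # Compute the candidate pruning strengths (alphas) on the training set,
    # then pick the alpha whose pruned tree scores best on the held-out data.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtrain, Ytrain)
    best_alpha, best_score = 0.0, 0.0
    for alpha in path.ccp_alphas:
        clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(Xtrain, Ytrain)
        score = clf.score(Xtest, Ytest)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score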
2.6 Advantages and disadvantages of decision tree
advantages
- The calculation is simple and easy to explain
- There is no need to normalize data
disadvantages
- A single tree overfits easily
- Unstable: small changes in the data can produce a very different tree
- The greedy algorithm only guarantees a local optimum, not a global optimum
3 Decision tree case
3.1 Build a decision tree
steps
- Instantiate: create the estimator (model) object
- Train the model through model interface
- Obtain the required information through the model interface
Decision tree in sklearn
- Eight parameters: criterion, two randomness-related parameters (random_state, splitter), five pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease)
- One attribute: feature_importances_
- Four interfaces: fit, score, apply and predict
code
import pandas as pd
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split


def buildTree():
    # Load the wine dataset
    wine = load_wine()
    print(wine.data.shape)
    print(wine.target)
    # Organize wine into a table
    pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)
    print(wine.feature_names)
    print(wine.target_names)
    # Split into training set and test set
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)
    print(Xtrain.shape)
    print(Xtest.shape)
    # Build the model
    # random_state sets the random seed used when branching; splitter controls how split points are chosen
    # max_depth limits the maximum depth of the tree; min_samples_leaf & min_samples_split set the minimum
    # number of training samples a node must contain after / before a split
    clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=30)
    clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    print(score)
    # Draw the tree
    import graphviz
    feature_name = wine.feature_names
    dot_data = tree.export_graphviz(clf, feature_names=feature_name,
                                    class_names=['class_0', 'class_1', 'class_2'],
                                    filled=True, rounded=True)
    graph = graphviz.Source(dot_data)
    print(graph)
    # Explore the feature importances of the decision tree
    print(clf.feature_importances_)
    feature_importance = [*zip(feature_name, clf.feature_importances_)]
    print(feature_importance)
    # apply() returns the index of the leaf node each test sample ends up in
    print(clf.apply(Xtest))
    # predict() returns the predicted class of each test sample: a one-dimensional array of size N
    # whose i-th value is the predicted label of the i-th sample
    print(clf.predict(Xtest))
    # predict_proba() returns an array of N rows and K columns; the value in row i, column j is the
    # probability that the model assigns label j to the i-th sample. Each row sums to 1
    print(clf.predict_proba(Xtest))
3.2 Determine the optimal pruning parameter by searching over max_depth
import matplotlib.pyplot as plt
from sklearn import tree


def max_parameter(Xtrain, Ytrain, Xtest, Ytest):
    # Train a tree for each max_depth from 1 to 10 and plot the test score
    test = []
    for i in range(10):
        clf = tree.DecisionTreeClassifier(max_depth=i + 1, criterion="entropy",
                                          random_state=30, splitter="random")
        clf = clf.fit(Xtrain, Ytrain)
        score = clf.score(Xtest, Ytest)
        test.append(score)
    plt.plot(range(1, 11), test, color="red", label="max_depth")
    plt.legend()
    plt.show()
3.3 K-fold cross-validation
def cross_val():
    """K-fold cross-validation."""
    # note: load_boston was removed in scikit-learn 1.2; this snippet assumes an older version
    from sklearn.datasets import load_boston
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor
    boston = load_boston()
    regressor = DecisionTreeRegressor(random_state=0)
    val_score = cross_val_score(regressor, boston.data, boston.target, cv=10,
                                scoring="neg_mean_squared_error")
    print(val_score)
3.4 Prediction of Titanic survivors
def titanic_example():
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score
    import matplotlib.pyplot as plt
    import numpy as np
    data = pd.read_csv(r"data.csv", index_col=0)
    print(data.head())
    print(data.info())
    # Delete columns with too many missing values, and columns judged by observation to be unrelated to the target y
    data.drop(["Cabin", "Name", "Ticket"], inplace=True, axis=1)
    # Handle missing values: fill the column with many missing values (Age); rows missing only one or two values can simply be dropped
    data["Age"] = data["Age"].fillna(data["Age"].mean())
    data = data.dropna()
    # Convert the binary variable to a numeric variable
    # Unlike apply(lambda x: int(x)), astype converts text classes to numbers; it is a convenient way to map a binary feature to 0/1
    data["Sex"] = (data["Sex"] == "male").astype("int")
    # Convert the three-valued variable to a numeric variable
    labels = data["Embarked"].unique().tolist()
    data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))
    # View the processed dataset
    print(data.head())
    # Extract the label and the feature matrix, then split into training set and test set
    X = data.iloc[:, data.columns != "Survived"]
    y = data.iloc[:, data.columns == "Survived"]
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3)
    # Reset the indexes of the training and test sets
    for i in [Xtrain, Xtest, Ytrain, Ytest]:
        i.index = range(i.shape[0])
    # View the training set and test set
    print(Xtrain.head())
    # Build the model
    clf = DecisionTreeClassifier(random_state=25)
    clf = clf.fit(Xtrain, Ytrain)
    score_ = clf.score(Xtest, Ytest)
    print(score_)
    score = cross_val_score(clf, X, y, cv=10).mean()
    print(score)
    # Observe how the model fits at different max_depth values
    tr = []
    te = []
    for i in range(10):
        clf = DecisionTreeClassifier(random_state=25
                                     , max_depth=i + 1
                                     , criterion="entropy"
                                     )
        clf = clf.fit(Xtrain, Ytrain)
        score_tr = clf.score(Xtrain, Ytrain)
        score_te = cross_val_score(clf, X, y, cv=10).mean()
        tr.append(score_tr)
        te.append(score_te)
    print(max(te))
    plt.plot(range(1, 11), tr, color="red", label="train")
    plt.plot(range(1, 11), te, color="blue", label="test")
    plt.xticks(range(1, 11))
    plt.legend()
    plt.show()
    # Tune parameters with grid search
    gini_thresholds = np.linspace(0, 0.5, 20)
    parameters = {'splitter': ('best', 'random'),
                  'criterion': ("gini", "entropy"),
                  "max_depth": [*range(1, 10)],
                  'min_samples_leaf': [*range(1, 50, 5)],
                  'min_impurity_decrease': [*gini_thresholds]}
    clf = DecisionTreeClassifier(random_state=25)
    GS = GridSearchCV(clf, parameters, cv=10)
    GS.fit(Xtrain, Ytrain)
    print(GS.best_params_)
    print(GS.best_score_)