In this article, we will implement a classification decision tree using Python's sklearn framework. The dataset used throughout is sklearn's built-in breast cancer classification dataset.
1. Basic implementation
from sklearn.tree import DecisionTreeClassifier as dtc
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
cancer=datasets.load_breast_cancer()
print(cancer.keys())
The dataset is stored as a dictionary: data holds the feature matrix, target holds the labels, target_names gives the actual meaning of the labels (1/0), and feature_names gives the names of the features.
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
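For example, we can quickly inspect the shape of the feature matrix and the meaning of the labels:

print(cancer['data'].shape)  # (569, 30): 569 samples, 30 features
print(cancer['target_names'])  # ['malignant' 'benign']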
We use default parameters for model training.
clf = dtc()  # use the default parameters
clf.fit(cancer['data'],cancer['target'])
DecisionTreeClassifier()
We first use sklearn's built-in plotting function for visualization. As shown below, the resulting image is not very readable.
from sklearn import tree
plt.figure(figsize=(20, 12))
tree.plot_tree(clf)
plt.show()
So we use the graphviz package instead. Getting it to work takes more than installing it in Python with pip: you also need to download the Graphviz installer from the official website and add its installation path to your environment variables. Many Baidu blog posts cover those steps, so I won't go into details here.
import graphviz  # package used to draw the decision tree
xx = tree.export_graphviz(clf,
                          out_file=None,
                          feature_names=cancer['feature_names'],  # add feature names
                          class_names=cancer['target_names'],  # class names
                          filled=True,  # fill colors
                          rounded=False  # rounded corners or not
                          )
graph1 = graphviz.Source(xx)
# graph1.render('111')  # export to PDF
graph1
2. Parameter learning
In this part, we study some common parameters of the classification decision tree. To avoid repeating the fitting and drawing code, we first wrap it in a helper function.
def graph(clf):
    clf.fit(cancer['data'], cancer['target'])
    xx = tree.export_graphviz(clf,
                              out_file=None,
                              feature_names=cancer['feature_names'],
                              class_names=cancer['target_names'],
                              filled=True,  # fill colors
                              rounded=False  # rounded corners or not
                              )
    graph1 = graphviz.Source(xx)
    return graph1
2.1 Splitting criterion
criterion can be set to "gini" or "entropy", i.e. Gini impurity or information gain. According to the scikit-learn documentation, the default decision tree is the CART algorithm, which uses Gini impurity; ID3 uses information gain, and C4.5 uses the information gain ratio, which sklearn does not provide.
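To make the two measures concrete, here is a minimal sketch: gini and entropy below are small helper functions of my own, not sklearn APIs, computing each impurity measure from a node's class proportions.

def gini(p):
    # Gini impurity: 1 - sum(p_i^2)
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    # information entropy: -sum(p_i * log2(p_i))
    return -sum(pi * np.log2(pi) for pi in p if pi > 0)

# a node with 70% of one class and 30% of the other
print(gini([0.7, 0.3]))  # 0.42
print(entropy([0.7, 0.3]))  # about 0.881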
As shown below, the resulting tree differs from the Gini-based tree above: with a different splitting criterion, the split points and split features change as well.
clf2 = dtc(criterion='entropy')  # use information gain (entropy)
graph2 = graph(clf2)
graph2
2.2 Splitting Mode
The splitter parameter has two modes, "best" and "random"; the default is "best".
"best" means the optimal split point is selected at every split: although there is some randomness in branching, the tree still preferentially splits on the more important features. "random" means the split point is selected at random each time.
lran = []
lbes = []
for i in range(100):
    clf3 = dtc(splitter="random")
    clf4 = dtc(splitter="best")
    random = cross_val_score(clf3, cancer['data'], cancer['target'], cv=5)
    best = cross_val_score(clf4, cancer['data'], cancer['target'], cv=5)
    lran.append(random.mean())
    lbes.append(best.mean())
plt.figure(figsize=(16, 6))
plt.plot(lran,label='random',c='darkblue')
plt.axhline(np.mean(lran),c='green')
plt.plot(lbes,label='best',c='darkred')
plt.axhline(np.mean(lbes),c='orange')
plt.legend()
plt.grid()
plt.show()
As can be seen from the figure above, the scores with splitter="random" fluctuate more than those with splitter="best".
clf3_1=dtc(splitter="random",random_state=3)
clf4_1=dtc(splitter="best",random_state=3)
graph3_1 = graph(clf3_1)
graph3_1
graph4_1 = graph(clf4_1)
graph4_1
From the two decision trees above, it can be seen that the tree built with splitter="random" is deeper and larger.
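This can also be checked numerically: since the graph() helper fits the model in place, clf3_1 and clf4_1 are already trained at this point, so we can query their depth and leaf counts directly.

print(clf3_1.get_depth(), clf3_1.get_n_leaves())  # random splitter: deeper, more leaves
print(clf4_1.get_depth(), clf4_1.get_n_leaves())  # best splitter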
2.3 Maximum depth
max_depth: int, default None. The maximum depth of the tree. If None, nodes keep splitting until all leaves are pure or contain fewer than min_samples_split samples. I think of this parameter as a rough way to prune: once a branch reaches the maximum depth, the tree stops splitting it immediately.
lmd = []
for i in range(1, 10):
    ll = 0
    for j in range(100):
        clf5 = dtc(max_depth=i)
        ll += cross_val_score(clf5, cancer['data'], cancer['target'], cv=5).mean()
    lmd.append(ll/100)
plt.plot([i for i in range(1, 10)], lmd, marker='o', c='darkblue', mfc='orange')  # x-axis: max_depth
plt.show()
clf5=dtc(max_depth=3)
graph5= graph(clf5)
graph5
2.4 Minimum number of split samples
min_samples_split: the minimum number of samples required to split an internal node; int or float, default 2. If an integer n, a node must contain at least n samples before it can be split; if a float p, the minimum is ceil(n_samples * p).
In other words, for an integer value, a node meets the split condition when its number of samples is greater than or equal to min_samples_split.
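As a worked example of the float form: on the 569-sample breast cancer data, a value of 0.05 requires ceil(569 * 0.05) = 29 samples before a node may split (clf_float below is just an illustrative name).

import math
print(math.ceil(len(cancer['target']) * 0.05))  # 29
clf_float = dtc(min_samples_split=0.05)  # equivalent to min_samples_split=29 on this data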
lms = []
for i in range(3, 50, 3):
    ll = 0
    for j in range(100):
        clf6 = dtc(min_samples_split=i)
        ll += cross_val_score(clf6, cancer['data'], cancer['target'], cv=5).mean()
    lms.append(ll/100)
plt.plot([i for i in range(3, 50, 3)], lms, marker='8', mfc='pink')
plt.show()
clf6=dtc(min_samples_split=20)
graph6= graph(clf6)
graph6
2.5 Minimum number of leaf node samples
min_samples_leaf: int or float, default 1. The minimum number of samples required at a leaf node. A split point at any depth is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This can have a smoothing effect on the model, especially in regression.
- If an integer, min_samples_leaf itself is the minimum number.
- If a float p, the minimum number per node is ceil(n_samples * p).
lml = []
for i in range(3, 50, 3):
    ll = 0
    for j in range(100):
        clf7 = dtc(min_samples_leaf=i)
        ll += cross_val_score(clf7, cancer['data'], cancer['target'], cv=5).mean()
    lml.append(ll/100)
plt.plot([i for i in range(3, 50, 3)], lml, marker='o', mfc='lightgreen')
plt.show()
clf7=dtc(min_samples_leaf=20)
graph7= graph(clf7)
graph7
2.6 Maximum number of features
max_features: the number of features to consider when looking for the best split.
- int: if an integer n is given, n features are considered.
- float: if a float p is given, int(n_features * p) features are used.
- "auto": max_features = sqrt(n_features).
- "sqrt": max_features = sqrt(n_features).
- "log2": max_features = log2(n_features).
- None: max_features = n_features.
I think what this parameter controls is the number of features considered at each split: if it is set to 9, then every time the tree splits, it picks the best feature out of a random subset of nine features.
lmscore = []
for i in range(1, 31):
    ll = 0
    for j in range(100):
        clf8 = dtc(max_features=i)
        ll += cross_val_score(clf8, cancer['data'], cancer['target'], cv=5).mean()
    lmscore.append(ll/100)
plt.plot([i for i in range(1, 31)], lmscore, marker='o', mfc='green')
plt.show()
dep = []
for i in range(1, 31):
    ll = 0
    for j in range(100):
        clf8 = dtc(max_features=i)
        clf8.fit(cancer['data'], cancer['target'])
        ll += clf8.get_depth()  # get the depth of the fitted tree
    dep.append(ll/100)
plt.plot([i for i in range(1, 31)], dep, marker='o', mfc='green')
plt.title("depth")
plt.show()
It can be seen that the depth of the decision tree is related to max_features, and the two are generally negatively correlated.
2.7 Class weight
class_weight: dict, list of dicts, "balanced" or None, default None; the weights associated with the classes. Note that for multi-output problems (including multilabel), you should define weights for each class of each column in its own dictionary. For example, for a four-output binary classification problem, the weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] rather than [{1: 1}, {2: 5}, {3: 1}, {4: 1}].
print("The number of ones".sum(cancer['target']))
print("Number of zeros".len(cancer['target']) -sum(cancer['target']))
Copy the code
The number of ones 357 the number of zeros 212Copy the code
First, it can be seen that the dataset is imbalanced. We vary the weight of each class and compute the average cross-validation score; the score is highest when the weight of class 1 is about 0.4.
lmcw = []
for i in range(1, 10):
    ll = 0
    for j in range(100):
        clf9 = dtc(class_weight={1: i/10, 0: 1 - i/10})
        ll += cross_val_score(clf9, cancer['data'], cancer['target'], cv=5).mean()
    lmcw.append(ll/100)
plt.plot([i/10 for i in range(1, 10)], lmcw, marker='o', mfc='red')
plt.show()
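As an aside, class_weight also accepts the string "balanced", which weights classes inversely proportional to their frequencies in y; a minimal sketch:

clf_bal = dtc(class_weight='balanced')  # weights = n_samples / (n_classes * np.bincount(y))
print(cross_val_score(clf_bal, cancer['data'], cancer['target'], cv=5).mean())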
2.8 Other Parameters
min_impurity_decrease: float, default 0.0. A node will be split if the split induces a decrease of impurity greater than or equal to this value.
max_leaf_nodes: int, default None. Grow a tree with at most max_leaf_nodes in best-first fashion, where the best nodes are defined by relative impurity reduction; None means an unlimited number of leaf nodes.
min_weight_fraction_leaf: float, default 0.0. The minimum weighted fraction of the total sample weight required at a leaf node; samples have equal weight when sample_weight is not given.
random_state: int, RandomState instance or None, default None. Controls the randomness of the estimator. The features are always randomly permuted at each split, even when splitter is set to "best". When max_features < n_features, the algorithm randomly selects max_features features at each split before finding the best split among them. Even when max_features = n_features, the best split found may vary across runs: this happens when several splits tie on the improvement criterion and one must be picked at random. To obtain deterministic behavior during fitting, random_state must be fixed to an integer.
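These parameters are not tuned in this article, but as a minimal sketch (the values below are arbitrary and purely illustrative), they are passed like any other constructor argument:

clf10 = dtc(min_impurity_decrease=0.01,  # require an impurity decrease of at least 0.01 to split
            max_leaf_nodes=10,  # grow best-first with at most 10 leaves
            random_state=0)  # fix the randomness so results are reproducible
print(cross_val_score(clf10, cancer['data'], cancer['target'], cv=5).mean())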
3. Properties and methods
3.1 Attributes
classes_: ndarray of shape (n_classes,) or list of ndarray. The class labels (single-output problem), or a list of arrays of class labels (multi-output problem).
**feature_importances_**: ndarray of shape (n_features,). The impurity-based feature importances; the higher the value, the more important the feature.
max_features_: int. The inferred value of max_features.
**n_classes_**: int or list of int. The number of classes (for a single-output problem), or a list containing the number of classes for each output (for multi-output problems).
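A quick look at these attributes on a fitted tree (a small sketch of my own; the printed values assume the breast cancer data loaded above):

clf_attr = dtc(max_depth=3)
clf_attr.fit(cancer['data'], cancer['target'])
print(clf_attr.classes_)  # [0 1]
print(clf_attr.n_classes_)  # 2
print(clf_attr.max_features_)  # 30, since max_features was left as None
print(clf_attr.feature_importances_.shape)  # (30,)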
3.2 Methods
Commonly used methods are shown in bold.
decision_path(X[, check_input]): returns the decision path of samples in the tree.
**fit**(X, y[, sample_weight, check_input, ...]): trains the model on the training set.
get_depth(): returns the depth of the decision tree.
get_n_leaves(): returns the number of leaves of the decision tree.
get_params([deep]): returns the parameters of the classifier.
**predict**(X[, check_input]): returns the predicted class (or predicted value, for regression) for the test samples.
predict_log_proba(X): returns the log of the predicted class probabilities of the test samples.
**predict_proba**(X[, check_input]): returns the predicted class probabilities of the test samples.
score(X, y[, sample_weight]): returns the mean accuracy on the given test data and labels.
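Finally, a short usage sketch of the common methods on a held-out split (the train/test split and the max_depth value are illustrative choices, not from the sections above):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    cancer['data'], cancer['target'], random_state=0)
clf_m = dtc(max_depth=3).fit(X_train, y_train)
print(clf_m.get_depth(), clf_m.get_n_leaves())  # depth and number of leaves
print(clf_m.predict(X_test[:5]))  # predicted classes for five test samples
print(clf_m.predict_proba(X_test[:5]))  # class probabilities for the same samples
print(clf_m.score(X_test, y_test))  # mean accuracy on the test set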
References:
sklearn.tree.DecisionTreeClassifier — scikit-learn 1.0.2 documentation
1.10. Decision Trees — scikit-learn 1.0.2 documentation