4.7 Summary of decision trees in sklearn
Reference documents: Chinese link / English link. API: Chinese link / English link
The decision tree implementation in scikit-learn uses an optimized version of the CART algorithm, which can handle both classification and regression. The classification decision tree corresponds to the DecisionTreeClassifier class. The sklearn.tree module provides decision tree models for solving classification and regression problems, and it is used as follows:
Let's first take a look at the DecisionTreeClassifier class and its constructor parameters:
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
The parameters are described as follows:
- criterion: optional parameter, default 'gini'; can also be set to 'entropy'. Gini refers to Gini impurity, the expected error rate of randomly assigning one of the outcomes in a set to a data item; it is a statistics-based measure. Entropy is Shannon entropy, discussed in the previous article, and is an information-theory-based measure. sklearn uses 'gini' as the default. The ID3 algorithm uses entropy, while the CART algorithm uses Gini. (A small example that sets several of these parameters is shown after this list.)
- splitter: the strategy for choosing the split point at each node. Optional parameter, default 'best'; can also be set to 'random'. 'best' selects the best split feature according to the chosen criterion (gini or entropy), while 'random' finds a locally optimal split among a random subset of candidate split points. The default 'best' is suitable for small sample sizes; if the sample size is very large, 'random' is recommended for building the tree.
- max_features: the maximum number of features to consider when searching for a split. Optional parameter, default None. There are 6 cases for determining the maximum number of features (n_features is the total number of features): if max_features is an integer, max_features features are considered; if max_features is a floating-point number, int(max_features * n_features) features are considered; if max_features is 'auto', then max_features = sqrt(n_features); if max_features is 'sqrt', then max_features = sqrt(n_features), the same as 'auto'; if max_features is 'log2', then max_features = log2(n_features); if max_features is None, then max_features = n_features, that is, all features are used. Generally speaking, if the number of features is small (say fewer than 50), use the default None; if the number of features is very large, the other values described above can be used to control the maximum number of features considered at each split and thus the time it takes to build the decision tree.
- max_depth: the maximum depth of the decision tree. Optional parameter, default None. This is the number of levels of the tree; in the loan example, for instance, the decision tree has 2 levels. If set to None, the depth of the subtrees is not limited when the tree is built; in general, this value can be left alone when there is little data or few features. Alternatively, if min_samples_split is set, nodes are expanded until they contain fewer than min_samples_split samples. If the sample size and the number of features are both large, it is recommended to limit the maximum depth; the specific value depends on the distribution of the data, and commonly ranges from 10 to 100.
- min_samples_split: the minimum number of samples required to split an internal node. Optional parameter, default 2. This value limits the conditions under which a subtree can be split further. If min_samples_split is an integer, it is the minimum number of samples an internal node must contain to be split; that is, if a node has fewer samples than min_samples_split, splitting stops. If min_samples_split is a floating-point number, it is treated as a percentage, and ceil(min_samples_split * n_samples) (rounded up) is the minimum number of samples. If the sample size is small, this value can be left at its default; if the sample size is very large, it is recommended to increase it.
- min_weight_fraction_leaf: the minimum sum of sample weights at a leaf node. Optional parameter, default 0. This value limits the minimum sum of the weights of all samples at a leaf node; if a leaf falls below this value, it is pruned together with its sibling nodes. Generally speaking, this value matters when many samples have missing values, or when the class distribution of the classification samples is strongly skewed, in which case sample weights are introduced.
- max_leaf_nodes: the maximum number of leaf nodes. Optional parameter, default None. Limiting the maximum number of leaf nodes can prevent overfitting. If a limit is set, the algorithm builds the optimal decision tree within that number of leaves. If there are few features, this value can be ignored; if there are many features, it can be restricted, and the exact value can be tuned by cross-validation.
- class_weight: class weights. Optional parameter, default None. This specifies the weight of each class in the sample, mainly to prevent classes with many samples in the training set from biasing the trained decision tree too heavily toward those classes. Class weights can be given in the {class_label: weight} format, or you can use 'balanced', in which case the algorithm computes the weights itself and classes with fewer samples receive higher weights. If the class distribution of your samples is not significantly skewed, you can ignore this parameter and keep the default None.
- random_state: optional parameter, default None; the random number seed. If it is an integer, random_state is used as the seed of the random number generator. If no seed is set, the random numbers depend on the current system time and differ on every run; if a seed is set, the same seed produces the same random numbers at different times. If it is a RandomState instance, random_state is the random number generator itself. If None, the random number generator is np.random.
- min_impurity_split: the minimum impurity required for a node to split. Optional parameter, default 1e-7. This threshold limits the growth of the decision tree: if a node's impurity (Gini index, information gain, mean squared error, or mean absolute error) is below this threshold, the node is not split further and becomes a leaf node.
- presort: whether to presort the data. Optional parameter, default False. In general, if the sample size is small or the tree depth is limited to a small value, setting this to True can speed up the selection of split points and the construction of the tree; if the sample size is very large, there is no benefit.
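To make these options concrete, here is a minimal sketch that builds a classifier on the iris dataset (the same dataset used in the visualization examples below) with a few of the parameters described above; the specific values are purely illustrative, not recommendations:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Illustrative parameter choices, only to demonstrate the options above
clf = DecisionTreeClassifier(
    criterion='entropy',      # use Shannon entropy instead of the default gini
    splitter='best',          # choose the best split at each node
    max_depth=3,              # limit tree depth to reduce overfitting
    min_samples_split=4,      # a node needs at least 4 samples to be split
    class_weight='balanced',  # let sklearn compute class weights automatically
    random_state=0,           # fix the seed for reproducible trees
)
clf.fit(iris.data, iris.target)
print(clf.predict(iris.data[:3]))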
In addition to these parameters, other points to note when tuning parameters are:
- When the number of samples is small but the number of features is very large, the decision tree tends to overfit. Generally speaking, it is easier to build a robust model when the number of samples is larger than the number of features.
- If the number of samples is small but the number of features is very large, it is recommended to perform dimensionality reduction, such as principal component analysis (PCA), feature selection (Lasso), or independent component analysis (ICA), before fitting the decision tree model. This greatly reduces the feature dimension, and fitting the decision tree model afterwards works better.
- It is recommended to visualize the decision tree while limiting its depth, so that you can first observe how the generated tree fits the data before deciding whether to increase the depth.
- When training the model, pay attention to the class balance of the samples (this mainly applies to classification trees). If the class distribution is very uneven, consider using class_weight to prevent the model from being too biased toward the classes with many samples.
- The decision tree's internal arrays use numpy float32. If the training data is not in this format, the algorithm copies the data before running.
- If the input sample matrix is sparse, it is recommended to convert it to csc_matrix before fitting and to csr_matrix before predicting, as sketched below.
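A minimal sketch of the sparse-input advice above, using scipy's sparse matrix types (the small dense array X here is only a stand-in for real sparse data):

import numpy as np
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0, 1], [1, 0], [0, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 0, 1])

clf = DecisionTreeClassifier(random_state=0)
clf.fit(csc_matrix(X), y)          # csc_matrix is efficient for the column-wise access done during fitting
pred = clf.predict(csr_matrix(X))  # csr_matrix is efficient for the row-wise access done during prediction
print(pred)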
4.8 Summary of decision tree visualization environment setup and use
Visualizing decision trees in scikit-learn typically requires Graphviz, so the first step is to install it. The author uses the Anaconda environment; it is enough to type conda install python-graphviz (or pip3 install graphviz) in a terminal window.
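After installation, a quick sanity check (a minimal sketch, assuming the Python graphviz package installed correctly) is to import it and print its version:

import graphviz  # the Python binding installed by python-graphviz / pip3 install graphviz
print(graphviz.__version__)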
I use two approaches. Method 1: use Graphviz's dot command to generate the visualization file of the decision tree. After running the dot command you will see the decision tree visualization file iris.pdf in the current directory; open it to see the model diagram of the tree. You need to add the following code to your program:
with open("iris.dot", 'w') as f:
f = tree.export_graphviz(clf, out_file=f)
Copy the code
After the code runs successfully, iris.dot is generated in the current directory. The dot file then needs to be converted to a PDF of the decision tree: dot -Tpdf iris.dot -o output.pdf
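Putting method 1 together, a minimal end-to-end sketch (assuming the iris dataset, as in the rest of this chapter, and illustrative parameter settings) might look like this; the dot-to-PDF conversion is still done from the command line afterwards:

from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)  # illustrative settings
clf.fit(iris.data, iris.target)

# Write the tree structure to iris.dot, then run: dot -Tpdf iris.dot -o output.pdf
with open("iris.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f,
                         feature_names=iris.feature_names,
                         class_names=iris.target_names)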
Method 2: PDF can be generated directly
"" dot_data = tree.export_graphviz(CLF, out_file=None) "" feature_names=iris.feature_names, class_names=iris.target_names, filled=True, rounded=True, special_characters=True) graph = graphviz.Source(dot_data) graph.render("tree")Copy the code
DT_Iris_Visual-sklearn\d6.dt_iris_visual-sklearn.py
For the code referenced in this chapter, click here to enter.