So far, this series has covered support vector machines (SVM), decision trees, KNN, naive Bayes, linear regression and logistic regression. For the other algorithms, please allow Taoye to owe you an IOU for now; they will be made up later when time and opportunity allow.

The updates so far have also received some praise from readers. It is not much, but thank you very much for your support, and I hope everyone who reads this series finds it rewarding.

The entire series is written by Taoye by hand, with reference to a number of books and open-source projects. The series totals around 150,000 words (including source code) and will be gradually filled in later. More technical articles can be found on Taoye's official account: Cynical Coder. The documents may be circulated freely, but please do not modify their contents.

If there is anything in the article you don't understand, feel free to ask directly, and Taoye will reply as soon as he sees it. Meanwhile, you are welcome to privately nudge Taoye on the official account: Cynical Coder. Taoye's personal contact information is also available there, and there are some things Taoye can only whisper to you there (# ‘O’)

To improve your reading experience, Taoye's series of hand-coded machine learning articles has been compiled into PDF and HTML, available for free by replying [666] on the official account [Cynical Coder].

In the previous article of this Machine Learning in Action series, Taoye gave a brief introduction to the theory of decision trees: what a decision tree is, and the three decision criteria used to select attributes when generating one. If you have studied SVM, or read the several SVM articles Taoye wrote earlier, you will find that decision trees are much simpler than SVM, without much complicated formula derivation.

Let’s review a little bit:

  • The higher the information gain of an attribute feature, the earlier it should, in principle, be selected; this criterion is commonly used in the ID3 algorithm
  • The higher the gain ratio of an attribute feature, the earlier it should be selected; this criterion is commonly used in the C4.5 algorithm
  • The lower the Gini index of an attribute feature, the earlier it should be selected; this criterion is commonly used in the CART algorithm
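
For reference, these three criteria can be written out explicitly. The formulas below are the standard textbook definitions (the notation may differ slightly from the previous article): for a data set $D$ with class proportions $p_k$, split by an attribute $a$ into subsets $D^1, \dots, D^V$,

$$
\begin{aligned}
Ent(D) &= -\sum_{k} p_k \log_2 p_k \\
Gain(D, a) &= Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v) \\
Gain\_ratio(D, a) &= \frac{Gain(D, a)}{-\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}} \\
Gini(D) &= 1 - \sum_{k} p_k^2
\end{aligned}
$$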

With that in mind, this article should be a breeze to read. This article mainly covers the following parts:

  • Manually building a decision tree based on the ID3 algorithm and visualizing it with Matplotlib
  • Classification prediction based on the constructed decision tree
  • How should the constructed decision tree model be saved and loaded?
  • Using Sklearn to build a decision tree on the Iris data set

I. Manually building a decision tree based on the ID3 algorithm and visualizing it with Matplotlib

There are several algorithms for building decision trees, such as ID3, C4.5 and CART. Due to limitations of space, time and energy, this article mainly adopts the ID3 algorithm, whose decision criterion is the information gain covered in the previous article. Readers interested in building decision trees with C4.5 or CART can refer to the gain ratio and Gini index from the previous article.

The data set used for this decision tree construction is still the loan data from Statistical Learning Methods by Li Hang. Here is the data set again:

As mentioned above, the ID3 algorithm uses information gain as the criterion for selecting attribute features. In the previous article, we already calculated the information gain of each attribute feature, as follows:


$$
\begin{aligned}
Gain(D, \text{age}) &= 0.083 \\
Gain(D, \text{work}) &= 0.324 \\
Gain(D, \text{house}) &= 0.420 \\
Gain(D, \text{credit}) &= 0.363
\end{aligned}
$$

According to the procedure of the ID3 algorithm, the attribute feature with the maximum information gain should be selected first for the decision, namely house, so house becomes the root node of the decision tree. The house attribute takes two values, which leads to two child nodes: one corresponding to "has a house" and the other to "no house". The six samples with "has a house" all belong to the same class (loan approved), that is, they are sample points of a single class, so that child becomes a leaf node, and we may as well label it "approve".

Thus, our root node and one of its leaves are determined. Next, we need to select a new attribute feature to split on from the sample set corresponding to "no house". Note: the data used for this decision is no longer the initial data set, but only the samples corresponding to "no house"; this needs special attention.

We now calculate the information gain of the remaining attributes on the "no house" data samples. Let us denote this data sample set by $D_1$; the results are as follows:


$$
\begin{aligned}
Gain(D_1, \text{age}) &= 0.251 \\
Gain(D_1, \text{work}) &= 0.918 \\
Gain(D_1, \text{credit}) &= 0.474
\end{aligned}
$$

By the same analysis as above, the information gain of work is now the highest, so the second attribute to select is "work". In addition, the data sample set $D_1$ contains 9 samples in total, of which 3 are approved and 6 are rejected, and the labels correspond exactly to the values of work: those with a job are approved and those without a job are rejected. Therefore, after this second attribute feature is selected, two leaf nodes are generated, whose results depend on whether the applicant has a job.
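
As a quick check on the largest value above: $D_1$ contains 9 samples (3 approved, 6 rejected), and splitting on work produces two pure subsets, so the conditional entropy is zero and the gain equals the entropy of $D_1$ itself:

$$
\begin{aligned}
H(D_1) &= -\frac{3}{9} \log_2 \frac{3}{9} - \frac{6}{9} \log_2 \frac{6}{9} \approx 0.918 \\
H(D_1 \mid \text{work}) &= \frac{3}{9} \cdot 0 + \frac{6}{9} \cdot 0 = 0 \\
Gain(D_1, \text{work}) &= H(D_1) - H(D_1 \mid \text{work}) \approx 0.918
\end{aligned}
$$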

Through the above analysis, we can get the decision tree constructed based on ID3 algorithm, which is as follows:

Next, we generate this decision tree in code. Tree-structured data can be saved as a dictionary or in JSON form; for example, the decision tree in the figure above can be represented as follows:

{" house ": {" 1" : "Y", "0" : {" work ": {" 1" : "Y", "0" : "N"}}}}Copy the code

The data format shown above is commonly referred to as JSON, which is used extensively in front-end and back-end development, crawlers, and many other areas. In addition, notice that during decision tree generation, after one attribute feature has been selected we have to go through the same operation to select the next one, which is essentially a loop; in other words, it fits the pattern of recursion, except that the data set shrinks at each call. Now that we know what type to use to store the tree-structured data, let's implement it in code:

Compared with the calculation of information gain in the previous article, there are three main changes in this code:

  • Because we ultimately want the generated tree to contain the attribute names rather than just their indices, the data-creation method also needs to return the corresponding attribute labels in addition to the data itself. The modified establish_data method is as follows:
"" Author: Taoye wechat public number: Coder Explain: "" def establish_data(): Data = [[0, 0, 0, 0, 'N'], # The last item indicates whether or not the loan [0, 0, 0, 1, 'N'], [0, 1, 0, 1, 'Y'], [0, 1, 1, 0, 'Y'], [0, 0, 0, 0, 'N'], [1, 0, 0, 0, 'N'], [1, 0, 0, 1, 'N'], [1, 1, 1, 1, 'Y'], [1, 0, 1, 2, 'Y'], [1, 0, 1, 2, 'Y'], [2, 0, 1, 2, 'Y'], [2, 0, 1, 1, 'Y'], [2, 1, 0, 1, 'Y'], (2, 1, 0, 2, 'Y'], [2, 0, 0, 0, 'N']] labels = [" age ", "work", "house", "credit"] return np. The array (data), and labelsCopy the code
  • In the previous article, the handle_data method only picked out the samples with a given attribute value, for example all samples whose age is "young". To build a decision tree, the selected attribute also needs to be removed from the data set once it has been chosen, so handle_data is changed as follows:
def handle_data(data, axis, value):
    result_data = list()
    for item in data:
        if item[axis] == value:
            reduced_data = item[: axis].tolist()
            reduced_data.extend(item[axis + 1:])
            result_data.append(reduced_data)
    return result_data
  • Third, we need a method establish_decision_tree that creates the decision tree. The specific code is as follows; it calls itself recursively to select the next attribute after the current one has been chosen:
"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Create the decision tree
"""
def establish_decision_tree(data, labels, feat_labels):
    cat_list = [item[-1] for item in data]
    if (cat_list.count(cat_list[0]) == len(cat_list)): return cat_list[0]   # Only one category left in the data set
    best_feature_index = calc_information_gain(data)    # Select the best attribute feature by information gain
    best_label = labels[best_feature_index]             # Label of the selected attribute feature
    # feat_labels records the selected attributes; create a decision tree node; remove the selected attribute from the label list
    feat_labels.append(best_label); decision_tree = {best_label: dict()}; del(labels[best_feature_index])
    feature_values = [item[best_feature_index] for item in data]
    unique_values = set(feature_values)      # Set of values taken by the best attribute
    for value in unique_values:
        sub_label = labels[:]
        decision_tree[best_label][value] = establish_decision_tree(np.array(handle_data(data, best_feature_index, value)), sub_label, feat_labels)
    return decision_tree

The complete code for this section is shown below:

import numpy as np
import pandas as pd

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Create the training data set
"""
def establish_data():
    data = [[0, 0, 0, 0, 'N'],         # The last item indicates whether the loan is granted
            [0, 0, 0, 1, 'N'],
            [0, 1, 0, 1, 'Y'],
            [0, 1, 1, 0, 'Y'],
            [0, 0, 0, 0, 'N'],
            [1, 0, 0, 0, 'N'],
            [1, 0, 0, 1, 'N'],
            [1, 1, 1, 1, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [2, 0, 1, 2, 'Y'],
            [2, 0, 1, 1, 'Y'],
            [2, 1, 0, 1, 'Y'],
            [2, 1, 0, 2, 'Y'],
            [2, 0, 0, 0, 'N']]
    labels = ["age", "work", "house", "credit"]
    return np.array(data), labels

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Calculate the information entropy
"""
def calc_information_entropy(data):
    data_number, _ = data.shape
    information_entropy = 0
    for item in pd.DataFrame(data).groupby(_ - 1):
        proportion = item[1].shape[0] / data_number
        information_entropy += - proportion * np.log2(proportion)
    return information_entropy

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Find the samples with a given attribute value, e.g. all samples whose age is "young"
"""
def handle_data(data, axis, value):
    result_data = list()
    for item in data:
        if item[axis] == value:
            reduced_data = item[: axis].tolist()
            reduced_data.extend(item[axis + 1:])
            result_data.append(reduced_data)
    return result_data

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Compute the information gain of each attribute and return the index of the best one
"""
def calc_information_gain(data):
    feature_number = data.shape[1] - 1                               # Number of attribute features
    base_entropy = calc_information_entropy(data)                    # Information entropy of the whole data set
    max_information_gain, best_feature = 0.0, -1                     # Initialize the maximum gain and the corresponding feature index
    for index in range(feature_number):
        feat_list = [item[index] for item in data]
        feat_set = set(feat_list)
        new_entropy = 0.0
        for set_item in feat_set:                                    # Compute the information gain after splitting on this attribute feature
            sub_data = handle_data(data, index, set_item)
            proportion = len(sub_data) / float(data.shape[0])        # Proportion of the subset
            new_entropy += proportion * calc_information_entropy(np.array(sub_data))
        temp_information_gain = base_entropy - new_entropy           # Information gain
        print("The information gain of attribute %d is %.3f" % (index + 1, temp_information_gain))
        if (temp_information_gain > max_information_gain):
            max_information_gain, best_feature = temp_information_gain, index    # Update the maximum gain and its index
    return best_feature

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Create the decision tree
"""
def establish_decision_tree(data, labels, feat_labels):
    cat_list = [item[-1] for item in data]
    if (cat_list.count(cat_list[0]) == len(cat_list)): return cat_list[0]   # Only one category left in the data set
    best_feature_index = calc_information_gain(data)    # Select the best attribute feature by information gain
    best_label = labels[best_feature_index]             # Label of the selected attribute feature
    # feat_labels records the selected attributes; create a decision tree node; remove the selected attribute from the label list
    feat_labels.append(best_label); decision_tree = {best_label: dict()}; del(labels[best_feature_index])
    feature_values = [item[best_feature_index] for item in data]
    unique_values = set(feature_values)      # Set of values taken by the best attribute
    for value in unique_values:
        sub_label = labels[:]
        decision_tree[best_label][value] = establish_decision_tree(np.array(handle_data(data, best_feature_index, value)), sub_label, feat_labels)
    return decision_tree

if __name__ == "__main__":
    data, labels = establish_data()
    print(establish_decision_tree(data, labels, list()))

Running results:

{'house': {'1': 'Y', '0': {'work': {'1': 'Y', '0': 'N'}}}}

As you can see, the output matches exactly the decision tree we constructed by hand. Perfect.

However, displaying the decision tree like this is a bit crude. It is acceptable when the generated tree is simple, but if the tree were more complex, reading it off JSON-formatted output would quickly become confusing.

To address this, we visualize the decision tree, mainly using the Matplotlib package. For the usage of Matplotlib you can refer to its documentation and other resources; Taoye will also put together a collection of commonly used interfaces later.

import numpy as np
import pandas as pd

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Create the training data set
"""
def establish_data():
    data = [[0, 0, 0, 0, 'N'],         # Sample data set: the first items are attribute features, the last indicates whether the loan is granted
            [0, 0, 0, 1, 'N'],
            [0, 1, 0, 1, 'Y'],
            [0, 1, 1, 0, 'Y'],
            [0, 0, 0, 0, 'N'],
            [1, 0, 0, 0, 'N'],
            [1, 0, 0, 1, 'N'],
            [1, 1, 1, 1, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [2, 0, 1, 2, 'Y'],
            [2, 0, 1, 1, 'Y'],
            [2, 1, 0, 1, 'Y'],
            [2, 1, 0, 2, 'Y'],
            [2, 0, 0, 0, 'N']]
    labels = ["年纪", "工作", "房子", "信用"]
    return np.array(data), labels

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Calculate the information entropy
"""
def calc_information_entropy(data):
    data_number, _ = data.shape
    information_entropy = 0
    for item in pd.DataFrame(data).groupby(_ - 1):
        proportion = item[1].shape[0] / data_number
        information_entropy += - proportion * np.log2(proportion)
    return information_entropy

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Find the samples with a given attribute value, e.g. all samples whose age is "young"
"""
def handle_data(data, axis, value):
    result_data = list()
    for item in data:
        if item[axis] == value:
            reduced_data = item[: axis].tolist()
            reduced_data.extend(item[axis + 1:])
            result_data.append(reduced_data)
    return result_data

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Compute the information gain of each attribute and return the index of the best one
"""
def calc_information_gain(data):
    feature_number = data.shape[1] - 1                    # Number of attribute features
    base_entropy = calc_information_entropy(data)                 # Information entropy of the whole data set
    max_information_gain, best_feature = 0.0, -1                 # Initialize the maximum information gain and the corresponding feature index
    for index in range(feature_number):
        feat_list = [item[index] for item in data]
        feat_set = set(feat_list)
        new_entropy = 0.0
        for set_item in feat_set:                         # Compute the information gain after splitting on this attribute feature
            sub_data = handle_data(data, index, set_item)
            proportion = len(sub_data) / float(data.shape[0])           # Proportion of the subset
            new_entropy += proportion * calc_information_entropy(np.array(sub_data))
        temp_information_gain = base_entropy - new_entropy                     # Compute the information gain
        print("The information gain of attribute %d is %.3f" % (index + 1, temp_information_gain))            # Print the information gain of each attribute
        if (temp_information_gain > max_information_gain):
            max_information_gain, best_feature = temp_information_gain, index       # Update the maximum information gain and the index it corresponds to
    return best_feature

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Create the decision tree
"""
def establish_decision_tree(data, labels, feat_labels):
    cat_list = [item[-1] for item in data]
    if (cat_list.count(cat_list[0]) == len(cat_list)): return cat_list[0]   # Only one category left in the data set
    best_feature_index = calc_information_gain(data)    # Select the best attribute feature by information gain
    best_label = labels[best_feature_index]   # Label of the selected attribute feature
    # feat_labels records the selected attributes; create a decision tree node; remove the selected attribute from the label list
    feat_labels.append(best_label); decision_tree = {best_label: dict()}; del(labels[best_feature_index])
    feature_values = [item[best_feature_index] for item in data]
    unique_values = set(feature_values)      # Set of values taken by the best attribute
    for value in unique_values:
        sub_label = labels[:]
        decision_tree[best_label][value] = establish_decision_tree(np.array(handle_data(data, best_feature_index, value)), sub_label, feat_labels)
    return decision_tree
"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Count the number of leaf nodes and the depth of the decision tree
"""
def get_leaf_number_and_tree_depth(decision_tree):
    leaf_number, first_key, tree_depth = 0, next(iter(decision_tree)), 0; second_dict = decision_tree[first_key]
    for key in second_dict.keys():
        if type(second_dict.get(key)).__name__ == "dict":
            temp_number, temp_depth = get_leaf_number_and_tree_depth(second_dict[key])
            leaf_number, curr_depth = leaf_number + temp_number, 1 + temp_depth
        else: leaf_number += 1; curr_depth = 1
        if curr_depth > tree_depth: tree_depth = curr_depth
    return leaf_number, tree_depth

from matplotlib.font_manager import FontProperties

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Plot a node
"""
def plot_node(node_text, center_pt, parent_pt, node_type):
    arrow_args = dict(arrowstyle = "<-")
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)    # Set the font (Windows)
    create_plot.ax1.annotate(node_text, xy=parent_pt,  xycoords='axes fraction',
                            xytext=center_pt, textcoords='axes fraction',
                            va="center", ha="center", bbox=node_type, arrowprops=arrow_args, FontProperties=font)

"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Label the value on a directed edge
"""
def tag_text(cntr_pt, parent_pt, node_text):
    x_mid = (parent_pt[0] - cntr_pt[0]) / 2.0 + cntr_pt[0]
    y_mid = (parent_pt[1] - cntr_pt[1]) / 2.0 + cntr_pt[1]
    create_plot.ax1.text(x_mid, y_mid, node_text, va="center", ha="center", rotation=30)
"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Plot the decision tree
"""
def plot_tree(decision_tree, parent_pt, node_text):
    decision_node = dict(boxstyle="sawtooth", fc="0.8")
    leaf_node = dict(boxstyle = "round4", fc = "0.8")
    leaf_number, tree_depth = get_leaf_number_and_tree_depth(decision_tree)
    first_key = next(iter(decision_tree))
    cntr_pt = (plot_tree.xOff + (1.0 + float(leaf_number)) / 2.0 / plot_tree.totalW, plot_tree.yOff)
    tag_text(cntr_pt, parent_pt, node_text); plot_node(first_key, cntr_pt, parent_pt, decision_node)
    second_dict = decision_tree[first_key]
    plot_tree.yOff = plot_tree.yOff - 1.0 / plot_tree.totalD
    for key in second_dict.keys():
        if type(second_dict[key]).__name__ == 'dict': plot_tree(second_dict[key], cntr_pt, str(key))
        else:
            plot_tree.xOff = plot_tree.xOff + 1.0 / plot_tree.totalW
            plot_node(second_dict[key], (plot_tree.xOff, plot_tree.yOff), cntr_pt, leaf_node)
            tag_text((plot_tree.xOff, plot_tree.yOff), cntr_pt, str(key))
    plot_tree.yOff = plot_tree.yOff + 1.0 / plot_tree.totalD

from matplotlib import pyplot as plt
"""
    Author: Taoye
    WeChat Official Account: Cynical Coder
    Explain: Create the plot of the decision tree
"""
def create_plot(in_tree):
    fig = plt.figure(1, facecolor = "white")
    fig.clf()
    axprops = dict(xticks = [], yticks = [])
    create_plot.ax1 = plt.subplot(111, frameon = False, **axprops)
    leaf_number, tree_depth = get_leaf_number_and_tree_depth(in_tree)
    plot_tree.totalW, plot_tree.totalD = float(leaf_number), float(tree_depth)
    plot_tree.xOff = -0.5 / plot_tree.totalW; plot_tree.yOff = 1.0
    plot_tree(in_tree, (0.5,1.0), '')
    plt.show()
    
if __name__ == "__main__":
    data, labels = establish_data()
    decision_tree = establish_decision_tree(data, labels, list())
    print(decision_tree)
    print("决策树的叶子节点数和深度:", get_leaf_number_and_tree_depth(decision_tree))
    create_plot(decision_tree)

The result of the manually visualized decision tree is as follows:

To be honest, manually visualizing the decision tree with Matplotlib is a bit unfriendly to inexperienced coders. Don't worry too much, though: later we will also draw the decision tree with Graphviz, which is much easier. Here is a quick note on the code above:

  • get_leaf_number_and_tree_depth counts the number of leaf nodes and the depth of the decision tree. For each key it looks at the corresponding value and checks whether that value is a dict: if so, the node is an internal (non-leaf) node and the function recurses; otherwise it is a leaf node
  • plot_node draws a node; the font is set for Windows (simsun.ttc), and on Linux you need to point it at a different font file
  • tag_text labels the attribute value on a directed edge, mainly 1 or 0, where 1 means the attribute holds and 0 means it does not
  • plot_tree recursively draws the decision tree, calling the methods defined above; a quick check of these helpers on the loan tree is sketched below
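
For instance, applying get_leaf_number_and_tree_depth to the loan decision tree built above confirms the structure we derived by hand (a small sketch reusing the function from the full listing):

loan_tree = {"house": {"1": "Y", "0": {"work": {"1": "Y", "0": "N"}}}}
print(get_leaf_number_and_tree_depth(loan_tree))    # (3, 2): three leaf nodes, depth 2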

II. Classification prediction based on the constructed decision tree

After constructing a decision tree from the training data, we can apply the model to actual data for classification. To classify, we need the decision tree and the label vector used to construct it. The program then compares the test data with the values in the decision tree and performs this process recursively until it reaches a leaf node; finally, the test data is assigned the class of that leaf node. (Machine Learning in Action)

In other words, once the decision tree model is available, classifying test data is straightforward.

To do this, we define a classify method:

"" Author: Taoye wechat official number: Def classify(decision_tree, best_feature_labels, test_data): first_node = next(iter(decision_tree)) second_dict = decision_tree[first_node] feat_index = best_feature_labels.index(first_node) for key in second_dict.keys(): if int(test_data[feat_index]) == int(key): if type(second_dict[key]).__name__ == "dict": Result_label = classify(second_dict[key], best_feature_labels, test_data) else: result_label = second_dict[key] return result_labelCopy the code

We tested four data samples: (has house, no job), (has house, has job), (no house, no job) and (no house, has job), represented by the lists [1, 0], [1, 1], [0, 0] and [0, 1] respectively. The running results are as follows:

It can be seen that the four groups of data can be classified successfully.

III. How should the constructed decision tree model be saved and loaded?

Once the decision tree is built, we can save the model through the pickle module.

Once the model is saved, there is no need to retrain it the next time it is used; just load it. Example code for saving and loading the model is as follows (it is pretty simple, so I won't go into details):

import pickle

with open("DecisionTreeModel.txt", "wb") as f:
    pickle.dump(decision_tree, f)            # Save the model

f = open("DecisionTreeModel.txt", "rb")
decision_tree = pickle.load(f)               # Load the model

IV. Using Sklearn to build a decision tree on the Iris data set

Now let's implement a small case with Sklearn, using the Iris data set commonly used in machine learning. For more on decision tree classification, you can study the sklearn.tree.DecisionTreeClassifier documentation: scikit-learn.org/stable/modu…

The main Sklearn interface used to implement decision tree classification is sklearn.tree.DecisionTreeClassifier, which builds a decision tree model from a data set. In addition, if we want to visualize the decision tree, we also need export_graphviz. Of course, there are other interfaces under sklearn.tree that you can call.

For the usage of sklearn.tree.DecisionTreeClassifier, you may refer to: scikit-learn.org/stable/modu… . It has a lot of built-in parameters; 8 of them are recorded here, which is also convenient for later review, and the others can be looked up when needed (a short usage sketch follows the list):

  • criterion: the attribute selection criterion. The default is gini, and you can also choose entropy. gini is the Gini index and entropy is the information entropy, both of which we covered in the previous article
  • splitter: the strategy used to choose the split at each node. The default is "best" and it can be set to "random". The default "best" is suitable for a small sample size, while "random" is recommended for building the decision tree when the sample data is very large.
  • max_depth: the maximum depth of the decision tree. Default is None. Note that this depth does not include the root node. In general, you can leave this value alone when there are few data or features; if the sample size and the number of features are large, it is recommended to limit the maximum depth, with the specific value depending on the distribution of the data.
  • max_features: Maximum number of features to consider when dividing. Default is None. Generally speaking, if the number of features in the sample is small, such as less than 50, we use the default “None”. If the number of features is very large, we can flexibly use other values to control the maximum number of features considered when dividing, so as to control the generation time of the decision tree. Just check the documentation when you need it.
  • min_samples_split: the minimum number of samples required to split an internal node. The default value is 2. That is to say, if a node has fewer samples than min_samples_split, it will not be split further even if some attribute meets the selection criterion.
  • min_samples_leaf: the minimum number of samples in a leaf node. The default value is 1. This value limits the minimum sample count of a leaf node: if a leaf would end up with fewer samples than this, it is pruned together with its sibling nodes. In other words, it defines how many samples are needed for a node to count as a leaf; if set to 1, the tree will be built even when a leaf contains only one sample of its class.
  • max_leaf_nodes: Maximum number of leaf nodes. Default is None. Overfitting can be prevented by limiting the maximum number of leaf nodes. If the limit is added, the algorithm will build the optimal decision tree within the maximum number of leaves. If there are few features, this value can be ignored, but if there are many features, it can be restricted, and the exact value can be obtained by cross-validation.
  • random_state: random number seed, default is None. If no random number is set, the random number is related to the current system time, and each time is different. If a random number seed is set, the same random number seed will generate the same random number at different times.
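
To make these parameters concrete, here is a small sketch; the values below are arbitrary choices for illustration rather than recommended settings, and should be tuned for real data (for example by cross-validation):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy",    # use information entropy instead of the default gini
                             max_depth=3,            # limit the depth of the tree
                             min_samples_split=4,    # an internal node needs at least 4 samples to be split
                             min_samples_leaf=2,     # every leaf must contain at least 2 samples
                             random_state=42)        # fix the random seed so the tree is reproducible
clf.fit(iris.data, iris.target)
print(clf.get_depth(), clf.get_n_leaves())           # inspect the depth and leaf count of the fitted tree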

In addition, there are many methods available under DecisionTreeClassifier; for details, please refer to the documentation.

Next, let's use Sklearn to classify the Iris data set. Reference: scikit-learn.org/stable/auto…

Building the decision tree itself is not difficult; most of the code deals with visualization, which involves a number of Matplotlib operations. The complete code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

class IrisDecisionTree:
    """
        Explain: initialization attributes
        Parameters:
            n_classes: the number of iris classes
            plot_colors: the colors used for each class
            plot_step: the step size of the mesh grid
    """
    def __init__(self, n_classes, plot_colors, plot_step):
        self.n_classes = n_classes
        self.plot_colors = plot_colors
        self.plot_step = plot_step

    """
        Explain: Construct the data set through load_iris
    """
    def establish_data(self):
        iris_info = load_iris()
        return iris_info.data, iris_info.target, iris_info.feature_names, iris_info.target_names

    """
        Explain: Visualize the classification results
    """
    def show_result(self, x_data, y_label, feature_names, target_names):
        for index, pair in enumerate([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]):
            sub_x_data, sub_y_label = x_data[:, pair], y_label
            clf = DecisionTreeClassifier().fit(sub_x_data, sub_y_label)
            plt.subplot(2, 3, index + 1)
            x_min, x_max = sub_x_data[:, 0].min() - 1, sub_x_data[:, 0].max() + 1    # The first attribute
            y_min, y_max = sub_x_data[:, 1].min() - 1, sub_x_data[:, 1].max() + 1    # The second attribute
            xx, yy = np.meshgrid(np.arange(x_min, x_max, self.plot_step), np.arange(y_min, y_max, self.plot_step))
            plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
            plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)
            plt.xlabel(feature_names[pair[0]]); plt.ylabel(feature_names[pair[1]])
            for i, color in zip(range(self.n_classes), self.plot_colors):
                idx = np.where(sub_y_label == i)
                plt.scatter(sub_x_data[idx, 0], sub_x_data[idx, 1], c=color, label=target_names[i],
                            cmap=plt.cm.RdYlBu, edgecolor='black')
        from matplotlib.font_manager import FontProperties
        font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)    # Set the font (Windows)
        plt.suptitle("Decision surface of decision trees trained on pairs of iris features", fontproperties=font)
        plt.legend(loc='lower right', borderpad=0, handletextpad=0)
        plt.axis("tight")
        plt.figure()
        clf = DecisionTreeClassifier().fit(x_data, y_label)    # Build a tree on all four attributes
        plot_tree(clf, filled=True)
        plt.show()

if __name__ == "__main__":
    iris_decision_tree = IrisDecisionTree(3, "ryb", 0.02)
    x_data, y_label, feature_names, target_names = iris_decision_tree.establish_data()
    iris_decision_tree.show_result(x_data, y_label, feature_names, target_names)

The running result is as follows:

From the visualization we get two main figures, which we explain in turn. The Iris data set has four attribute features and three label classes. To make the visualization possible, the first figure uses only two attributes at a time to build the decision tree. Choosing 2 out of the 4 attributes gives $C_4^2 = 6$ possibilities (simple permutation and combination), so each subplot in the first figure corresponds to one pair, and different colors represent different classes; where the color of a data point matches the color of the surrounding region, the classification is correct. Intuitively, the pair Sepal Length and Petal Length gives a relatively good decision tree. The second figure is the decision tree built on all attribute features in the data set; the exact result can be viewed by running the code above (because of the font setting, the code above must be run under Windows).
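
The six hard-coded pairs in show_result are exactly these combinations; a quick way to generate them (a sketch, not part of the original code) is:

from itertools import combinations

print(list(combinations(range(4), 2)))    # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]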

The second figure above was drawn with sklearn.tree.plot_tree, which visualizes the decision tree through Matplotlib. Next, let's visualize the decision tree with Graphviz instead.

Graphviz cannot simply be installed via pip, and installing it through Anaconda is very slow and may fail even after several attempts. This happened a few days ago when Taoye helped a classmate install Graphviz (on Windows; on Linux it is very convenient), so here we install it directly through a whl file.

Advice: anyone who has used Python for a while will often install third-party modules. Some can be handled perfectly by pip or conda, while others throw all kinds of unexplained errors during installation. Readers who run into such errors may want to try installing from a whl file instead; whl files for all kinds of Python modules are collected at: www.lfd.uci.edu/~gohlke/pyt…

Open the URL above -> Ctrl + F to search for Graphviz -> download the desired Graphviz whl file

Then install it from the local path where the file was downloaded:

pip install graphviz-0.15-py3-none-any.whl

For Windows, you also need to download the Graphviz exe installer from the official website and add its bin directory to the PATH environment variable. Exe file download address: graphviz.org/download/

If you are a Linux user, it is more convenient, directly through the command installation can be:

$ sudo apt install graphviz         # Ubuntu
$ sudo apt install graphviz         # Debian

At this point, Graphviz is installed. To visualize the decision tree with it, add a show_result_by_graphviz method to IrisDecisionTree:

    """
        Explain: Visualize the decision tree through Graphviz
    """
    def show_result_by_graphviz(self, x_data, y_label):
        clf = DecisionTreeClassifier().fit(x_data, y_label)
        from sklearn import tree
        iris_info = load_iris()
        iris_dot_data = tree.export_graphviz(clf, out_file=None,
                                             feature_names=iris_info.feature_names,
                                             class_names=iris_info.target_names,
                                             filled=True, rounded=True,
                                             special_characters=True)
        import graphviz
        graph = graphviz.Source(iris_dot_data); graph.render("iris")
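
Assuming the IrisDecisionTree class defined earlier, invoking the new method could look like this sketch:

if __name__ == "__main__":
    iris_decision_tree = IrisDecisionTree(3, "ryb", 0.02)
    x_data, y_label, feature_names, target_names = iris_decision_tree.establish_data()
    iris_decision_tree.show_result_by_graphviz(x_data, y_label)    # renders iris.pdf in the current directory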

After running, a PDF file is generated in the current directory, which is the visualized decision tree. Note: the above is only a simple decision tree classification of the Iris data. Readers can construct different decision trees by adjusting the parameters of DecisionTreeClassifier and judge the merits of each one.

This is the end of this article. The decision tree content stops here for now; other topics such as overfitting and pruning will be updated later. The next article in the machine learning series will tackle the nonlinear SVM model.

That's all for now ~~~

I am Taoye. I love studying and sharing, and I am keen on all kinds of technology. Outside of study I enjoy anime, playing chess, listening to music and chatting. I hope to use these articles to record my growth and the little moments in life, and to meet more like-minded friends along the way. You are welcome to visit my WeChat official account: Cynical Coder.

References:

[1] Machine Learning in Action, Peter Harrington, Posts and Telecommunications Press
[2] Statistical Learning Methods (2nd Edition), Li Hang, Tsinghua University Press
[3] Machine Learning, Zhou Zhihua, Tsinghua University Press
[4] www.lfd.uci.edu/~gohlke/pyt…
[5] sklearn.tree.DecisionTreeClassifier: scikit-learn.org/stable/modu…
[6] Graphviz official website: graphviz.org/

Recommended reading

  • Machine Learning in Action: Support Vector Machines (SVM)
  • print("Hello, NumPy!")
  • Useless at everything else, but first place at eating
  • Taoye infiltrated the headquarters of a shady platform, and the truth behind it is terrifying
  • "Tai Hua database": what does an SQL statement actually do under the hood when it is executed?
  • Git-based setup of a deep learning environment with Ubuntu + Python + Tensorflow + Jupyter Notebook