Review of the previous article

The last article mainly introduced the following aspects:

  • An introduction to decision trees
  • The workflow of decision trees
  • What entropy is and how it is calculated
  • The definition of information gain and how to calculate it
  • Dividing a data set according to information gain

Based on a new data set (the contact lens data set), this article implements the construction of a decision tree, saving and loading the tree, classification with the tree, and visualization of the tree. Knowledge from the previous article is not reviewed in much detail; the focus is on these four aspects.

Take a look at the data set:

This data set comes from the UCI database. Its four features are age (age), prescript (prescription type), astigmatic (whether astigmatic) and tearRate (tear production rate), plus a class label, class, whose values are hard lenses, soft lenses and no lenses (should not wear lenses).

To make processing easier, the samples are encoded as follows (a sketch of applying this mapping appears after the list):

  1. age: young -> 0, pre -> 1, presbyopic -> 2
  2. prescript: myope -> 0, hyper -> 1
  3. astigmatic: no -> 0, yes -> 1
  4. tearRate: reduced -> 0, normal -> 1
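As a minimal sketch of this preprocessing step (the file name lenses.txt, the tab separator and the pandas loading code are assumptions, not part of the original article), the mapping can be applied like this:

```python
import pandas as pd

# Hypothetical loading step: file name and separator are assumed
lenses = pd.read_csv('lenses.txt', sep='\t',
                     names=['age', 'prescript', 'astigmatic', 'tearRate', 'class'])

# Apply the numeric encoding listed above to the feature columns
mappings = {
    'age':        {'young': 0, 'pre': 1, 'presbyopic': 2},
    'prescript':  {'myope': 0, 'hyper': 1},
    'astigmatic': {'no': 0, 'yes': 1},
    'tearRate':   {'reduced': 0, 'normal': 1},
}
for col, mapping in mappings.items():
    lenses[col] = lenses[col].map(mapping)
```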

4. Construction of the decision tree

Before constructing the decision tree, let's review how the sub-modules written earlier work: get the original data set, then find the optimal feature based on it. Since a feature may take more than two values, the data set may be split into more than two branches. After the first split, the data is passed down to the next node of the tree, where it can be split again; the data set is therefore handled recursively.

When does the partitioning end? It ends when the program has used up all the attributes that can split the data set, or when all instances under a branch belong to the same class.

The process of constructing the decision tree is to fill each partitioned data set into a dictionary; when the partitioning ends, filling the dictionary also ends. This is likewise a recursive process, and when it finishes the decision tree is complete. The code is as follows:

```python
def CreateTree(DataSet):
    # Get all feature labels
    index_list = list(DataSet.columns)
    # Count the classes in the last column (the class label)
    label_series = DataSet.iloc[:, -1].value_counts()
    # Stop when all samples belong to one class, or when only the label column remains
    if label_series[0] == DataSet.shape[0] or DataSet.shape[1] == 1:
        return label_series.index[0]  # return the class label
    # Get the column index of the optimal feature
    col = ChooseBF(DataSet)
    # Get the optimal feature
    BestFeature = index_list[col]
    # Fill the dictionary with the optimal feature
    TheTree = {BestFeature: {}}
    # Delete the feature label from the label list
    del index_list[col]
    # Extract all attribute values of the optimal column
    value_list = set(DataSet.iloc[:, col])
    # Build the tree recursively; each call works on the current optimal feature
    for value in value_list:
        TheTree[BestFeature][value] = CreateTree(splitSet(DataSet, col, value))
    return TheTree
```

The first stop condition of the recursive function is that all class labels are the same. The second stop condition is that all features of the data set have been used up, that is, the data set cannot be divided any further. The dictionary variable TheTree stores all the information of the tree, and BestFeature is the current optimal feature.

CreateTree() is called recursively on each partition of the data set, and the argument passed in is the data set remaining after each split. The return value is inserted into the dictionary TheTree; when the recursion ends, the dictionary contains many nested dictionaries representing the leaf-node information.
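Assuming the encoded lens data is in a DataFrame (here called lenses, as in the earlier sketch) and that the ChooseBF and splitSet functions from the previous article are available, building the tree is a single call:

```python
# Build the tree from the encoded lens DataFrame; 'lenses' is the assumed variable name
TheTree = CreateTree(lenses)
print(TheTree)
```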

The resulting TheTree dictionary is as follows:

```python
{'tearRate': {0: 'no lenses', 1: {'astigmatic': {0: {'age': {0: 'soft', 1: 'soft', 2: {'prescript': {0: 'no lenses', 1: 'soft'}}}}, 1: {'prescript': {0: 'hard', 1: {'age': {0: 'hard', 1: 'no lenses', 2: 'no lenses'}}}}}}}}
```

Starting from the left, the first keyword is tearRate, which means that among all the features, tearRate has the largest information gain and the data is reduced (partitioned) fastest under it. The value of this keyword is also a dictionary. The second-level keywords are the values of the tearRate feature used to split the data set, and their values are the children of the tearRate node.

These values may be class labels, or they may be another dictionary. If a value is a class label, the child node is a leaf node; if it is another dictionary, the child node is a judgment node. Repeating this structure forms the decision tree.
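To make the leaf-node / judgment-node distinction concrete, here is a small sketch (not part of the original code) that walks the nested dictionary and counts both kinds of nodes:

```python
def count_nodes(tree):
    # A value that is not a dict is a class label, i.e. a leaf node
    if not isinstance(tree, dict):
        return 1, 0
    # The current dict is one judgment node keyed by a feature name
    leaves, judges = 0, 1
    feature = list(tree.keys())[0]
    for subtree in tree[feature].values():
        sub_leaves, sub_judges = count_nodes(subtree)
        leaves += sub_leaves
        judges += sub_judges
    return leaves, judges

# count_nodes(TheTree) returns (number of leaf nodes, number of judgment nodes)
```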

5. Saving and loading the decision tree

There are many ways to save a decision tree, but the principle is the same: serialization and deserialization. Two methods are shown below.

```python
import numpy as np

# The first way: numpy
np.save('TheTree.npy', TheTree)
read_tree = np.load('TheTree.npy', allow_pickle=True).item()
```

The first method uses the save function of the numpy library, which stores the decision tree dictionary as an .npy file. When reading it back, item() must be appended because the stored data is a dictionary; if the stored data were a matrix, item() would be dropped.

```python
# The second way: pickle
import pickle

def storeTree(inputTree, filename):
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    fr = open(filename, 'rb')
    return pickle.load(fr)
```

The second method uses the dump method of the pickle library to serialize the data, and the load method to read it back. Note that the file must be opened in binary mode for both writing and reading, otherwise an error occurs.
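For example, assuming a file name of TheTree.pkl, saving and reloading the tree looks like this:

```python
storeTree(TheTree, 'TheTree.pkl')    # serialize the dictionary to disk
read_tree = grabTree('TheTree.pkl')  # deserialize it back into a dictionary
```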

6. Using the decision tree for classification

After the decision tree is constructed, it can be used to classify real data. To classify, the decision tree, the feature label list and the test data need to be passed in. The program then compares the test data with the values in the decision tree and does so recursively until a leaf node is reached; the classification result is the class of that leaf node. The code is as follows:

```python
# The arguments are the decision tree, the data set's feature labels and the test data
def classify(inputTree, labels, testVec):
    # Get the first node (root feature) of the decision tree
    FirstStr = list(inputTree.keys())[0]
    # Fetch the dictionary under the first node
    SecondDict = inputTree[FirstStr]
    '''
    {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
    {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}
    {0: 'no', 1: 'yes'}
    '''
    # Take the index of the first node's feature in labels
    feat_index = labels.index(FirstStr)
    # Walk through the keys in the dictionary
    for key in SecondDict.keys():
        # Compare the value in testVec with the value of the tree node
        if testVec[feat_index] == key:
            # If the value is still a dictionary, continue the comparison recursively
            if type(SecondDict[key]) == dict:
                classlabel = classify(SecondDict[key], labels, testVec)
            else:
                # Otherwise a leaf node has been reached: take the class label
                classlabel = SecondDict[key]
    return classlabel
```

The feature label list labels is passed in to help determine the index of the current optimal feature in the data set. The index method finds the first element in the list that matches the FirstStr variable. The code then recursively traverses the whole tree, comparing the values in the test data testVec with the values of the tree nodes until a leaf node is reached, and returns that node's classification label.

The docstring above shows SecondDict at each level, using the tree constructed from the data in the previous article. Its purpose is to obtain the sub-dictionary under the optimal feature (the first keyword) of the current dictionary, so that it can be compared recursively with the test data.

```python
classlabel = classify(inputTree, labels, [0, 1, 0, 0])
'''
no lenses
'''
```

By executing this function, the result for the incoming data can be compared with the original data set, and the classification results are consistent.

7. Visualization of the decision tree

The main advantage of decision trees is that they are intuitive and easy to understand. Drawing a decision tree with the Matplotlib library is a very complicated process, so here is an easier way to do it.

Graphviz is a graph-drawing tool that can draw many kinds of graph structures, but the data passed to it needs to be in dot format, so here we visualize a decision tree generated with sklearn.

Graphviz needs to be downloaded manually. After installation, add the path of its folder to the system Path variable, and finally enter dot -version in CMD; if the version information is displayed, the installation and configuration succeeded.

The visual code of decision tree is as follows:

```python
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import pydotplus
import graphviz

def pic_tree(DataSet):
    # All feature data
    feature_train = DataSet.iloc[:, :-1]
    # Class label data
    the_label = DataSet.iloc[:, -1]
    # unique() deduplicates the class labels
    labels = the_label.unique().tolist()
    # Replace each label with its index in the list -- converted to a number
    the_label = the_label.apply(lambda x: labels.index(x))
    # Train the model
    clf = DecisionTreeClassifier()
    clf = clf.fit(feature_train, the_label)
    # Export the tree in dot format
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=['age', 'prescript', 'astigmatic', 'tearRate'],
                                    class_names=['no lenses', 'soft', 'hard'],
                                    filled=True, rounded=True,
                                    special_characters=True)
    # Two ways to render
    # 1. Generate a PDF image with the graphviz library
    pic = graphviz.Source(dot_data)
    pic.render('lense')
    # 2. Convert the dot data to PDF with the pydotplus library
    # graph = pydotplus.graph_from_dot_data(dot_data)
    # return graph.write_pdf("lense.pdf")
```

There are two ways to generate the decision tree image. The first uses the Source method of the graphviz library to generate a PDF image; the second uses the pydotplus library to convert the dot data into a PDF. Both render the final visualized tree as a PDF file.
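Assuming the encoded lens DataFrame from earlier is named lenses, the whole visualization is produced with one call:

```python
pic_tree(lenses)  # renders the tree and writes lense.pdf in the current directory
```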

Conclusion

Overall, this classification algorithm is easy to understand, but it is very important because it lays the foundation for learning random forests later. Each algorithm has environments it suits, and decision trees also have their own shortcomings.

Advantages of decision trees:

  • The computational complexity is not high and the output result is easy to understand.
  • Not sensitive to intermediate missing values.
  • Can process irrelevant characteristic data.
  • Trees can be graphical.

Disadvantages of decision trees:

  • Overfitting occurs when the decision tree is too complex.
  • Relatively unstable, small changes in data will also lead to the generation of different trees.
  • With imbalanced samples, the differing class weights will bias the tree.

Reply "contact lenses" in the background of the WeChat official account [milk candy cat] to get the source code and data for reference. Thank you for reading.