1. Introduction
The previous article mainly covered the following topics:
- An introduction to decision trees
- The workflow of a decision tree
- What entropy is and how it is calculated
- The definition of information gain and how to compute it
- Splitting a data set according to information gain
Based on a new data set (the contact lens data set), this article implements the construction of a decision tree, saving and loading the tree, classifying with the tree, and visualizing the tree. The earlier material is not reviewed in depth; the focus is on these four aspects.
Take a look at the data set:
The data can be obtained from the UCI machine learning repository. It has four features: age, prescript (prescription), astigmatic (astigmatism), and tearRate (tear production rate), plus a class label named class, whose possible values are hard, soft, and no lenses.
To make processing easier, the samples are encoded as follows (a short loading/encoding sketch follows the list):
- age: young -> 0, pre -> 1, presbyopic -> 2
- prescript: myope -> 0, hyper -> 1
- astigmatic: no -> 0, yes -> 1
- tearRate: reduced -> 0, normal -> 1
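For reference, here is a minimal loading and encoding sketch. The file name lenses.txt, the tab-separated layout, and the helper name load_lenses are assumptions for illustration rather than the original article's code; adjust them to match your copy of the UCI data.

import pandas as pd

def load_lenses(path='lenses.txt'):
    # Column names follow the feature description above; the class label keeps
    # its text values ('no lenses', 'soft', 'hard') so the tree output stays readable.
    columns = ['age', 'prescript', 'astigmatic', 'tearRate', 'class']
    df = pd.read_csv(path, sep='\t', header=None, names=columns)
    mappings = {
        'age':        {'young': 0, 'pre': 1, 'presbyopic': 2},
        'prescript':  {'myope': 0, 'hyper': 1},
        'astigmatic': {'no': 0, 'yes': 1},
        'tearRate':   {'reduced': 0, 'normal': 1},
    }
    for col, mapping in mappings.items():
        df[col] = df[col].map(mapping)
    return df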
4. Constructing the decision tree
Before constructing the decision tree, let's review how the sub-modules from the previous article work together: take the original data set and split it on the optimal feature. Since a feature may take more than two values, a split may produce more than two branches. After the first split, the data is passed down to the next node of the tree, where it is split again, so the data set is processed by recursion.
When does the splitting end? Splitting stops when the program has used up all the features of the partitioned data set, or when all instances under a branch belong to the same class.
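The CreateTree() function below relies on two helpers from the previous article: ChooseBF(), which picks the feature with the largest information gain, and splitSet(), which extracts the subset where a feature equals a given value and drops that column. The previous article's exact code may differ; the following is only a sketch of the behaviour CreateTree() assumes.

import numpy as np
import pandas as pd

def CalEnt(DataSet):
    # Shannon entropy of the class label column (last column)
    counts = DataSet.iloc[:, -1].value_counts()
    p = counts / DataSet.shape[0]
    return -(p * np.log2(p)).sum()

def ChooseBF(DataSet):
    # Return the column index of the feature with the largest information gain
    base_ent = CalEnt(DataSet)
    best_gain, best_col = 0, 0
    for col in range(DataSet.shape[1] - 1):
        new_ent = 0
        for value in set(DataSet.iloc[:, col]):
            child = DataSet[DataSet.iloc[:, col] == value]
            new_ent += child.shape[0] / DataSet.shape[0] * CalEnt(child)
        gain = base_ent - new_ent
        if gain > best_gain:
            best_gain, best_col = gain, col
    return best_col

def splitSet(DataSet, col, value):
    # Keep the rows where the chosen feature equals `value`, then drop that column
    sub = DataSet[DataSet.iloc[:, col] == value]
    return sub.drop(DataSet.columns[col], axis=1)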
The process of constructing the decision tree is to fill each partition of the data into a dictionary; when the splitting of the data set ends, so does the filling of the dictionary. This is itself a recursive process, and once it finishes the decision tree is complete. The code is as follows:
def CreateTree(DataSet):
    # Get all feature labels (column names)
    index_list = list(DataSet.columns)
    # Count the values of the last column (the class label)
    label_series = DataSet.iloc[:, -1].value_counts()
    # Stop if all samples share one class label, or only the label column is left
    if label_series.iloc[0] == DataSet.shape[0] or DataSet.shape[1] == 1:
        return label_series.index[0]  # return the class label
    # Get the column index of the optimal feature
    col = ChooseBF(DataSet)
    # Get the optimal feature label
    BestFeature = index_list[col]
    # Store the optimal feature as the key of a nested dictionary
    TheTree = {BestFeature: {}}
    # Delete the feature label from the label list
    del index_list[col]
    # Extract all attribute values of the optimal column
    value_list = set(DataSet.iloc[:, col])
    # Build the tree recursively; each call works on the current optimal feature
    for value in value_list:
        TheTree[BestFeature][value] = CreateTree(splitSet(DataSet, col, value))
    return TheTree
The first stopping condition of the recursive function is that all the class labels are identical. The second is that all features of the data set have been used up, i.e. the data set cannot be split any further; in that case the most frequent class label is returned. The dictionary variable TheTree stores all the information of the tree, and BestFeature is the current optimal feature.
CreateTree() is called recursively on each partition of the data set, with the subset produced by the split passed in as the argument, and the return value is inserted into the dictionary TheTree. When the recursion ends, the dictionary contains many nested entries representing the leaf-node information.
The resulting TheTree dictionary looks like this:
{'tearRate': {0: 'no lenses', 1: {'astigmatic': {0: {'age': {0: 'soft', 1: 'soft', 2: {'prescript': {0: 'no lenses', 1: 'soft'}}}}, 1: {'prescript': {0: 'hard', 1: {'age': {0: 'hard', 1: 'no lenses', 2: 'no lenses'}}}}}}}}
Reading from the left, the first key is tearRate, which means that among all the features, tearRate has the largest information gain and reduces (partitions) the data fastest. The value of this key is another dictionary: its keys are the values of tearRate used to split the data set, and their values are the children of the tearRate node.
These values may be class labels or yet another dictionary. If a value is a class label, the child is a leaf node; if it is a dictionary, the child is a judgment node. Repeating this structure forms the decision tree.
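As an illustration of this leaf/judgment-node distinction, here is a small sketch (not from the original article) that walks the nested dictionary to count leaves and measure depth:

def get_num_leafs(tree):
    # A string value is a leaf; a dict value is another judgment node
    first_key = list(tree.keys())[0]
    num = 0
    for value in tree[first_key].values():
        num += get_num_leafs(value) if isinstance(value, dict) else 1
    return num

def get_tree_depth(tree):
    # Each nested dictionary adds one level of depth
    first_key = list(tree.keys())[0]
    depth = 0
    for value in tree[first_key].values():
        d = 1 + get_tree_depth(value) if isinstance(value, dict) else 1
        depth = max(depth, d)
    return depth

For the TheTree dictionary shown above, these return 9 leaves and a depth of 4.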
5. Saving and loading the decision tree
There are many ways to save a decision tree, but the principle is the same: serialization and deserialization. Two methods are shown below.
# The first way: numpy
import numpy as np

np.save('TheTree.npy', TheTree)
read_tree = np.load('TheTree.npy', allow_pickle=True).item()
The first method uses the save function of the numpy library, which stores the dictionary-format decision tree as an .npy file. When reading it back, item() must be appended because the stored object is a dictionary; if the stored data were a matrix, item() would be omitted.
# The second way: pickle
import pickle

def storeTree(inputTree, filename):
    # Serialize the tree to a binary file
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    # Deserialize the tree from a binary file
    fr = open(filename, 'rb')
    return pickle.load(fr)
The second method uses the dump function of the pickle library to serialize the data and the load function to read it back. Note that both writing and reading must be done in binary mode ('wb' and 'rb'), otherwise errors will occur.
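A quick usage sketch (the file name TheTree.pkl is arbitrary):

storeTree(TheTree, 'TheTree.pkl')
read_tree = grabTree('TheTree.pkl')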
6. Classifying with the decision tree
Once the decision tree has been constructed, it can be used to classify real data. To classify, we pass in the decision tree, the list of feature labels, and the test data. The program then compares the test data with the values stored in the tree, recursing until a leaf node is reached; the classification result is the class of that leaf node. The code is as follows:
# The inputs are the decision tree, the feature labels of the data set, and the test data
def classify(inputTree, labels, testVec):
    # Get the first node (root) of the decision tree
    FirstStr = list(inputTree.keys())[0]
    # Fetch the dictionary below the first node
    SecondDict = inputTree[FirstStr]
    '''
    {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
    {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}
    {0: 'no', 1: 'yes'}
    '''
    # Take the index of the first node in labels
    feat_index = labels.index(FirstStr)
    # Walk through the keys in the dictionary
    for key in SecondDict.keys():
        # Compare the value in testVec with the value of the tree node
        if testVec[feat_index] == key:
            # If the value is still a dictionary, continue the recursive comparison
            if type(SecondDict[key]) == dict:
                classlabel = classify(SecondDict[key], labels, testVec)
            else:
                # Otherwise a leaf node has been reached: take its class label
                classlabel = SecondDict[key]
    return classlabel
The purpose of passing in the feature-label list labels is to determine, at each step, the index of the current optimal feature in the data set. The index method finds the first element in the list that matches the FirstStr variable. The code then traverses the tree recursively, comparing the values in the test vector testVec with the values of the tree nodes until a leaf node is reached, and returns that node's class label.
The comment block inside the function shows SecondDict at successive recursion levels, using the tree constructed from the data in the previous article. It illustrates how the value under the current optimal feature (the first key) of the current dictionary is obtained, so that it can be compared recursively with the test data.
classlabel = classify(inputTree, labels, [0, 1, 0, 0])
'''
no lenses
'''
By running this function, the predictions for the passed-in samples can be compared with the labels in the original data, and the classification results agree.
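To check more than one sample at a time, a small sketch (not in the original article) can re-classify every row of the training data and report the accuracy; the feature labels are taken from the DataFrame's own columns:

def acc_classify(TheTree, DataSet):
    # Re-classify every training sample with the dictionary tree
    labels = list(DataSet.columns[:-1])
    correct = 0
    for i in range(DataSet.shape[0]):
        pred = classify(TheTree, labels, DataSet.iloc[i, :-1].tolist())
        if pred == DataSet.iloc[i, -1]:
            correct += 1
    return correct / DataSet.shape[0]

On the training set itself the accuracy should be 1.0, since the tree was grown until each branch was pure, unless the features were exhausted before a branch became pure.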
7. Visualizing the decision tree
The main advantage of decision trees is that they are intuitive and easy to understand. Drawing a decision tree with the Matplotlib library is a fairly involved process, so a simpler approach is used here.
Graphviz is a graph-drawing tool that can render many graph structures, but the input has to be in DOT format, so here we visualize a decision tree generated by sklearn.
Graphviz needs to be downloaded and installed manually. After installation, add the folder containing its executables to the system Path variable, then run dot -version in CMD; if version information is displayed, the installation and configuration succeeded.
The visualization code for the decision tree is as follows:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import pydotplus
import graphviz

def pic_tree(DataSet):
    # All feature columns
    feature_train = DataSet.iloc[:, :-1]
    # The class label column
    the_label = DataSet.iloc[:, -1]
    # unique() removes duplicate class labels
    labels = the_label.unique().tolist()
    # Replace each label with its index in the list, i.e. convert it to a number
    the_label = the_label.apply(lambda x: labels.index(x))
    # Train the model
    clf = DecisionTreeClassifier()
    clf = clf.fit(feature_train, the_label)
    # Export the tree in DOT format
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=['age', 'prescript', 'astigmatic', 'tearRate'],
                                    class_names=['no lenses', 'soft', 'hard'],
                                    filled=True, rounded=True, special_characters=True)
    # Two ways to render it
    # 1. Generate a PDF image with the graphviz library
    pic = graphviz.Source(dot_data)
    pic.render('lense')
    # 2. Convert the DOT data to PDF with the pydotplus library
    # graph = pydotplus.graph_from_dot_data(dot_data)
    # return graph.write_pdf("lense.pdf")
There are two ways to generate the decision-tree image: the first uses the Source method of the graphviz library to produce a PDF, and the second uses the pydotplus library to convert the DOT data into a PDF; in both cases the rendered tree ends up in lense.pdf.
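A call sketch, assuming the encoded DataFrame produced by the hypothetical load_lenses() helper from the loading sketch earlier:

lense_df = load_lenses()  # hypothetical helper defined in the loading sketch
pic_tree(lense_df)        # trains the sklearn tree and writes lense.pdf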
Conclusion
Overall, this classification algorithm is easy to understand, but it is important because it lays the foundation for learning random forests later. Every algorithm has environments it suits, and decision trees have shortcomings of their own.
Advantages of decision trees:
- The computational complexity is not high, and the output is easy to understand.
- Insensitive to missing intermediate values.
- Can handle irrelevant feature data.
- The tree can be visualized.
Disadvantages of decision trees:
- Overfitting occurs when the decision tree becomes too complex.
- Relatively unstable: small changes in the data can produce a different tree.
- With imbalanced samples, the differing class weights bias the tree.
Reply "contact lenses" in the background of the official account [milk candy cat] to get the source code and data for reference. Thank you for reading.