A decision tree is a predictive model: it represents a mapping between object attributes and object values. Each internal node in the tree represents a test on an attribute, each branch represents a possible value of that attribute, and each leaf represents the value predicted for objects that follow the path from the root to that leaf. A decision tree has a single output; if multiple outputs are needed, a separate decision tree can be built for each one. Decision trees are a frequently used technique in data mining, applicable both to analyzing data and to making predictions.
The machine learning technique of generating decision trees from data is called decision tree learning.
For example, suppose we have to answer the question "Shall we go out to play?" When making such a decision, we usually make a series of judgments or "sub-decisions": we first look at the outlook. If it is sunny, we then look at the air humidity: if humidity < 70 (humidity is less than 70), we go out to play, otherwise we do not. If the outlook is overcast, we play without any further checks, and the remaining branches are judged in the same way.
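As a toy illustration, the decision process just described can be written as a few nested conditionals. This is only a hand-coded sketch: the sunny and overcast branches follow the text above, while the rainy branch (checking whether it is windy) is an assumption borrowed from the classic play-tennis example.

def decide_to_play(outlook, humidity, windy):
    """Hand-coded version of the toy decision process described above."""
    if outlook == "sunny":
        return humidity < 70      # sunny: play only if humidity is below 70
    if outlook == "overcast":
        return True               # overcast: play, no further checks needed
    # assumed rainy branch (classic play-tennis rule): play only if it is not windy
    return not windy

print(decide_to_play("sunny", humidity=65, windy=False))    # True
print(decide_to_play("overcast", humidity=90, windy=True))  # True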
Generally, a decision tree consists of one root node, several internal nodes, and several leaf nodes. Leaf nodes correspond to decision results, and every other node corresponds to an attribute test; the samples contained in a node are divided among its child nodes according to the outcome of that test. The root node contains the complete set of samples, and the path from the root to each leaf corresponds to a sequence of decision tests. The goal of decision tree learning is to produce a tree with strong generalization ability, that is, a strong ability to handle unseen examples. The basic procedure follows the simple and intuitive "divide and conquer" strategy, as shown below:
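The figure referred to above is not reproduced here; as a stand-in, the following is a minimal Python sketch of the divide-and-conquer skeleton. The trivial "take the first remaining attribute" choice is a placeholder: ID3, C4.5 and CART differ precisely in how this splitting attribute is selected.

from collections import Counter

def tree_generate(samples, attributes):
    """Divide-and-conquer skeleton of decision tree generation (a sketch, not ID3 itself).
    samples:    list of (attribute_dict, label) pairs
    attributes: attribute names still available for splitting
    """
    labels = [label for _, label in samples]
    # Case 1: all samples belong to the same class -> leaf node with that class
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left to test -> leaf node with the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 3: pick an attribute, split the samples on its values and recurse.
    # Placeholder choice: the first remaining attribute (ID3/C4.5/CART would
    # choose by information gain / gain ratio / Gini index instead).
    best = attributes[0]
    tree = {best: {}}
    for value in set(sample[best] for sample, _ in samples):
        subset = [(x, y) for x, y in samples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = tree_generate(subset, remaining)
    return tree

toy_data = [({"outlook": "sunny", "windy": "no"}, "play"),
            ({"outlook": "rain",  "windy": "yes"}, "no play")]
print(tree_generate(toy_data, ["outlook", "windy"]))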
Before we can build a decision tree, we must first understand a very important concept: information entropy.
In 1948, Shannon introduced the concept of entropy. The amount of information carried by a message is directly related to its uncertainty: to clarify something highly uncertain, or something we know nothing about, we need a great deal of information. The measure of information is therefore equal to the amount of uncertainty.
Information entropy is the most commonly used measure of the purity of a sample set. Assuming that the proportion of samples of the k-th class in the current sample set D is p_k (k = 1, 2, ..., m), the information entropy of D is defined as:

Ent(D) = - Σ_{k=1..m} p_k · log2(p_k)

(in the ID3 discussion below this quantity is also written Info(D)).
The greater the uncertainty of a variable, the greater the entropy of information.
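As a quick numerical check, here is a minimal sketch; the 9-positive / 5-negative split is the same class distribution that gives the 0.940 value used in the ID3 example below.

from math import log2

def entropy(counts):
    """Ent(D) = -sum(p_k * log2(p_k)), written here as sum(p_k * log2(1/p_k))."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

print(entropy([9, 5]))   # 14 samples, 9 "yes" and 5 "no" -> about 0.940
print(entropy([7, 7]))   # two equally likely classes -> 1.0, maximum uncertainty
print(entropy([14, 0]))  # a pure sample set -> 0.0, no uncertainty at all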
Decision Tree Induction Algorithm (ID3)
Information gain: Gain(A) = Info(D) - Info_A(D), where Info(D) is the entropy of D before splitting on attribute A, and Info_A(D) is the weighted entropy of D after splitting on A. In general, the greater the information gain, the greater the "purity improvement" achieved by partitioning on attribute A. We can therefore use information gain to choose the splitting attribute of the decision tree. The famous ID3 decision tree learning algorithm selects splitting attributes based on information gain.
For the AllElectronics data shown in the figure above, for example,
Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246.
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so
Gain(age) > Gain(student) > Gain(credit_rating) > Gain(income).
Age has the largest information gain and is therefore chosen as the first splitting attribute; repeating the same steps on each resulting subset yields the full tree.
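These gain values can be reproduced with a few lines of Python. The per-branch class counts below (age splits the 14 samples into groups of 5, 4 and 5) are assumed from the standard AllElectronics table that the figure is based on, not taken from anything shown in the text above.

from math import log2

def entropy(counts):
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

def info_gain(total_counts, partition_counts):
    """Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted
    entropy of the subsets produced by splitting on attribute A."""
    n = sum(total_counts)
    info_d = entropy(total_counts)
    info_a = sum(sum(part) / n * entropy(part) for part in partition_counts)
    return info_d - info_a

# 14 samples in total: 9 buy a computer ("yes"), 5 do not ("no")
total = [9, 5]
# Class counts [yes, no] in each "age" branch (assumed AllElectronics counts):
# youth, middle_aged, senior
age_partitions = [[2, 3], [4, 0], [3, 2]]
print(info_gain(total, age_partitions))   # about 0.246, matching Gain(age) above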
Example Python code is as follows:
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing

# Read in the csv file and put features into a list of dicts
# and the class labels into a list
allElectronicsData = open(r'/home/zhoumiao/MachineLearning/01decisiontree/AllElectronics.csv', 'r')
reader = csv.reader(allElectronicsData)

headers = next(reader)  # the first row holds the column names
print(headers)

featureList = []
labelList = []
for row in reader:
    # the last column is the class label
    labelList.append(row[len(row) - 1])
    # columns 1 .. n-2 are the attributes (column 0 is the record ID)
    rowDict = {}
    for i in range(1, len(row) - 1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
print(featureList)

# Vectorize features (one-hot encode the categorical attributes)
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX: " + str(dummyX))
print(vec.get_feature_names_out())
print("labelList: " + str(labelList))

# Vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))

# Use a decision tree for classification
# clf = tree.DecisionTreeClassifier()  # the default criterion is 'gini'
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))

# Visualize the model by exporting it in Graphviz .dot format
with open("allElectronicInformationGainOri.dot", 'w') as f:
    f = tree.export_graphviz(clf, feature_names=vec.get_feature_names_out(), out_file=f)

# Predict on a slightly modified copy of the first training row
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))

newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX: " + str(newRowX))

predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY: " + str(predictedY))
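The exported allElectronicInformationGainOri.dot file can be rendered with the Graphviz command-line tool, for example dot -Tpdf allElectronicInformationGainOri.dot -o tree.pdf (assuming Graphviz is installed), to obtain a drawing of the learned tree.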
Other algorithms:
C4.5: (J. R. Quinlan)
Classification and Regression Trees (CART): (L. Breiman, J. Friedman, R. Olshen, C. Stone)
Common ground: greedy algorithms, top-down construction
Differences: different measures for attribute selection: ID3 (information gain), C4.5 (gain ratio), CART (Gini index)
The guiding rule for splitting a data set is to make disordered data more ordered. If a training set has 20 features, which one should be chosen as the basis for splitting? This requires a quantitative criterion, and one family of criteria comes from information theory. Decision tree algorithms based on such measures include ID3, C4.5 and CART, where C4.5 and CART are both derived from the ID3 algorithm.
CART and C4.5 support continuously distributed features, mainly by handling a continuous variable with a binary split: a specific split value is chosen, samples whose feature value is greater than the split value go to one subtree and the rest go to the other. The split value is chosen so as to reduce the "degree of disorder" within the resulting subtrees; C4.5 and CART define this disorder differently, as sketched below.
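Below is a minimal sketch of such a binary-split search, using entropy as the measure of disorder. Taking the midpoints between consecutive sorted values as candidate thresholds is one common convention, not the exact rule of either algorithm.

from math import log2

def label_entropy(labels):
    total = len(labels)
    return sum(labels.count(c) / total * log2(total / labels.count(c))
               for c in set(labels))

def best_binary_split(values, labels):
    """Try midpoints between consecutive sorted values and return the threshold
    that minimizes the weighted entropy of the two resulting subsets."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * label_entropy(left)
                 + len(right) * label_entropy(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

humidity = [65, 70, 75, 80, 85, 90]
play     = ["yes", "yes", "no", "no", "no", "no"]
print(best_binary_split(humidity, play))   # 72.5 separates the two classes cleanly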
The ID3 algorithm was invented by Ross Quinlan and is based on Occam's razor: other things being equal, a smaller decision tree is preferable to a larger one. ID3 evaluates and selects features by their information gain, choosing the feature with the maximum information gain at each step as the decision node. ID3 can be used to split nominal data sets and performs no pruning. To reduce overfitting, adjacent leaf nodes that do not yield much information gain can be merged by pruning (for example, by setting an information gain threshold). Information gain also has a drawback: it favors attributes with many distinct values; that is, the more distinct values an attribute takes in the training set, the more likely it is to be chosen as the splitting attribute, which is sometimes meaningless. In addition, ID3 cannot handle continuously distributed features, which motivated the C4.5 algorithm. The CART algorithm also supports continuously distributed features.
C4.5 is an improved version of ID3 that inherits ID3's advantages. It selects attributes by the information gain ratio, which overcomes the bias toward many-valued attributes that arises when selecting attributes by information gain; it can discretize continuous attributes; and it can handle incomplete data. The classification rules produced by C4.5 are easy to understand and accurate. However, it is relatively inefficient, because building the tree requires repeatedly scanning and sorting the data set; and because multiple passes over the data are needed, C4.5 is only suitable for data sets that fit in memory.
CART stands for Classification And Regression Tree. It uses the Gini index as the splitting criterion (selecting the feature with the smallest Gini index) and also includes a post-pruning step. Although ID3 and C4.5 can mine as much information as possible from the training set, the decision trees they produce tend to have many branches and a large scale. To simplify the tree and improve the efficiency of tree generation, CART selects test attributes according to the Gini index.
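To make the difference between the three criteria concrete, here is a small sketch that computes all of them for one candidate split; the partition counts are the same assumed AllElectronics "age" counts used earlier.

from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)

def gini(counts):
    """Gini(D) = 1 - sum(p_k^2): the smaller the value, the purer the set."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_measures(total_counts, partitions):
    """Information gain (ID3), gain ratio (C4.5) and Gini index (CART)
    for one candidate attribute whose branches have the given class counts."""
    n = sum(total_counts)
    weights = [sum(p) / n for p in partitions]
    gain = entropy(total_counts) - sum(w * entropy(p) for w, p in zip(weights, partitions))
    split_info = sum(w * log2(1 / w) for w in weights)   # intrinsic value of the split
    gain_ratio = gain / split_info
    gini_index = sum(w * gini(p) for w, p in zip(weights, partitions))
    return gain, gain_ratio, gini_index

# "age" split of the 14 AllElectronics samples, class counts [yes, no] per branch
print(split_measures([9, 5], [[2, 3], [4, 0], [3, 2]]))
# roughly (0.246, 0.156, 0.343): ID3 maximizes the gain, C4.5 maximizes the
# gain ratio, and CART prefers the split with the smallest Gini index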
About pruning (Avoiding overfitting)
When a decision tree is actually built, pruning is usually carried out to deal with the overfitting caused by noise and outliers in the data. There are two kinds of pruning:
Pre-pruning: during construction, when a node meets the pruning condition, growth of that branch is stopped.
Post-pruning: a complete decision tree is built first and is then pruned by traversing it under certain conditions (a small scikit-learn sketch follows this list).
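A minimal scikit-learn sketch of both styles, using the library's built-in options: max_depth and min_samples_leaf act as pre-pruning constraints, while cost-complexity pruning via ccp_alpha implements post-pruning. The iris data set is used only because it ships with scikit-learn.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing branches early by limiting depth and leaf size
pre_pruned = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5)
pre_pruned.fit(X, y)

# Post-pruning: grow the full tree, then cut it back with cost-complexity pruning
path = DecisionTreeClassifier(criterion='entropy').cost_complexity_pruning_path(X, y)
ccp_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # pick one alpha on the path
post_pruned = DecisionTreeClassifier(criterion='entropy', ccp_alpha=ccp_alpha)
post_pruned.fit(X, y)

print("pre-pruned leaves: ", pre_pruned.get_n_leaves())
print("post-pruned leaves:", post_pruned.get_n_leaves())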
Advantages and disadvantages of decision trees
Advantages: intuitive, easy to understand, effective in small data sets
Disadvantages: 1. poor handling of continuous variables; 2. errors increase quickly when there are many classes; 3. only moderate scalability.
Reference: Li Hang, Statistical Learning Methods.