A decision tree is a basic method for classification and regression; this article mainly discusses the classification decision tree. The decision tree model is a tree structure that, in a classification problem, represents the process of classifying instances based on their features. It can be regarded as a set of if-then rules, or as a conditional probability distribution defined on the feature space and the class space. Compared with naive Bayes classification, an advantage of decision trees is that building them requires no domain knowledge or parameter setting, so in practice they are well suited to exploratory knowledge discovery.

4.1 Decision tree theory

**Definition:** A classification decision tree model is a tree structure that describes the classification of instances. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes, each representing a feature (attribute), and leaf nodes, each representing a class. To classify an instance, start from the root node and test the corresponding feature of the instance; according to the test result, send the instance down to one of the node's children, where each child corresponds to one value of that feature. This continues recursively until a leaf node is reached, and the instance is then assigned to the class of that leaf.
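To make the if-then view concrete, here is a minimal Python sketch of a small learned tree acting as nested if-then rules; the feature names (`outlook`, `humidity`) and the classes are purely hypothetical.

```python
# A toy decision tree expressed as nested if-then rules.
# Feature names and classes are hypothetical, for illustration only.

def classify(instance: dict) -> str:
    """Walk from the root: test one feature at each internal node,
    follow the branch matching its value, and return the class at the leaf."""
    if instance["outlook"] == "sunny":          # internal node: test "outlook"
        if instance["humidity"] == "high":      # internal node: test "humidity"
            return "no"                         # leaf node: class "no"
        return "yes"                            # leaf node: class "yes"
    return "yes"                                # leaf node: class "yes"

print(classify({"outlook": "sunny", "humidity": "high"}))  # -> no
```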

4.1.1 Information theory basis of decision tree ID3 algorithm

In 1948, Claude Shannon introduced the concept of entropy: the amount of information carried by a message is directly related to its uncertainty. To clarify something highly uncertain, or something we know nothing about, we need a great deal of information; the amount of information required is therefore a measure of that uncertainty. This is the origin of entropy.

In the 1970s, J. Ross Quinlan found a way to use entropy from information theory to guide the feature-selection process when growing a decision tree. Its simplicity and efficiency caused a sensation as soon as it appeared; Quinlan called the algorithm ID3. Let's look at how ID3 selects features. First, we need the concept of entropy from information theory: entropy measures the uncertainty of something, and the more uncertain it is, the greater its entropy.

Let $X$ be a discrete random variable taking finitely many values, with probability distribution

$$P(X = x_i) = p_i, \quad i = 1, 2, \dots, n$$

The entropy of the random variable $X$ is

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i$$

Here $n$ is the number of distinct values that $X$ can take, $p_i$ is the probability that $X = x_i$, and $\log$ is taken to base 2 or base $e$. When the base is 2, the unit of entropy is the bit; when the base is $e$, the unit is the nat. The higher the entropy, the greater the uncertainty of the random variable. From the definition one can verify that $0 \le H(p) \le \log n$.
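As a quick illustration, here is a small Python sketch that computes the empirical entropy of a list of observed values using the formula above; the helper name `entropy` is ours, not from any particular library.

```python
import math
from collections import Counter

def entropy(values, base=2):
    """H(X) = -sum_i p_i log p_i, estimated from a list of observations.
    base=2 gives bits, base=math.e gives nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(values).values())

print(entropy(["a", "a", "b", "b"]))  # -> 1.0 bit (uniform over two values)
```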

To get a feel for entropy, suppose $X$ has two possible values, each with probability $1/2$. Then $X$ has maximum uncertainty and its entropy is maximal:

$$H(X) = -\left(\tfrac{1}{2}\log\tfrac{1}{2} + \tfrac{1}{2}\log\tfrac{1}{2}\right) = \log 2$$

If one value has probability greater than $1/2$ and the other less than $1/2$, the uncertainty decreases and the corresponding entropy decreases. For example, with probabilities $1/3$ and $2/3$:

$$H(X) = -\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}\right) = \log 3 - \tfrac{2}{3}\log 2 < \log 2$$
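These two worked examples can be checked numerically; the snippet below uses natural logarithms to match the closed forms above.

```python
import math

# Two equally likely values: maximum entropy, equal to log 2.
h_uniform = -(0.5 * math.log(0.5) + 0.5 * math.log(0.5))
print(h_uniform, math.log(2))                       # both ~0.6931

# Probabilities 1/3 and 2/3: entropy drops below log 2.
h_skewed = -((1/3) * math.log(1/3) + (2/3) * math.log(2/3))
print(h_skewed, math.log(3) - (2/3) * math.log(2))  # both ~0.6365 < 0.6931
```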

Having defined the entropy of a single variable $X$, it is easy to generalize to the joint entropy of multiple variables. For two variables $X$ and $Y$, the joint entropy is

$$H(X, Y) = -\sum_{i=1}^{n} p(x_i, y_i)\,\log p(x_i, y_i)$$
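A minimal sketch of the joint entropy, estimated from paired samples (the function name and the base-2 choice are our assumptions):

```python
import math
from collections import Counter

def joint_entropy(xs, ys):
    """H(X, Y) = -sum p(x, y) log p(x, y), estimated from paired observations."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(zip(xs, ys)).values())

print(joint_entropy(["a", "a", "b", "b"], [0, 1, 0, 1]))  # -> 2.0 bits (4 equally likely pairs)
```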

From the joint entropy we can obtain the conditional entropy $H(X \mid Y)$. Analogous to a conditional probability, it measures the uncertainty remaining in $X$ once $Y$ is known:

$$H(X \mid Y) = -\sum_{i=1}^{n} p(x_i, y_i)\,\log p(x_i \mid y_i) = \sum_{j=1}^{n} p(y_j)\,H(X \mid y_j)$$
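The second form of the identity, $\sum_j p(y_j)\,H(X \mid y_j)$, translates directly into code; again this is a sketch over observed pairs, with hypothetical names.

```python
import math
from collections import Counter

def conditional_entropy(xs, ys):
    """H(X|Y) = sum_j p(y_j) * H(X | Y = y_j), estimated from paired samples."""
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        sub = Counter(x for x, yy in zip(xs, ys) if yy == y)  # X restricted to Y = y
        h_given_y = -sum((c / cy) * math.log2(c / cy) for c in sub.values())
        h += (cy / n) * h_given_y
    return h

print(conditional_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]))  # -> 0.0 (Y determines X)
```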

$H(X)$ measures the uncertainty of $X$, and the conditional entropy $H(X \mid Y)$ measures the uncertainty remaining in $X$ after $Y$ is known. What, then, does $H(X) - H(X \mid Y)$ measure? From the description above, it measures the reduction in the uncertainty of $X$ gained by knowing $Y$. In information theory this quantity is called the mutual information, $I(X, Y)$; in the decision tree algorithm ID3 it is called the information gain. ID3 uses information gain to decide which feature the current node should split on when building the tree: the larger the information gain, the better suited the feature is for classification.
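Putting the pieces together, here is a small sketch of how ID3-style information gain, $H(D) - H(D \mid A)$, compares candidate features at a node; the toy data and helper names are hypothetical.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Empirical entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Gain = H(labels) - H(labels | feature); ID3 splits on the feature
    with the largest gain at the current node."""
    n = len(labels)
    h_cond = 0.0
    for value, count in Counter(feature).items():
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        h_cond += (count / n) * entropy_of(subset)
    return entropy_of(labels) - h_cond

labels = ["yes", "yes", "no", "no"]
feat_a = ["x", "x", "y", "y"]   # separates the classes perfectly
feat_b = ["x", "y", "x", "y"]   # carries no information about the class
print(information_gain(feat_a, labels))  # -> 1.0 bit
print(information_gain(feat_b, labels))  # -> 0.0 bits
```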
