
In the future, I will sum up my four years in college. Although it was not perfect, I really tried my best to do everything I wanted to do.

Algorithm idea

A decision tree is not necessarily a binary tree; it can have multiple branches.


In short, it makes judgments layer by layer until it reaches a final result.

🥕 Tree structure

The ellipses are internal nodes, each representing a feature (attribute); the topmost one is the root node, where all the branching starts.

The rectangles are leaf nodes, each representing a class.

If-then rules: each path from the root node to a leaf node corresponds to one if-then rule.

Information gain

A decision tree judges layer by layer, based on the feature at each node, until it reaches the final classification result. Information gain exists to find the optimal feature: in what order should the features be used so that the final judgment is the most reasonable?

1. Concept of entropy

Information gain is built on entropy. Entropy measures the uncertainty of a random variable: the greater the uncertainty, the greater the entropy.

Since entropy is determined by the distribution of the random variable, we get:


H(p) = -\sum_{i=1}^{n} p_i \log p_i

When the values of the random variable are equally likely (a uniform distribution), the entropy is at its maximum. In other words, when every value of a feature is equally probable, the uncertainty is greatest.
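As a quick sanity check, here is a minimal sketch (my own, not from the original article) of the entropy formula in Python, using base-2 logarithms as in the formulas below. It shows that for two outcomes the uniform distribution carries the most uncertainty:

```python
import math

def entropy(probs):
    """H(p) = -sum(p_i * log2(p_i)) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   -> uniform: the greatest uncertainty
print(entropy([0.9, 0.1]))   # ~0.47 -> skewed: less uncertainty
print(entropy([1.0, 0.0]))   # -0.0 (zero) -> a certain outcome: no uncertainty
```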

2. Information gain

Input: Training data set D and feature A.

Output: the information gain g(D, A) of feature A with respect to training data set D.

🚶♂️ three steps ~

Step 1 Calculate the empirical entropy H(D) of data set D


H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}

Step 2 Calculate the empirical conditional entropy H(D|A) of data set D given feature A

H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}

where D_i is the subset of D in which feature A takes its i-th value, and D_{ik} is the set of instances in D_i that belong to class C_k.

Step 3 Calculate the information gain


g(D,A) = H(D) - H(D|A)

The information gain represents the degree to which the uncertainty about the class Y is reduced once feature A is known.
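To make the three steps concrete, here is a minimal Python sketch (function and variable names are my own, not from the repository linked at the end): entropy, conditional entropy, and information gain computed from a feature column and a parallel list of class labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Step 1: empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(values, labels):
    """Step 2: H(D|A) = sum_i |D_i|/|D| * H(D_i), splitting on feature values."""
    n = len(labels)
    return sum(
        (values.count(v) / n) * entropy([y for x, y in zip(values, labels) if x == v])
        for v in set(values)
    )

def information_gain(values, labels):
    """Step 3: g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(values, labels)

# Toy usage with made-up data (not the loan table from the example below)
has_house = ["yes", "yes", "no", "no", "no"]
approve   = ["yes", "yes", "no", "yes", "no"]
print(information_gain(has_house, approve))   # ~0.42
```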

That is a lot of formulas. Confused? 😅 The example below makes it clear immediately ~

3. Example

Input: Training data set D

Output: the optimal feature, selected by calculating the information gain

Step 1 Calculate the empirical entropy H(D)

In this training data set, |D| = 15, and there are two final classes: 'yes' (9 instances) and 'no' (6 instances).

Substituting into the formula, the empirical entropy is:

H(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971

Step 2 Calculate the empirical conditional entropy H(D|A)

Let A_1 = age, A_2 = whether you have a job, A_3 = whether you own a house, A_4 = credit status

(1) age

The empirical conditional entropy of youth, middle age and old age can be obtained by substitution calculation:

i = 1, youth:

i = 2, middle age:

i = 3, old age:

Final empirical conditional entropy:

Step 3 Calculate the information gain

Similarly, we can calculate the information gain of the other features.

Continue with the above three steps and summarize the results into a table:

It can be seen that the feature of owning one's own house has the smallest empirical conditional entropy and the largest information gain, which means that choosing this feature leaves the least uncertainty and gives the clearest classification, so it can be selected as the optimal feature 👍

A careful reader will of course notice that different features have different numbers of values: some have three (such as age: youth, middle-aged, old), others have two (such as owning a house). Features with more values tend to produce a larger information gain, so information gain is biased toward features with many values.

🤔 How can we reduce this bias toward features with more values?

Don’t worry, the information gain ratio is coming

Information gain ratio

The entropy of training data set D with respect to the values of feature A:


H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}

Information gain ratio:


g_R(D,A) = \frac{g(D,A)}{H_A(D)}

So how do we calculate H_A(D)?

Take feature A_1 = age as an example:

Similarly, we calculate the information gain ratio for features A_2, A_3, and A_4, and list them in the table:

Comparing the two features 'has a job' and 'credit status': from the perspective of information gain, credit status is the more appropriate optimal feature; from the perspective of the information gain ratio, 'has a job' is the more appropriate choice.

To sum up, the information gain tends to favor features with more values, while the information gain ratio tends to favor features with fewer values.
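To see this bias concretely, here is a small self-contained sketch with made-up toy data (my own, not from the article): an ID-like feature with a unique value per instance gets a perfect information gain, while the gain ratio penalizes it in favour of a two-valued feature.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    n = len(labels)
    h_cond = sum(
        (values.count(v) / n) * entropy([y for x, y in zip(values, labels) if x == v])
        for v in set(values)
    )
    return entropy(labels) - h_cond

def gain_ratio(values, labels):
    """g_R(D, A) = g(D, A) / H_A(D); H_A(D) is the entropy of the split itself."""
    h_a = entropy(values)        # entropy of D with respect to feature A's values
    return info_gain(values, labels) / h_a if h_a > 0 else 0.0

labels    = ["yes", "yes", "no", "no", "yes", "no"]
many_vals = ["a", "b", "c", "d", "e", "f"]   # ID-like feature: unique value per instance
two_vals  = ["x", "x", "y", "y", "y", "y"]   # binary feature

print(info_gain(many_vals, labels), gain_ratio(many_vals, labels))  # 1.0, ~0.39
print(info_gain(two_vals, labels),  gain_ratio(two_vals, labels))   # ~0.46, ~0.50
```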

Decision tree generation algorithm

1. The ID3 algorithm

Input: training data set D, feature set A, threshold ε

Output: decision tree T

Step 1 Determine whether T requires feature selection to generate the decision tree

Case 1 If all instances in D belong to the same class C_k, then T is a single-node tree, C_k is recorded as the class label of this node, and T is returned.

Case 2 If the feature set is empty (A = ∅), then T is a single-node tree; the class C_k with the most instances in D is taken, by majority vote, as the class label of this node, and T is returned.

Step 2 Otherwise, calculate the information gain of each feature in A and select the feature A_g with the maximum information gain

Case 1 If the information gain of A_g is less than ε, then T is a single-node tree; the class C_k with the largest number of instances in D is recorded as the class label of this node, and T is returned.

Case 2 Otherwise, for each possible value a_i of A_g, split D into non-empty subsets D_i; for each D_i, mark the class with the largest number of instances, construct the child nodes, build T from the node and its children, and return T

Step 3 For the i-th child node, take D_i as the training set and A - A_g as the feature set, and recursively call the above steps to obtain the subtree T_i
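For reference, here is a compact, self-contained Python sketch of the procedure above (my own naming; the real code linked at the end of the article may differ). The data set is assumed to be a list of dicts mapping feature name to value, with a parallel list of class labels; leaves are class labels and internal nodes are nested dicts.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels (log base 2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """g(D, A) = H(D) - H(D|A) for one feature column `values`."""
    n = len(labels)
    h_cond = sum(
        (values.count(v) / n) * entropy([y for x, y in zip(values, labels) if x == v])
        for v in set(values)
    )
    return entropy(labels) - h_cond

def id3(dataset, labels, features, epsilon=0.001, criterion=info_gain):
    """ID3 sketch. Returns a class label (leaf) or a nested dict {feature: {value: subtree}}."""
    # Step 1, case 1: all instances share one class -> single-node tree
    if len(set(labels)) == 1:
        return labels[0]
    # Step 1, case 2: no features left -> majority vote
    if not features:
        return Counter(labels).most_common(1)[0][0]

    # Step 2: score each remaining feature and pick the best one
    scores = {f: criterion([row[f] for row in dataset], labels) for f in features}
    best = max(scores, key=scores.get)
    # Step 2, case 1: best score below the threshold -> single-node tree
    if scores[best] < epsilon:
        return Counter(labels).most_common(1)[0][0]

    # Step 2, case 2 and step 3: split D into subsets D_i on the best feature and recurse
    tree = {best: {}}
    for v in set(row[best] for row in dataset):
        idx = [i for i, row in enumerate(dataset) if row[best] == v]
        tree[best][v] = id3([dataset[i] for i in idx],
                            [labels[i] for i in idx],
                            [f for f in features if f != best],
                            epsilon, criterion)
    return tree
```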

Dizzy? 😏 An example makes it much simpler ~

Example explanation

We continue with the same example as before.

Input: Training data set D

Output: decision tree T.

Step 1 Determine whether T requires feature selection to generate the decision tree

  • The final classification of the training set has 2 classes
  • The training set has four features: age, having a job, owning a house, and credit status

So T requires feature selection to generate the decision tree.

Step 2 Calculate the information gain of each feature in A, and select the feature A_g with the maximum information gain

We can set ε = 0.001. All of the information gain values calculated above are greater than ε, so T is not a single-node tree.

We use the information gain of each feature as calculated above.

Therefore, the feature with the largest information gain, owning a house, is chosen as the root node, and the training set is split according to it into D_1 and D_2.

In the subset D_1 (owns a house), every instance agrees to the loan, so it becomes a single leaf node. In the subset D_2 (does not own a house), there are both 'agree' and 'disagree' instances, so the feature for the next internal node must be found by feature selection (calculate which of the remaining three features has the maximum information gain).

First of all, age:

From the table, there are 9 instances without their own house, among which 3 agree to the loan and 6 do not (all 6 'no' instances fall into this subset, since everyone who owns a house agrees). The empirical entropy is then

H(D_2) = -\frac{3}{9}\log_2\frac{3}{9} - \frac{6}{9}\log_2\frac{6}{9} \approx 0.918

i = 1, youth:

i = 2, middle age:

You can see that every instance in this subset disapproves of the loan, so the entropy of this subset is zero.

i = 3, old age:

The final empirical conditional entropy H(D_2 | A_1) is as follows:

The information gain of age A_1 is:

Similarly, calculate the information gain of having a job A_2 and of credit status A_4, and list them in the following table:

By comparison, the feature with the largest information gain is having a job, so we choose it as the next feature.

Splitting on the 'has a job' feature: every instance with a job agrees to the loan, and every instance without a job does not, so no further classification is needed under either branch.

Both branches are now single leaf nodes, so the generation of the decision tree ends here ~ 🎈
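Putting the example together, the finished tree uses only two of the four features and can be read directly as if-then rules. A small sketch in Python (function and parameter names are my own):

```python
def approve_loan(owns_house: bool, has_job: bool) -> bool:
    """The decision tree generated in the example above, written as if-then rules."""
    if owns_house:      # root node: owns a house -> leaf: approve the loan
        return True
    if has_job:         # internal node: has a job -> leaf: approve the loan
        return True
    return False        # no house and no job -> leaf: reject the loan
```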

2. The C4.5 algorithm

If you know the ID3 algorithm, the C4.5 algorithm is super simple.

The C4.5 algorithm is similar to ID3; the difference is that C4.5 uses the information gain ratio to select features, choosing the feature with the maximum information gain ratio as the optimal feature.
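As a rough sketch of that difference (this fragment is not standalone: it reuses the hypothetical id3() helper from the ID3 sketch and the gain_ratio() helper from the information gain ratio sketch above):

```python
def c45(dataset, labels, features, epsilon=0.001):
    # Identical to the id3() sketch above, except that candidate features
    # are ranked by information gain ratio rather than information gain.
    return id3(dataset, labels, features, epsilon, criterion=gain_ratio)
```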

The code for the above examples can be found at 👉 gitee.com/xin-yue-qin…

Advice and corrections from more experienced readers are welcome ~

Resources: Dr. Jane – Decision Tree generation