Someday I will sum up my four years in college. Although it was not perfect, I really tried my best to do everything I wanted to do.
Algorithm idea
A decision tree is not just a binary tree; it can have multiple branches.
In short, it makes a judgment at each level, layer by layer, until it reaches the final result.
🥕 Tree structure
An ellipse is an internal node, representing a feature (attribute); the topmost node is the root node, where all branching starts.
A rectangle is a leaf node, representing a class.
If-then rules:
Information gain
A decision tree classifies by judging the feature at each node, layer by layer, until it reaches a final classification. Information gain exists to find the optimal feature: it tells us in which order to test the features so that the final judgment is the most reasonable.
1. Concept of entropy
Information gain is built on entropy. Entropy measures the uncertainty of a random variable: the greater the uncertainty, the greater the entropy.
Since entropy depends only on the distribution of the random variable, we get:
Entropy is maximal when all values of the random variable are equally likely. In other words, when every value of a feature is equally probable, the feature carries the most information, i.e., the greatest uncertainty.
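Written out with the standard definition: for a discrete random variable $X$ taking $n$ values with probabilities $p_1, \dots, p_n$,

$$
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i, \qquad 0 \le H(X) \le \log_2 n,
$$

and the upper bound is reached exactly when all $p_i = \frac{1}{n}$, i.e., the uniform case described above.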
2. Information gain
Input: training data set D and feature A.
Output: the information gain $g(D, A)$ of feature A with respect to training data set D.
🚶♂️ Three steps ~
Step 1: Calculate the empirical entropy $H(D)$ of data set D.
Step 2: Calculate the empirical conditional entropy $H(D|A)$ of data set D given feature A.
Step 3: Calculate the information gain $g(D, A)$.
Information gain measures how much the uncertainty about class Y is reduced by knowing feature A.
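Written out, the standard formulas behind the three steps are (with $C_k$ the classes in $D$, $D_i$ the subset of $D$ where feature $A$ takes its $i$-th value, and $D_{ik} = D_i \cap C_k$):

$$
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}
$$

$$
H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}
$$

$$
g(D, A) = H(D) - H(D|A)
$$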
That's a lot of formulas. Confused? 😅 An example will make it clear right away ~
3. Example
Input: Training data set D
Output: The optimal feature is selected by calculating the information gain
Step 1: Calculate the empirical entropy $H(D)$.
In this training data set, $|D| = 15$, and there are two final classes: "yes" (9 instances) and "no" (6 instances).
Substituting into the formula gives the empirical entropy:
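Plugging in the counts above (9 "yes" and 6 "no" out of 15 instances):

$$
H(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971
$$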
Step 2: Calculate the empirical conditional entropy $H(D|A)$.
Let $A_1$ = age, $A_2$ = has a job, $A_3$ = owns a house, $A_4$ = credit status.
(1) Age
Substituting, we calculate the entropy for the youth, middle-aged, and old-age subsets:
i = 1, youth:
i = 2, middle-aged:
i = 3, old age:
The final empirical conditional entropy $H(D|A_1)$:
Step 3: Calculate the information gain $g(D, A_1) = H(D) - H(D|A_1)$.
Similarly, we can calculate the information gain of the other features.
Following the same three steps, we summarize the results in a table:
As the table shows, the feature "owns a house" has the minimum empirical conditional entropy and the maximum information gain. Selecting this feature leaves the least uncertainty and gives the clearest classification, so it is chosen as the optimal feature 👍
A careful reader will notice that different features have different numbers of values: some have three (such as age: youth, middle-aged, old), and some have two (such as owning a house). Features with more values tend to yield larger information gains, so information gain is biased toward features with many values.
🤔 How can we reduce the effect of features having more values?
Don't worry, the information gain ratio is here to help.
Information gain ratio
Entropy of training data set D with respect to feature A, denoted $H_A(D)$:
Information gain ratio:
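The standard definitions (as used in C4.5), where $D_i$ is the subset of $D$ in which feature $A$ takes its $i$-th value, are:

$$
H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}, \qquad g_R(D, A) = \frac{g(D, A)}{H_A(D)}
$$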
So how do we calculate $H_A(D)$?
Take feature $A_1$ (age) as an example:
Similarly, we calculate the information gain ratio for features $A_2$, $A_3$, and $A_4$ and list them in a table:
Comparing the two features "has a job" and "credit status": from the perspective of information gain, credit status is the more appropriate choice for the optimal feature; from the perspective of information gain ratio, "has a job" is the more appropriate choice.
To sum up: information gain favors features with more values, while the information gain ratio favors features with fewer values.
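As a minimal Python sketch of the two criteria (this is my own illustration, not the repository code linked at the end; `rows` is assumed to be a list of feature tuples and `labels` the corresponding class labels):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, feature_index):
    """Information gain g(D, A) of the feature at feature_index."""
    total = len(labels)
    groups = {}                      # labels grouped by the feature's value
    for row, y in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(y)
    # Empirical conditional entropy H(D|A)
    cond = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - cond

def gain_ratio(rows, labels, feature_index):
    """Information gain ratio g_R(D, A) = g(D, A) / H_A(D)."""
    values = [row[feature_index] for row in rows]
    split_info = entropy(values)     # H_A(D): entropy of D w.r.t. feature A's values
    if split_info == 0:              # the feature takes only one value
        return 0.0
    return info_gain(rows, labels, feature_index) / split_info
```

With the loan data, `max(range(4), key=lambda j: info_gain(rows, labels, j))` would pick the information-gain-optimal feature, and swapping in `gain_ratio` gives the gain-ratio-based choice.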
Decision tree generation algorithm
1. The ID3 algorithm
Input: training data set $D$, feature set $A$, threshold $\varepsilon$
Output: decision tree $T$
Step 1: Judge whether feature selection is needed to generate the decision tree.
Case 1: If all instances in $D$ belong to the same class $C_k$, then $T$ is a single-node tree, $C_k$ is recorded as the class label of that node, and $T$ is returned.
Case 2: If the feature set is empty ($A = \emptyset$), then $T$ is a single-node tree; by majority vote, the class $C_k$ with the most instances in $D$ is recorded as the class label of that node, and $T$ is returned.
Step 2: Otherwise, calculate the information gain of each feature in $A$ and select the feature $A_g$ with the maximum information gain.
Case 1: If the information gain of $A_g$ is less than $\varepsilon$, then $T$ is a single-node tree, recording the class $C_k$ with the largest number of instances in $D$.
Case 2: Otherwise, for each possible value $a_i$ of $A_g$, split $D$ into non-empty subsets $D_i$, label each with the class that has the most instances in $D_i$, construct the corresponding child nodes, form $T$ from the node and its children, and return $T$.
Step 3: For the $i$-th child node, take $D_i$ as the training set and $A - \{A_g\}$ as the feature set, recursively call the above steps to obtain the subtree $T_i$, and return $T_i$.
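A rough recursive sketch of these steps (reusing `info_gain` and `Counter` from the sketch above; representing the tree as nested dicts is my own choice here, not something prescribed by the algorithm):

```python
def majority_class(labels):
    """Class with the most instances (majority vote)."""
    return Counter(labels).most_common(1)[0][0]

def id3(rows, labels, features, epsilon=0.001):
    """Build a tree as nested dicts {feature_index: {value: subtree}}.
    `features` lists the column indices still available for splitting."""
    # Step 1, Case 1: all instances belong to one class -> single-node tree.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 1, Case 2: no features left -> majority vote.
    if not features:
        return majority_class(labels)
    # Step 2: pick the feature with the maximum information gain.
    best = max(features, key=lambda j: info_gain(rows, labels, j))
    if info_gain(rows, labels, best) < epsilon:
        return majority_class(labels)
    # Step 3: split on the best feature and recurse on each non-empty subset.
    subsets = {}
    for row, y in zip(rows, labels):
        sub_rows, sub_labels = subsets.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(y)
    remaining = [j for j in features if j != best]
    return {best: {value: id3(s_rows, s_labels, remaining, epsilon)
                   for value, (s_rows, s_labels) in subsets.items()}}
```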
Dizzy? 😏 The example makes it much simpler ~
Example explanation
We use the same example as before.
Input: Training data set D
Output: decision tree $T$.
Step 1: Judge whether feature selection is needed to generate the decision tree.
- The training set has 2 final classes.
- The training set has four features: age, has a job, owns a house, and credit status.
Since the instances do not all belong to one class and the feature set is not empty, generating $T$ requires feature selection.
Step 2: Calculate the information gain of each feature in $A$ and select the feature $A_g$ with the maximum information gain.
We can set $\varepsilon = 0.001$. All the information gain values calculated above are greater than $\varepsilon$, so $T$ is not a single-node tree.
Using the information gain values calculated above:
The feature with the largest information gain, "owns a house", becomes the root node, and the training set is split by it into $D_1$ (owns a house) and $D_2$ (does not own a house).
For the subset that owns a house ($D_1$), all instances agree to the loan, so it becomes a single leaf node. For the subset without a house ($D_2$), there are both "agree" and "disagree" instances, so we must select a feature for this internal node by computing the maximum information gain over the remaining three features.
First of all, age:
From the "owns a house" column of the table above, there are 9 instances that do not own a house; among them, 6 agree to the loan and 3 do not. The empirical entropy is then:
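Using the counts just stated (6 + 3 = 9 instances):

$$
H(D_2) = -\frac{6}{9}\log_2\frac{6}{9} - \frac{3}{9}\log_2\frac{3}{9} \approx 0.918
$$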
i = 1, youth:
i = 2, middle-aged:
For this subset, all instances disapprove of the loan, so its entropy is zero.
i = 3, old age:
The final empirical conditional entropy $H(D_2|A_1)$ is:
The information gain of age ($A_1$) is:
Similarly, calculate the information gain of "has a job" ($A_2$) and "credit status" ($A_4$) and list them in the following table:
By comparison, the feature with the largest information gain is "has a job", so we choose it as the next feature.
Under the "has a job" feature, every instance with a job agrees to the loan, and every instance without a job does not; neither branch needs further splitting.
Both branches satisfy the single-leaf condition, so the decision tree generation ends here ~🎈
2. The C4.5 algorithm
Once you know the ID3 algorithm, the C4.5 algorithm is super simple.
C4.5 is similar to ID3; the difference is that C4.5 uses the information gain ratio to select features, choosing the feature with the maximum information gain ratio as the optimal feature.
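In terms of the sketch above, the only change is the scoring function in Step 2 (assuming the same `info_gain` and `gain_ratio` helpers):

```python
# ID3: choose the feature with the maximum information gain.
best = max(features, key=lambda j: info_gain(rows, labels, j))

# C4.5: choose the feature with the maximum information gain ratio instead.
best = max(features, key=lambda j: gain_ratio(rows, labels, j))
```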
The code for the above examples can be found at 👉 gitee.com/xin-yue-qin…
Suggestions and corrections are welcome ~
Resources: Dr. Jane – Decision Tree generation