• Author: Han Xinzi @ShowMeAI
  • Tutorial address: www.showmeai.tech/tutorials/3…
  • Article address: www.showmeai.tech/article-det…
  • Statement: All rights reserved. For reprints, please contact the platform and the author and indicate the source.

Introduction

Decision Tree is a classical classification and regression algorithm in machine learning. In this article we discuss the principles of decision trees for classification. The decision tree model is a tree structure; in classification problems, a decision tree can be regarded as a set of if-then rules. The model is readable and fast at classification, and is widely used in all kinds of practical business modeling.

(This article involves a lot of machine learning basics. Readers without this background can first read the ShowMeAI article Illustrated Machine Learning | Machine Learning Basics.)

1. The core idea of decision tree algorithm

1) Decision tree structure and core idea

A decision tree is a common supervised classification algorithm: based on the known situations (feature values), it makes decisions by constructing a tree-shaped decision structure.

The decision tree model simulates the human decision-making process. Take buying clothes as an example: a customer is picking out pants in a shop.

Decision tree is a prediction model which represents the mapping relationship between object attributes and object values. A decision tree is a tree structure in which:

  • Each internal node represents a test on an attribute
  • Each branch represents an outcome of the test
  • Each leaf node represents a class

As in the clothes-buying example above, the first “internal node” corresponds to a test on the attribute “material”, with two branches for the possible values “denim” and “non-denim”. If the value is “denim”, the next attribute “pants type” is tested; if the value is “non-denim”, the branch leads directly to the leaf node “do not buy”.

The core components of the decision tree model are as follows:

  • Nodes and directed edges.
  • There are two types of nodes: internal nodes and leaf nodes.
  • An internal node represents a feature (attribute), and a leaf node represents a class.

2) The history of decision trees

Over the development of decision trees, many different models have appeared; typical ones include ID3, C4.5 and CART. Below we briefly introduce the different models in this history.

2. Decision tree growth and optimal attribute selection

From the history of decision trees introduced above, you have a basic understanding of the different decision tree models. In the next section, we will take a look at how decision trees grow and form.

1) Decision tree growth process

The decision process of a decision tree starts from the root node: the corresponding feature attribute of the item to be classified is tested, an output branch is selected according to its value, and this repeats until a leaf node is reached, whose class is taken as the decision result. Simply put, a decision tree runs a recursion from root to leaf, looking for a “split/test” attribute at each internal node.

The pseudocode below describes the decision tree growth (building) process in detail. Pay special attention to the three types of termination conditions and their returned results, and note that the core step of the whole process is “optimal partition attribute selection”.

A decision tree stops growing at a node under three conditions: all samples at the node belong to the same class; the attribute set is empty, or all samples take the same values on the remaining attributes; or the sample subset at the node is empty.
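A minimal Python sketch of this recursive growth process (an illustrative assumption: samples are dicts of attribute values plus a “label” key, and best_attribute is a placeholder for the optimal-attribute selection discussed next):

```python
from collections import Counter

def majority_class(samples):
    """Most common class label among the samples."""
    return Counter(s["label"] for s in samples).most_common(1)[0][0]

def tree_generate(samples, attributes, domains, best_attribute):
    """Recursive decision tree growth with the three standard stopping conditions."""
    labels = {s["label"] for s in samples}
    # (1) All samples belong to the same class: return a leaf with that class.
    if len(labels) == 1:
        return labels.pop()
    # (2) The attribute set is empty, or all samples agree on every remaining
    #     attribute: return a leaf with the majority class.
    if not attributes or all(len({s[a] for s in samples}) == 1 for a in attributes):
        return majority_class(samples)
    # Core step: choose the optimal partition attribute (see the next section).
    a = best_attribute(samples, attributes)
    node = {}
    for value in domains[a]:  # iterate over all possible values of attribute a
        subset = [s for s in samples if s[a] == value]
        # (3) Empty branch: leaf labelled with the parent node's majority class.
        if not subset:
            node[value] = majority_class(samples)
        else:
            node[value] = tree_generate(
                subset, [x for x in attributes if x != a], domains, best_attribute
            )
    return {a: node}
```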

2) Optimal attribute selection

Now let’s look at how optimal partition property selection for a decision tree works.

(1) Information entropy

To understand “optimal partition attribute selection” for decision trees, we first need a concept from information theory: information entropy (for the mathematical background, see the ShowMeAI article Illustrated AI Mathematical Foundations | Information Theory). Entropy measures the amount of information needed to remove uncertainty, i.e., how much information an unknown event may contain, and it can be used to measure the “purity” of a sample set.

In machine learning terms, suppose the current data set $D$ contains $\left | y \right |$ classes and the proportion of samples of the $k$-th class is $p_{k}$. The information entropy is then calculated as follows:


$$\operatorname{Ent}(D) = -\sum_{k=1}^{\left | y \right |} p_{k} \log_{2} p_{k}$$

When some $p_{k}$ equals 1, the information entropy is 0 (probability 1 means a certain event with no uncertainty). When the $p_{k}$ are uniformly distributed, the information entropy reaches its maximum $\log_{2} \left | y \right |$ (all candidates are equally probable and the uncertainty is greatest).
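A minimal Python sketch of this formula (assuming class labels are given as a plain list):

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy Ent(D) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

# Example: 8 good melons and 9 bad melons out of 17 samples
print(entropy(["good"] * 8 + ["bad"] * 9))  # ≈ 0.998
```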

(2) Information gain

With information entropy understood, we can move on to Information Gain, which measures the change in information entropy when a certain attribute is used to split the data (it can be understood as the reduction in uncertainty brought by this split).


$$\operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$$

Information gain describes how much information a feature brings. In decision tree classification, information gain is the difference between the entropy before and after splitting on an attribute. The classical decision tree algorithm ID3 selects the splitting attribute (feature) at each node based on information gain.
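Continuing the sketch above (same assumption: plain Python lists, with a parallel list of attribute values), information gain for a discrete attribute can be computed roughly like this:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Gain(D, a): entropy before the split minus the weighted entropy after it."""
    total = len(labels)
    subsets = defaultdict(list)
    for value, label in zip(attr_values, labels):
        subsets[value].append(label)
    after = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - after

# Hypothetical toy example: attribute values paired with class labels
print(information_gain(["green", "green", "black", "black"],
                       ["good", "bad", "good", "good"]))  # ≈ 0.311
```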

Take the watermelon data set as an example.

  • The data set is divided into good melons and bad melons, so $\left | y \right | = 2$.
  • The root node contains 17 training samples, of which 8 are good melon samples, accounting for 8/17.
  • There were 9 samples of bad melon, accounting for 9/17.

The information entropy of the root node can be obtained by substituting the data into the information entropy formula.
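Substituting these counts (a worked calculation, rounded to three decimals):

$$\operatorname{Ent}(D)=-\left(\frac{8}{17} \log_{2} \frac{8}{17}+\frac{9}{17} \log_{2} \frac{9}{17}\right) \approx 0.998$$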

Take the attribute “color” as an example; its three corresponding data subsets are as follows:

  • D1 (color = green) contains the samples {1, 4, 6, 10, 13, 17}, 6 samples in total. The good melon samples are {1, 4, 6}, a proportion of 3/6, and the bad melon samples are {10, 13, 17}, a proportion of 3/6. Substituting into the entropy formula gives the information entropy of this subset.

  • D2 (color = black) contains the samples {2, 3, 7, 8, 9, 15}, 6 samples in total. The good melon samples are {2, 3, 7, 8}, a proportion of 4/6, and the bad melon samples are {9, 15}, a proportion of 2/6. Substituting into the entropy formula gives the information entropy of this subset.

  • D3 (color = plain) contains the samples {5, 11, 12, 14, 16}, 5 samples in total. The good melon sample is {5}, a proportion of 1/5, and the bad melon samples are {11, 12, 14, 16}, a proportion of 4/5. Substituting into the entropy formula gives the information entropy of this subset.

The information gain of color attribute is:
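Working this out from the proportions above (a worked calculation, values rounded to three decimals):

$$\operatorname{Ent}(D^{1})=-\left(\frac{3}{6} \log_{2} \frac{3}{6}+\frac{3}{6} \log_{2} \frac{3}{6}\right)=1.000$$

$$\operatorname{Ent}(D^{2})=-\left(\frac{4}{6} \log_{2} \frac{4}{6}+\frac{2}{6} \log_{2} \frac{2}{6}\right) \approx 0.918$$

$$\operatorname{Ent}(D^{3})=-\left(\frac{1}{5} \log_{2} \frac{1}{5}+\frac{4}{5} \log_{2} \frac{4}{5}\right) \approx 0.722$$

$$\operatorname{Gain}(D, \text{color})=0.998-\left(\frac{6}{17} \times 1.000+\frac{6}{17} \times 0.918+\frac{5}{17} \times 0.722\right) \approx 0.109$$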

In the same way, the information gain of other attributes is calculated as:

By comparing the information gains of the different attributes, we find that “texture” has the largest information gain, so it is selected as the splitting attribute. Its three branches are: clear = {1, 2, 3, 4, 5, 6, 8, 10, 15}, slightly blurry = {7, 9, 13, 14, 17}, and blurry = {11, 12, 16}.

Going one step further, consider the branch where “texture” = “clear”. This node contains the sample set D1 = {1, 2, 3, 4, 5, 6, 8, 10, 15}, 9 samples in total, and the available attribute set is {color, root, tap, umbilical, touch} (“texture” is no longer used as a splitting attribute). We calculate the information gain of each remaining attribute in the same way:

As can be seen from the figure above, the three attributes “root”, “umbilical” and “touch” achieve the largest information gain, and any one of them can be used as the splitting attribute. By performing similar operations on each branch node, the final decision tree is obtained.

(3) Information Gain Ratio

You have now seen information gain as a feature selection criterion, but it has a problem: it favors features with many distinct values. The reason is that when a feature has many values, splitting on it more easily produces subsets with higher purity, so the entropy after the split is lower; since the entropy before the split is fixed, the information gain is larger, and information gain therefore tends to prefer such features.

Is there a remedy for this issue? Yes: the Gain Ratio. Compared with information gain, the gain ratio adds a denominator that measures how spread out the attribute's own values are, and the well-known decision tree algorithm C4.5 uses it as the criterion for splitting attribute selection.

The gain ratio is calculated as follows:


$$\operatorname{Gain\_ratio}(D, a)=\frac{\operatorname{Gain}(D, a)}{\operatorname{IV}(a)}$$

$$\operatorname{IV}(a)=-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \log_{2} \frac{\left|D^{v}\right|}{|D|}$$
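Extending the earlier sketch (and reusing its information_gain function), the intrinsic value IV(a) and the gain ratio look roughly like this:

```python
import math
from collections import Counter

def intrinsic_value(attr_values):
    """IV(a): entropy of the attribute's own value distribution."""
    total = len(attr_values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(attr_values).values())

def gain_ratio(attr_values, labels):
    # information_gain() is the function defined in the earlier sketch.
    return information_gain(attr_values, labels) / intrinsic_value(attr_values)
```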

(4) Gini index

In addition to the entropy-based definitions above, we can also use the Gini index to express the impurity of a data set: the larger the Gini index, the less pure the data set.

The detailed calculation method of Gini Index is as follows:


$$\operatorname{Gini}(D)=\sum_{k=1}^{|y|} \sum_{k^{\prime} \neq k} p_{k} p_{k^{\prime}}=1-\sum_{k=1}^{|y|} p_{k}^{2}$$

Here $p_{k}$ denotes the proportion of class-$k$ samples in the data set. The well-known decision tree algorithm CART uses the Gini index to select attributes (CART itself builds a binary tree, which differs from ID3 and C4.5 above).

One intuition for the Gini index, and the reason it can serve as a purity measure, is the following: imagine drawing balls from a dark bag containing balls of different colors, where the proportion of balls of color $k$ is $p_{k}$. The probability of drawing a color-$k$ ball twice in a row is $p_{k}^{2}$, so the probability that two draws give different colors is $1-\sum_{k} p_{k}^{2}$. The lower this value, the less likely the two balls differ in color, and the higher the purity of the bag.
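For completeness, a matching Python sketch of the Gini index under the same assumptions as before:

```python
from collections import Counter

def gini_index(labels):
    """Gini(D) = 1 - sum_k p_k^2; lower means purer."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

# Example: 8 good and 9 bad melons
print(gini_index(["good"] * 8 + ["bad"] * 9))  # ≈ 0.498
```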

3. Overfitting and pruning

If we let a decision tree keep growing without limit, we may end up with a huge tree that overfits, because it has learned too much from the raw training data. Decision tree overfitting can be alleviated by pruning, and pruning methods fall into two categories: pre-pruning and post-pruning.

1) Decision tree and pruning operation

In order to classify the training samples as correctly as possible, a tree may grow too many branches and overfit. Overfitting means good performance on the training set but poor performance on the test set, i.e., poor generalization. The risk of overfitting can be reduced by proactively pruning some branches, and the hold-out method can be used to evaluate the decision tree before and after pruning.

The basic strategies include “pre-pruning” and “post-pruning” :

  • Pre-pruning: In the growth process of the decision tree, each node is evaluated before splitting. If splitting the current node fails to improve the generalization performance of the decision tree, the split is stopped and the current node is marked as a leaf node.

  • Post-pruning: a complete decision tree is generated from the training set first, and then non-leaf nodes are investigated from bottom to top. If replacing the subtree corresponding to the node with a leaf node can improve the generalization performance of the decision tree, the subtree is replaced with a leaf node.

2) Pre-pruning and post-pruning cases

As an example, the following data set is partitioned as a validation set to evaluate the performance of the decision tree model.

A complete decision tree generated on the above watermelon data set is shown in the figure below.

(1) pre-pruning

The “pre-pruning” process is as follows. First consider whether to split the root node: if it is not split, it is marked as a leaf node, and its class is set to the majority class among the training samples.

  • If “good melon” is chosen, the validation samples {4, 5, 8} are correctly classified, and the validation accuracy is 3/7 × 100% = 42.9%

  • After the split, nodes ②, ③ and ④ are marked as “good melon”, “good melon” and “bad melon” respectively, according to their training samples. At this point the validation samples numbered {4, 5, 8, 11, 12} are correctly classified, and the validation accuracy is 5/7 × 100% = 71.4%

    • Node ② (good melon): correctly classified: {4, 5}; misclassified: {13}
    • Node ③ (good melon): correctly classified: {8}; misclassified: {9}
    • Node ④ (bad melon): correctly classified: {11, 12}

If the validation accuracy decreases after a split, the split is rejected. Pruning judgments are made for nodes ②, ③ and ④ respectively: the splits of nodes ② and ③ are forbidden, and node ④ is itself a leaf node.

Following the pre-pruning procedure, only one layer of the decision tree is generated here. The result, a decision tree with only one layer, is called a decision stump.

(2) post pruning

We now perform post-pruning on the complete decision tree generated above:

  • The complete decision tree is evaluated with the validation set: samples {4, 11, 12} are correctly classified, while samples {5, 8, 9, 13} are misclassified, giving an accuracy of 42.9%.

  • During post-pruning, node ⑥ is first marked as a “good melon” leaf: samples {4, 8, 11, 12} are then correctly classified and samples {5, 9, 13} are misclassified, giving an accuracy of 57.1%.

After pruning, the accuracy is improved, so the decision tree needs to be pruned at the node ⑥.

Considering node ⑤: if it is replaced by a leaf node, it is marked as “good melon” according to its training samples {6, 7, 15}. The validation accuracy remains 57.1%, so pruning is not necessary.

Considering node ②: if it is replaced by a leaf node, it is marked as “good melon” according to its training samples {1, 2, 3, 14}. The validation accuracy rises to 71.4%, so this subtree is pruned.

For nodes ③ and ①, replacing their subtrees with leaf nodes gives validation accuracies of 71.4% and 42.9% respectively; neither is an improvement, so they are not pruned. This yields the final pruned decision tree.

3) Characteristics of pre-pruning and post-pruning

Time cost:

  • Pre-pruning: training time cost reduced, test time cost reduced.
  • Post pruning: training time cost increased, test time cost decreased.

Over/under fitting risk:

  • Pre-pruning: the risk of over-fitting is reduced and the risk of under-fitting is increased.
  • Post-pruning: over-fitting risk is reduced, under-fitting risk is basically unchanged.

Generalization performance: Post-pruning is usually superior to pre-pruning.
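As a practical, hedged illustration (a scikit-learn sketch on a toy dataset rather than the watermelon example above; scikit-learn's tree is CART-based, and the parameter values here are purely illustrative): depth and leaf-size constraints play the role of pre-pruning, while cost-complexity pruning via ccp_alpha is a post-pruning step.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# "Pre-pruning": stop growth early via depth / leaf-size constraints
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# "Post-pruning": grow a full tree, then prune via cost-complexity (ccp_alpha)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(ccp_alpha=path.ccp_alphas[-2], random_state=0)  # illustrative alpha choice
post_pruned.fit(X_train, y_train)

print("pre-pruned validation accuracy:", pre_pruned.score(X_val, y_val))
print("post-pruned validation accuracy:", post_pruned.score(X_val, y_val))
```

In practice the ccp_alpha value would be chosen by evaluating each candidate alpha on a validation set, mirroring the hold-out evaluation described above.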

4. Processing of continuous and missing values

1) Continuous value processing

The data we used for learning includes both continuous value features and discrete value features. The previous examples used discrete value attributes (features). Of course, the decision tree can also process continuous value attributes.

For discrete-valued features, the decision tree's splitting method is to select the most appropriate feature attribute, split the set into multiple subsets according to the different values of that attribute, and repeat this process recursively.

For continuous-valued attributes, we obviously cannot split directly on every distinct value, otherwise each continuous value would get its own branch. So how do we make continuous-valued attributes participate in building a decision tree?

Since a continuous attribute no longer has a limited number of values, it needs to be discretized. The most commonly used discretization strategy is bi-partition (dichotomy), which is the approach used in C4.5.

The specific dichotomy is shown in the figure below:

Note: unlike a discrete attribute, if the current node splits on a continuous attribute, that attribute can still be used as a splitting attribute by its descendant nodes.
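A minimal sketch of the bi-partition idea (assuming a list of numeric values with class labels; candidate thresholds are the midpoints between adjacent sorted values, and the one with the largest information gain is kept):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Bi-partition a continuous attribute: try midpoints of adjacent sorted
    values and return (best_threshold, best_information_gain)."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(pairs)
    best_t, best_gain = None, -1.0
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical values give no new candidate threshold
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical toy example: four numeric values with class labels
print(best_split_point([0.2, 0.4, 0.5, 0.7], ["bad", "bad", "good", "good"]))  # (0.45, 1.0)
```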

2) Missing value processing

The original data often have missing values, and the decision tree algorithm can also effectively process the data with missing values. When using decision tree modeling, two problems need to be solved to deal with missing values:

  • Q1: How to select partition attributes?

  • Q2: Given a partition attribute, if the value of the sample on the attribute is missing, how to divide?

The basic idea of missing value handling is: assign weights to samples and divide samples by weight. Let's take a look at the watermelon data set with missing values below.

Only the samples with no missing value on an attribute are used to judge the quality of that partition attribute. At the beginning of learning, the root node contains all 17 samples in sample set D, each with weight 1.

  • When the “color” attribute is selected for the root node, there are 3 missing values, so the total number of samples is 14.

  • The good melon samples are {2, 3, 4, 6, 7, 8}, a proportion of 6/14, and the bad melon samples are {9, 10, 11, 12, 14, 15, 16, 17}, a proportion of 8/14.

The information entropy of the node can be obtained by substituting the data into the information entropy calculation formula.

Let $\tilde{D}^{1}$, $\tilde{D}^{2}$, $\tilde{D}^{3}$ denote the subsets of samples whose attribute “color” is “green”, “black” and “plain” respectively:

  • $\tilde{D}^{1}$ (color = green) contains the samples {4, 6, 10, 17}, 4 samples in total. The good melon samples are {4, 6}, a proportion of 2/4, and the bad melon samples are {10, 17}, a proportion of 2/4. Substituting into the entropy formula gives the information entropy of this subset.

  • $\tilde{D}^{2}$ (color = black) contains the samples {2, 3, 7, 8, 9, 15}, 6 samples in total. The good melon samples are {2, 3, 7, 8}, a proportion of 4/6, and the bad melon samples are {9, 15}, a proportion of 2/6. Substituting into the entropy formula gives the information entropy of this subset.

  • $\tilde{D}^{3}$ (color = plain) contains the samples {11, 12, 14, 16}, 4 samples in total. There are no good melon samples (a proportion of 0/4), and the bad melon samples are {11, 12, 14, 16}, a proportion of 4/4. Substituting into the entropy formula gives the information entropy of this subset.

From this, the information gain of the attribute “color” on sample set D can be calculated. Comparing all attributes, Gain(D, texture) = 0.424 is the largest information gain, so “texture” is selected as the next partition attribute.
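A minimal sketch of this weighted idea (an assumption-heavy illustration of the C4.5-style approach, not the article's exact figures): let ρ be the weighted fraction of samples whose value on attribute a is not missing, compute the gain on that non-missing subset using sample weights, and scale it by ρ.

```python
import math
from collections import defaultdict

def weighted_entropy(labels, weights):
    """Entropy where each sample contributes its weight instead of a count of 1."""
    total = sum(weights)
    by_class = defaultdict(float)
    for lab, w in zip(labels, weights):
        by_class[lab] += w
    return -sum((w / total) * math.log2(w / total) for w in by_class.values() if w > 0)

def gain_with_missing(attr_values, labels, weights):
    """Gain(D, a) when attr_values may contain None: compute the gain on the
    non-missing subset and scale it by rho, the weighted fraction of
    non-missing samples."""
    known = [(v, lab, w) for v, lab, w in zip(attr_values, labels, weights) if v is not None]
    total_w = sum(weights)
    known_w = sum(w for _, _, w in known)
    rho = known_w / total_w
    base = weighted_entropy([lab for _, lab, _ in known], [w for _, _, w in known])
    subsets = defaultdict(lambda: ([], []))
    for v, lab, w in known:
        subsets[v][0].append(lab)
        subsets[v][1].append(w)
    cond = sum(sum(ws) / known_w * weighted_entropy(labs, ws) for labs, ws in subsets.values())
    return rho * (base - cond)
```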

For summaries of more supervised learning algorithms and models, see ShowMeAI's AI Knowledge & Skills Quick Reference | Machine Learning - Supervised Learning.

Video tutorial

You can check out the [bilingual subtitles] version of the video on Bilibili:

[MIT bilingual + materials download] 6.036 | Introduction to Machine Learning (2020 · complete version)

www.bilibili.com/video/BV1y4…

ShowMeAI related articles recommended

  • 1. Machine learning basics
  • 2. Model evaluation methods and criteria
  • 3.KNN algorithm and its application
  • 4. Detailed explanation of logistic regression algorithm
  • 5. Detailed explanation of naive Bayes algorithm
  • 6. Detailed explanation of decision tree model
  • 7. Detailed explanation of random forest classification model
  • 8. Detailed analysis of regression tree model
  • 9. Detailed explanation of GBDT model
  • 10. The XGBoost model is fully resolved
  • 11. Detailed explanation of LightGBM model
  • 12. Detailed explanation of support vector machine model
  • 13. Detailed explanation of clustering algorithm
  • 14.PCA dimension reduction algorithm in detail

ShowMeAI series tutorials recommended

  • Illustrated Python programming: From beginner to Master series of tutorials
  • Illustrated Data Analysis: From beginner to master series of tutorials
  • The mathematical Basics of AI: From beginner to Master series of tutorials
  • Illustrated Big Data Technology: From beginner to master
  • Illustrated Machine learning algorithms: Beginner to Master series of tutorials