Alitao – Poirot
As one of the most classical machine learning algorithms, the decision tree (together with the ensemble methods that use it as their basic decision unit, such as random forest, gradient boosting trees, and XGBoost) still plays an important role in the family of supervised learning algorithms, even in today's era of popular neural networks.
A decision tree is a prediction model that makes decisions based on a tree structure. Each node in the tree represents an object, each branching path represents a possible attribute value, and each leaf node corresponds to the value of the object described by the path from the root node to that leaf. The following sections explain the basic principles of decision trees and demonstrate how a basic decision tree model works through a classic example.
1. Empirical Risk Minimization
For supervised learning tasks, we usually have the following training set and objective. Training set: S = {(x1, y1), ..., (xn, yn)}. Objective: a mapping h: X -> Y (the hypothesis h) that performs well on the actual task T.
We don't know in advance how the model will perform in a real-world scenario, so we typically evaluate its performance on a labeled training set. The goal of the learning algorithm is to find, within a set of hypotheses, the one with the minimum mean loss on the data set:
R(h) = E[L(h(x), y)], where E denotes the expectation and L the loss function.
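To make R(h) concrete, here is a minimal Python sketch of its empirical counterpart, the average loss of a hypothesis over the training set S; the function names and the choice of 0/1 loss are illustrative assumptions, not part of the original text.

```python
def zero_one_loss(prediction, label):
    """0/1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return 0 if prediction == label else 1

def empirical_risk(h, S, loss=zero_one_loss):
    """Average loss of hypothesis h over the training set S = [(x1, y1), ..., (xn, yn)]."""
    return sum(loss(h(x), y) for x, y in S) / len(S)

# Example: a trivial hypothesis that always predicts "Yes".
S = [({"Outlook": "Sunny"}, "No"), ({"Outlook": "Overcast"}, "Yes")]
print(empirical_risk(lambda x: "Yes", S))  # 0.5
```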
All of this assumes, of course, that the training set is sufficiently representative of the real situation. In practice, training sets are often not representative of the real world, which leads to overfitting of the model, much like the difference between a problem set with answers and a final exam.
2. Decision Tree Algorithm Principle
A decision tree is a directed acyclic graph in which each non-leaf node is associated with an attribute, each edge represents one value of its parent node's attribute, and each leaf node is associated with a label. A very classic example helps illustrate the basic principle of the decision tree algorithm. Below is a training set of weather conditions and whether tennis was played (a 4-dimensional input and a 1-dimensional output).
In this data set, Outlook has the states Sunny, Overcast, and Rain; Temp has Hot, Mild, and Cold; Humidity has High and Normal; and Wind has Weak and Strong. There are therefore 3 x 3 x 2 x 2 = 36 possible input combinations, and each combination can be mapped to one of two predictions, Yes or No. For this problem, then, there are 2^36 different hypotheses a decision tree model could represent. Which of these hypotheses is the most suitable?
2.1 Ockham’s Razor
In the previous section, we pointed out that there are 2^36 different decision tree models for this training set. Among these possible models, how many achieve completely correct predictions on the 14 samples (assuming the samples are not contradictory)? The 14 samples fix the output for 14 of the 36 input combinations, while the remaining 22 combinations can still be mapped to either Yes or No, so the answer is 2^(36-14) = 2^22.
William of Ockham, a 14th-century Franciscan friar, formulated a classic principle of logic: entities should not be multiplied beyond necessity. In other words, everything should be kept as simple as possible. Combined with the empirical risk minimization discussed earlier, this principle forms the cornerstone of decision tree construction. Since finding the optimal tree is an NP-hard problem, we use a greedy, recursive strategy, starting at the root node, to pick out a succinct model among the 2^22 candidates that fit the data (a minimal code sketch of this recursion follows the list below):
- Select a good attribute as a node.
- According to the values of this attribute (for example, Wind takes the values Strong and Weak), split the node into different child nodes along the corresponding edges.
- If all samples at a node have the same label, the branch ends there. (In other words, if all samples are labeled the same, a single label summarizes them and there is no need to divide further.)
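The greedy, recursive growth procedure above can be sketched as follows. This is a minimal illustration, assuming samples stored as (feature-dict, label) pairs and a pluggable score function (information gain or Gini impurity reduction, defined in the next section); the function and variable names are illustrative, not part of the original text.

```python
from collections import Counter

def grow_tree(samples, attributes, score):
    """Greedy recursive construction sketch.
    samples:    list of (features_dict, label) pairs
    attributes: attribute names still available for splitting
    score:      rates how well an attribute separates the labels at this node
    """
    labels = [y for _, y in samples]
    # Stop if every sample carries the same label (or no attributes remain):
    # a single label summarizes this branch.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Greedy step: pick the attribute that best separates the labels here.
    best = max(attributes, key=lambda a: score(samples, a))

    # Split along the values of the chosen attribute and recurse.
    children = {}
    for value in {x[best] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[best] == value]
        children[value] = grow_tree(subset, [a for a in attributes if a != best], score)
    return (best, children)
```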
2.2 Gini impurity and cross entropy loss
In the growth strategy of the decision tree, we mentioned that at each step a suitable attribute must be selected as a node; what we actually need is for the children of this node to separate the labels as well as possible (that is, to reduce the randomness of the classification). For example, suppose attribute A splits along the values 0 and 1 and the samples at its child nodes are labeled (4 Yes, 4 No) and (3 Yes, 3 No), while attribute B splits along the values 0 and 1 and its child nodes are labeled (8 Yes, 0 No) and (0 Yes, 6 No). It is obvious that splitting on B is more discriminating. In general, two metrics are used to quantify this separation: Gini impurity and information gain. In principle, the decision tree should grow in the direction that decreases the Gini index or the cross-entropy loss fastest.
Suppose the labels at a node are distributed with counts n1, n2, ..., nL (so the node contains n = n1 + n2 + ... + nL samples in total).
The cross entropy loss can be described as:
H(S) = (n1/n) * log(n/n1) + (n2/n) * log(n/n2) + ... + (nL/n) * log(n/nL)
By taking the difference between the cross-entropy loss of the parent node S and the weighted cross-entropy losses of its child nodes v1, ..., vL (where ni now denotes the number of samples falling into child node vi), we obtain the information gain:
IG(S) = H(S) - (n1/n) * H(v1) - (n2/n) * H(v2) - ... - (nL/n) * H(vL)
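Translated into code, the two formulas above look like the following minimal sketch, applied to the A-versus-B example from the start of this section (note that those two splits imply slightly different parent distributions, 7 Yes / 7 No versus 8 Yes / 6 No, so each is scored against its own implied parent); the function names are illustrative.

```python
import math

def entropy(counts):
    """H(S) for label counts [n1, ..., nL]; empty classes contribute nothing."""
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """IG(S) = H(S) minus the size-weighted entropies of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Attribute A: children (4 Yes, 4 No) and (3 Yes, 3 No) -> implied parent (7 Yes, 7 No)
print(information_gain([7, 7], [[4, 4], [3, 3]]))  # 0.0: A tells us nothing
# Attribute B: children (8 Yes, 0 No) and (0 Yes, 6 No) -> implied parent (8 Yes, 6 No)
print(information_gain([8, 6], [[8, 0], [0, 6]]))  # ~0.985: B separates the labels perfectly
```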
Similarly, the Gini index can be described as:
Gini(S) = 1 - (n1/n)^2 - (n2/n)^2 - ... - (nL/n)^2
Similarly, the Gini impurity reduction can be obtained from the difference between the Gini index of the parent node S and the weighted Gini indexes of the child nodes v:
GiniImp(S) = Gini(S) - (n1/n) * Gini(v1) - (n2/n) * Gini(v2) - ... - (nL/n) * Gini(vL)
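The same A-versus-B comparison can be scored with the Gini index; again, a minimal sketch with illustrative names.

```python
def gini(counts):
    """Gini(S) = 1 - sum_i (n_i / n)^2 for label counts [n1, ..., nL]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_reduction(parent_counts, child_counts):
    """GiniImp(S) = Gini(S) minus the size-weighted Gini indexes of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child) for child in child_counts)
    return gini(parent_counts) - weighted

print(gini_reduction([7, 7], [[4, 4], [3, 3]]))  # 0.0
print(gini_reduction([8, 6], [[8, 0], [0, 6]]))  # ~0.49
```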
Returning to the tennis example, a complete decision tree that uses Gini impurity as its splitting criterion grows as follows. First, at the root node, before any attribute is used, the training set divides into (9 Yes, 5 No), and the corresponding Gini index is as follows:
Next, calculate the Gini impurity reduction for each of the four attribute dimensions as the growth criterion of the decision tree:
It is clear that the Outlook attribute yields the largest Gini impurity reduction, meaning that choosing this dimension for the split best separates the training set at this point.
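As a check on this step, the root Gini index and the reduction achieved by splitting on Outlook can be computed directly. The (9 Yes, 5 No) root counts come from the text; the per-value Outlook counts below (Sunny 2/3, Overcast 4/0, Rain 3/2) are taken from the standard version of this play-tennis dataset and, apart from the Overcast (4 Yes, 0 No) split mentioned next, should be read as an assumption.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

root = [9, 5]                         # 9 Yes, 5 No at the root (from the text)
outlook = [[2, 3], [4, 0], [3, 2]]    # Sunny, Overcast, Rain (assumed standard counts)
weighted = sum(sum(c) / sum(root) * gini(c) for c in outlook)
print(round(gini(root), 3))             # 0.459
print(round(gini(root) - weighted, 3))  # ~0.116: the Gini impurity reduction for Outlook
```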
We can observe that when Outlook is Overcast, the samples are (4 Yes, 0 No), which means this branch is already classified correctly and can be terminated. Denoting the Outlook = Sunny branch as OS and the Outlook = Rain branch as OR, the following Gini impurity reductions can be calculated for the Outlook = Sunny branch:
Obviously, choosing Humidity minimizes the randomness, so the child node under Outlook = Sunny is Humidity. At the same time, we observe that all samples in this branch are then correctly classified, so the branch can be terminated.
Similarly, for Outlook = Rain, the calculation is as follows:
Obviously, selecting Wind as the child node reduces the randomness the most. Meanwhile, we observe that the resulting samples are all correctly classified, so this branch also ends. To sum up, we obtain a decision tree with the following structure, which is the solution generated by the recursive greedy strategy (guided by ERM and the principle of Ockham's Razor):
3. Suppressing Overfitting
Overfitting suppression strategies for decision trees can be divided into pre-pruning and post-pruning. Pre-pruning stops the growth of the tree early according to preset rules (such as a maximum number of nodes, a maximum depth, or a minimum required Gini impurity reduction or information gain for a split). Post-pruning prunes after the tree has fully grown, for example by adding a complexity penalty to the loss; branches that contribute little Gini impurity reduction or information gain are removed.
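As a practical illustration, both strategies are exposed by scikit-learn's DecisionTreeClassifier; the sketch below uses the iris dataset purely as a stand-in for the tennis example, and the specific parameter values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early via preset rules, e.g. a maximum depth and a
# minimum impurity decrease a split must achieve to be accepted.
pre_pruned = DecisionTreeClassifier(criterion="gini", max_depth=3,
                                    min_impurity_decrease=0.01).fit(X, y)

# Post-pruning: grow the tree fully, then prune it back with a complexity
# penalty (cost-complexity pruning, controlled by ccp_alpha).
path = DecisionTreeClassifier(criterion="gini").cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(criterion="gini",
                                     ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())
```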