1. What is a decision tree

1.1 Basic idea of a decision tree

In fact, a simple example makes the fundamental difference between the LR model and the decision tree model easy to see. Consider a decision problem: whether to go on a blind date, when a girl's mother introduces a potential match to her.

The contrast should now be clear: the LR model crams all of the features into the learner at once, whereas a decision tree works more like the if-else statements of a programming language, making one conditional judgment after another. That is the fundamental difference.
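To make the if-else analogy concrete, here is a minimal Python sketch. The attributes (age, income, looks) and the thresholds are hypothetical, invented only to illustrate how a tree reaches a decision through a chain of conditions:

```python
# A decision tree "thinks" in nested if-else checks, one attribute at a time.
# The attributes and thresholds below are hypothetical, for illustration only.
def blind_date_decision(age, monthly_income, is_good_looking):
    """Return True if the (hypothetical) girl agrees to meet the candidate."""
    if age > 30:
        return False              # first check: too old -> decline
    if monthly_income < 5000:
        return False              # second check: income too low -> decline
    return is_good_looking        # final check: the decision rests on looks

print(blind_date_decision(age=28, monthly_income=8000, is_good_looking=True))  # True
```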

1.2 Growth process of “tree”

A decision tree makes decisions based on a tree structure, which leaves us facing two problems:

  • How does the tree grow?
  • When does the tree stop growing?

Once these two problems are understood, the model is essentially established. The overall process of a decision tree follows the idea of “divide and conquer”: it is a recursive process from the root down to the leaves, and at each internal node we look for a “partition” attribute, which corresponds to a feature of the data. Let’s tackle each of these problems in turn.

When does the tree stop growing

  • All samples in the current node belong to the same category, so no further division is needed. For example, if every sample in a node chose to go on the blind date, they all share the same label: however their other attributes vary, the outcome does not change, so there is nothing left to split.
  • The current attribute set is empty, or all samples take the same value on every remaining attribute, so the node cannot be divided. For example, if all samples have identical features there is no attribute left to split on; the training set is too homogeneous.
  • The sample set at the current node is empty, so it cannot be divided.

1.3 How does the “tree” grow

In life we all make many decisions: where to eat, where to buy electronics, where to travel, and so on. You will notice that many of these choices simply follow what most other people choose, that is, following the crowd. A decision tree works the same way: once most of the samples at a node agree, the decision has effectively been made.

We can abstract this “choice of the crowd” into the idea of purity: if you think about it, the more the samples agree, the higher the purity. Going one step further, it comes down to a single sentence: the lower the entropy, the higher the purity. Most people have heard of entropy in some form; information entropy is used as a measure of how much “information” a set contains. If the samples all look alike, the set feels monotonous and undifferentiated and carries little information; conversely, if the samples differ a lot, the set carries a great deal of information.

This is where the headaches usually start, because I have to introduce the formula for information entropy; it is actually quite simple:

$$\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k$$

where $p_k$ is the proportion of samples of the $k$-th class in the current sample set $D$.
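As a quick check on the formula, here is a minimal NumPy sketch (the helper name `entropy` is ours, not taken from the article's repository):

```python
import numpy as np

def entropy(labels):
    """Information entropy Ent(D) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()               # p_k: proportion of each class
    return np.sum(p * np.log2(1.0 / p))     # equals -sum_k p_k * log2(p_k)

print(entropy(["yes", "yes", "yes"]))       # 0.0: a pure set carries no surprise
print(entropy(["yes", "no", "yes", "no"]))  # 1.0: a 50/50 split is maximally mixed
```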

Information gain

Without further ado, straight to the formula:

$$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$

If the formula looks opaque, in plain words it is: information entropy before the partition minus the (weighted) information entropy after the partition. It measures how big a step the partition takes in the direction of higher purity.
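A small sketch of the same idea in code, reusing the `entropy` helper above (the data layout, one attribute value per sample, is an assumption for illustration):

```python
import numpy as np

def information_gain(labels, attribute_values):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v),
    where attribute_values holds attribute a's value for each sample."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    weighted_child_entropy = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]                  # the subset D^v
        weighted_child_entropy += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted_child_entropy
```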

Now, with that in mind, we can begin to grow the tree.

1.3.1 ID3 algorithm

Explanation: first compute the information entropy of the root node; then, for each candidate attribute, split the node by that attribute and compute the entropy of the resulting child nodes; the root node’s entropy minus this (weighted) child entropy is the information gain of that attribute. Sort the attributes by information gain in descending order: the attribute at the front becomes the first partition attribute, and so on down the tree. This gives the decision tree its shape, in other words, this is how it “grows”.

If this still is not clear, you can work through the example figures shared alongside this article (four figures, omitted here) together with the explanation above, and the sketch below shows the attribute-selection step in code; I guarantee you will understand.
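Here is a minimal sketch of ID3's attribute-selection step, reusing the `entropy` and `information_gain` helpers above. The tiny blind-date dataset is made up for illustration:

```python
# ID3: at each node, split on the attribute with the largest information gain.
data = {
    "age":    ["young", "young", "old", "old", "young"],
    "income": ["high",  "low",   "low", "low", "high"],
}
labels = ["meet", "decline", "decline", "decline", "meet"]

gains = {a: information_gain(labels, values) for a, values in data.items()}
best = max(gains, key=gains.get)   # attribute with the largest Gain(D, a)
print(gains)                       # roughly {'age': 0.42, 'income': 0.97}
print("split on:", best)           # split on: income
```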

However, information gain has a problem: it prefers attributes with a large number of distinct values. Consider, for example, an “ID number” attribute: it puts every sample in its own perfectly pure subset, so its information gain is maximal even though it is useless for prediction. To solve this problem, another algorithm, C4.5, was introduced.

1.3.2 C4.5

To fix this problem with information gain, the information gain ratio is introduced:

$$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}$$

where:

$$\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

The more possible values attribute $a$ has (that is, the larger $V$ is), the larger $\mathrm{IV}(a)$ tends to be. **The essence of the information gain ratio: it multiplies the information gain by a penalty factor. When the attribute has many values, the penalty factor is small; when the attribute has few values, the penalty factor is large.** It has just one drawback:

  • Drawback: the information gain ratio is biased toward attributes with few values.

How the gain ratio is used: because of this drawback, C4.5 does not simply pick the attribute with the maximum gain ratio. Instead, it first finds the candidate attributes whose information gain is above average, and then, from among those, selects the attribute with the highest information gain ratio.
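A minimal sketch of this two-step selection, reusing the helpers and the toy dataset from the ID3 sketch above (the function names are ours, not from the article's repository):

```python
import numpy as np

def intrinsic_value(attribute_values):
    """IV(a): entropy of the split proportions |D^v| / |D|."""
    _, counts = np.unique(attribute_values, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * np.log2(1.0 / p))

def c45_select(data, labels):
    """C4.5 heuristic: among attributes whose information gain is above average,
    pick the one with the highest gain ratio."""
    gains = {a: information_gain(labels, v) for a, v in data.items()}
    average_gain = np.mean(list(gains.values()))
    candidates = [a for a, g in gains.items()
                  if g >= average_gain and intrinsic_value(data[a]) > 0]
    ratios = {a: gains[a] / intrinsic_value(data[a]) for a in candidates}
    return max(ratios, key=ratios.get)

print(c45_select(data, labels))    # 'income' for the toy blind-date data above
```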

1.3.3 CART algorithm

Clever mathematicians came up with yet another way to measure purity, called the Gini index (another pesky formula):

$$\mathrm{Gini}(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2$$

It represents the probability that a randomly selected sample from the set would be labelled incorrectly by a random guess drawn from the same set. For intuition: if I have a bag of balls in three different colors and reach in twice, it is the chance that the two balls I pull out have different colors. The smaller $\mathrm{Gini}(D)$ is, the higher the purity of dataset $D$.
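A minimal NumPy sketch of the Gini index (the helper name `gini` is ours):

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: the chance that two randomly drawn samples
    from D belong to different classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(["red", "red", "red"]))      # 0.0: a pure bag of balls
print(gini(["red", "green", "blue"]))   # ~0.667: three colors, very mixed
```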

For example

Suppose there is a feature “education level” with three possible values: “bachelor”, “master”, “doctor”.

When this feature is used to divide the sample set D, there are three partition values and therefore three possible binary divisions; the subsets after each division are:

1. Partition point “bachelor”, subsets after partition: {bachelor}, {master, doctor}

2. Partition point “master”, subsets after partition: {master}, {bachelor, doctor}

3. Partition point “doctor”, subsets after partition: {doctor}, {bachelor, master}

For each of the above partitions, we can compute the purity of dividing sample set D into two subsets according to “feature A = some value A_i”:

$$\mathrm{Gini}(D, A_i) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

Therefore, for a feature A with more than two values, we must compute the purity Gini(D, A_i) of the subsets obtained when D is split with each possible value as the dividing point (where A_i denotes a possible value of feature A).

Then, among all of the possible partitions Gini(D, A_i), the one with the minimum Gini index is found; its partition point is the best split point for sample set D on feature A. Applying this recursively, as in the sketch below, the tree grows into a “big tree”.
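A minimal sketch of this search for a categorical feature, reusing the `gini` helper above; the tiny education/label data is made up for illustration:

```python
import numpy as np

def gini_index(labels, attribute_values, split_value):
    """Gini(D, A = split_value): weighted Gini of the two subsets produced by
    the binary split {A == split_value} vs. {A != split_value}."""
    labels = np.asarray(labels)
    mask = np.asarray(attribute_values) == split_value
    d1, d2 = labels[mask], labels[~mask]
    return len(d1) / len(labels) * gini(d1) + len(d2) / len(labels) * gini(d2)

education = ["bachelor", "master", "doctor", "master", "bachelor"]
target    = ["no", "yes", "yes", "yes", "no"]

scores = {v: gini_index(target, education, v) for v in set(education)}
best_value = min(scores, key=scores.get)    # partition with the smallest Gini index
print(scores)                               # bachelor: 0.0, master: ~0.27, doctor: 0.4
print("best partition point:", best_value)  # best partition point: bachelor
```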

1.3.4 Three different decision trees

  • ID3: an attribute with many values splits the data into purer subsets, so its information gain is larger.

    The result is a wide, shallow tree: unreasonable.

  • C4.5: uses the information gain ratio instead of the information gain.

  • CART: replaces entropy with the Gini index, minimizing impurity rather than maximizing information gain.

2. Why doesn’t a tree model need normalization?

Because rescaling the numerical values does not change where the split points fall, it does not affect the structure of the tree model. Splits are chosen by sorting the feature values, and the sort order is unchanged by monotonic scaling, so the branches and split points stay the same. In addition, a tree model does not use gradient descent: a tree (e.g. a regression tree) is built by searching for the best split point, so the model is a step function, the step points are not differentiable, and taking derivatives is meaningless; hence there is no need for normalization.

Since tree models (such as decision trees and random forests) do not need normalization, why do non-tree models such as AdaBoost, SVM, LR, KNN and K-Means need it?

For a linear model, when the feature scales differ greatly, the loss contours are elongated ellipses and gradient descent needs many iterations to reach the optimum. After normalization, the contours become close to circles, so (stochastic) gradient descent heads much more directly toward the minimum and needs far fewer iterations.
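A small scikit-learn sketch of the first point: rescaling a feature only rescales the learned threshold, so the tree's structure and predictions do not change (sklearn and the synthetic data are our choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 1)                      # one feature in [0, 1)
y = (X[:, 0] > 0.5).astype(int)

tree_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X * 1000, y)

# The threshold scales with the feature, but the split (and predictions) stay the same.
print(tree_raw.tree_.threshold[0], tree_scaled.tree_.threshold[0])            # ~0.5 vs ~500
print(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X * 1000)))     # True
```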

3. The difference between classification decision tree and regression decision tree

Classification And Regression Tree (CART) is a kind of decision tree. The CART algorithm can be used to build both classification trees and regression trees, and the tree-building process differs slightly between the two.

Regression tree:

A CART regression tree is a binary tree built by repeatedly splitting on features. For example, suppose the current node is split on the $j$-th feature at threshold $s$: samples whose $j$-th feature value is no greater than $s$ go to the left subtree, and those whose value is greater than $s$ go to the right subtree:

$$R_1(j, s) = \{x \mid x^{(j)} \le s\}, \qquad R_2(j, s) = \{x \mid x^{(j)} > s\}$$
In essence, a CART regression tree partitions the sample space along these feature dimensions, and finding the optimal partition of the space is an NP-hard problem, so the decision tree model solves it heuristically. The objective function of a typical CART regression tree is:

$$\sum_{x_i \in R_m} \left(y_i - f(x_i)\right)^2$$

Therefore, to find the optimal splitting feature $j$ and the optimal split point $s$, we solve the following objective:

$$\min_{j,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(j,s)} \left(y_i - c_1\right)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} \left(y_i - c_2\right)^2\right]$$
So, by traversing all candidate split points of all features, we can find the optimal splitting feature and split point, and by recursing on the two resulting regions, as in the sketch below, we eventually end up with a regression tree.
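A minimal sketch of this exhaustive search for a single node (plain NumPy; the toy data is for illustration only):

```python
import numpy as np

def best_split(X, y):
    """Find the splitting feature j and split point s that minimize the total
    squared error of the two resulting regions; the inner minimization over
    c1, c2 is solved by taking each region's mean."""
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            loss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s, best_loss

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9])
print(best_split(X, y))   # splits feature 0 at s = 3.0
```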

Reference article: Classical algorithm details –CART classification decision tree, regression tree and model tree

4. How is a decision tree pruned

The basic pruning strategies for decision trees are pre-pruning and post-pruning.

  • Pre-pruning: the core idea is that, before each further split of a node, a validation set is used to check whether the split actually improves accuracy. If it does not, the node is marked as a leaf and no further partitioning is done there; if it does, nodes keep being generated recursively.
  • Post-pruning: first grow a complete decision tree from the training set, then examine the non-leaf nodes from bottom to top. If replacing a node’s subtree with a leaf would improve generalization performance, the subtree is replaced with a leaf.
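A small scikit-learn sketch of both ideas. Note that sklearn's built-in post-pruning is cost-complexity pruning (ccp_alpha), a related but not identical criterion to the leaf-replacement test described above; here a validation set is used only to choose the pruning strength:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Pre-pruning: restrict growth up front (depth / leaf-size constraints).
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow a full tree, then prune, choosing ccp_alpha on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
scores = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train).score(X_val, y_val)
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)

print("pre-pruned accuracy :", pre.score(X_val, y_val))
print("post-pruned accuracy:", post.score(X_val, y_val))
```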

Reference article: Decision tree and its generation and pruning

5. Code implementation

GitHub:github.com/NLP-LOVE/ML…



Author: @mantchs

GitHub:github.com/NLP-LOVE/ML…

Welcome to join the discussion! Work together to improve this project! Group Number: [541954936]