Small knowledge, big challenge! This article is participating in the “Essential Tips for Programmers” creation activity.

A case in point

For example, when looking for a job and sizing up a company, the first thing we care about is salary. If the offer meets our bottom line, or exceeds what we currently earn, we go on to ask whether there are two-day weekends, and whether the five social insurances and one housing fund are paid in full. Each of these questions narrows things down toward a final decision. What we are doing, in effect, is building a decision tree.
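The job-hunting decision above can be sketched as nested conditions. The thresholds and field names here (`salary`, `two_day_weekend`, `full_benefits`) are made up purely for illustration:

```python
def accept_offer(salary: int, two_day_weekend: bool, full_benefits: bool) -> bool:
    """Walk the 'decision tree': each question is one internal node."""
    if salary < 10000:          # below our bottom line -> reject immediately
        return False
    if not two_day_weekend:     # next node: the weekly schedule
        return False
    return full_benefits        # leaf node: the final decision

print(accept_offer(12000, True, True))   # -> True
print(accept_offer(9000, True, True))    # -> False
```

Each `if` is an internal node, each outcome of the test is a branch, and each `return` is a leaf.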

  • Decision tree definition
  • The objective function
  • Overfitting and pruning
  • Advantages and disadvantages of decision trees
  • Ensemble learning

I have wanted to write about decision trees for a while. Because there is a lot of material to cover, I will take my time and keep updating and improving this article until it is finished. I hope you will support it.

The decision tree

We classify data by using features as internal nodes, and the possible values of each feature label the splits. Each value corresponds to a branch, and under each branch there is either another feature node or a leaf node. A leaf node cannot be split any further.

  • Internal nodes correspond to features
  • Edges correspond to feature values
  • Leaf nodes correspond to class labels
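The node/edge/leaf structure above can be sketched as a tiny data structure. The names (`feature`, `children`, `label`) are illustrative, not from any particular library:

```python
class Node:
    """One node of a decision tree."""
    def __init__(self, feature=None, label=None):
        self.feature = feature   # feature tested at this node (None for a leaf)
        self.children = {}       # feature value -> child Node (the edges)
        self.label = label       # class label if this node is a leaf

    def is_leaf(self):
        return self.feature is None

# Build a tiny tree: split on "salary", with a leaf under each branch.
root = Node(feature="salary")
root.children["high"] = Node(label="consider")
root.children["low"] = Node(label="reject")

print(root.children["low"].label)  # -> reject
```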

Decision trees can be used for regression or classification, though mainly classification. Decision trees are a supervised learning method.
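As a quick supervised-classification sketch, assuming scikit-learn is available (its `DecisionTreeClassifier` is the standard decision-tree estimator; `DecisionTreeRegressor` is the regression counterpart). The toy data here is invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy labeled data: feature = [salary in thousands], label = 1 (accept) / 0 (reject).
X = [[8], [9], [15], [20]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier()
clf.fit(X, y)               # supervised: the tree learns splits from labeled examples
print(clf.predict([[18]]))  # -> [1]
```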

The objective function of the decision tree

The objective function determines which features are selected for splitting, and the order in which they are chosen.

The information entropy

Information entropy is one of the most common ways to measure the impurity of a sample set: the higher the entropy, the more mixed the classes.


H(X) = -\sum_{i=1}^{n} p_i \log p_i
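The formula can be computed directly from a list of class labels. Using log base 2 (entropy in bits) is an assumption here; the natural log works the same way up to a constant factor:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy H(X) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))  # -> 1.0 (maximally mixed)
print(entropy(["yes"] * 4))                 # entropy 0 for a perfectly pure set
```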

Information gain

For a set containing only two classes of samples, the entropy can be written as follows. The entropy reaches its maximum, i.e., the set is most disordered, when the two classes occur in equal numbers.


H(X) = -p \log p - (1-p) \log (1-p)
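A quick numerical check of the claim, again using log base 2 (with the usual convention that 0 log 0 = 0):

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p log p - (1-p) log(1-p), with 0*log(0) taken as 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Entropy peaks at p = 0.5, where the two classes are equally mixed,
# and falls symmetrically toward 0 as the set becomes pure.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p}: H={binary_entropy(p):.3f}")
```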