Regression and classification

There are two kinds of problems we encounter in machine learning all the time, one is regression problem and the other is classification problem. It’s very easy to see, literally, that the classification problem is basically dividing the data that we have into categories, and then dividing the new data according to the categories that we have; The regression problem is to fit the existing data into a function and predict the new data according to the fitted function. The difference is the type of output variable. Regression is the quantitative output, or the prediction of continuous variables; Classification problem book quantitative output, prediction of discrete variables. Post a picture I saw on Zhihu, which explains it very well:

How to distinguish classification from regression, look not at the input, but at the output. For example, predicting whether it will be sunny or rainy tomorrow is a classification problem, while predicting how much tomorrow will be is a regression problem.

Regression tree and classification tree

In decision trees, there are also the concepts of regression tree and classification tree. In the difference between the two, the regression tree uses the maximum mean square error to divide nodes, and the mean value of each node sample is used as the regression predicted value of test samples. The classification tree uses information gain or information gain ratio to divide nodes, and the category of each node sample is voted to determine the category of the test sample. We can see that the difference between the two mainly lies in the way of division and working mode. Regression tree adopts the method of maximum mean square error (MSE) to accurately process data and output continuous variables, which can better predict our data. The classification tree uses a very broad information gain variable to better grasp the classification of the data set as a whole.

Information gain

That’s the definition of information gain. It may be simple enough to understand that a new node and the original system have mutual information. In other words, considering one more eigenvalue can provide more information for the whole decision-making process. However, we must pay attention to the calculation process of information gain. Here I have posted examples from Statistical Learning Methods for your reference:

In his book the Beauty of Mathematics, Dr. Wu Jun warned that the only way to better determine whether something is right or wrong is to introduce more information. For this decision tree, of course, we tend to choose features with greater self-uncertainty, because the greater the self-uncertainty, the more information will be introduced once it is determined. That way, the judgment is more effective for the system as a whole.

Information gain ratio

Information gain ratio, some books are also called information gain ratio. Data gain is much less than information gain. The difference is the total entropy divided by the information gain. What are the specific benefits? We will mention them in the C4.5 algorithm below. But once again, the most important thing that we need to pay attention to is the calculation of the information gain ratio, and we’ve already calculated the information gain, and the total entropy is the same as H(D). And then we divide and we get what we want.


ID3 algorithm

The core of ID3 algorithm is to select features by information gain on each node in the decision tree. When ID3 algorithm generates trees, the information gain of all alternative features is calculated first, and then the feature of the next node is selected.

In the definition and formula of the information gain, we must pay attention to the H(D) algorithm. We solve for the entropy of an entire set, generally solving for the information gain directly affects what nodes we choose to operate on for the next feature, and we set a threshold, e, and stop the cycle when the information gain is less than this threshold. In order not to affect your understanding, I write the algorithm completely here:

ID3: Input: training data set D, feature set A, threshold e; Output: decision tree T 1. If all instances of D in China belong to the same class Ck, T is a single-node tree, and Ck is taken as the class mark of this node, and T is returned; 2. If A= empty set, T is A single-node tree, and the class Ck with the largest number of instances in D is taken as the class mark of this node, and T is returned; 3. Otherwise, calculate the information gain of each feature in A to D (see my previous blog on decision tree basis for details). Select the feature Ag with the maximum information gain; 4. If the information gain of Ag is less than the threshold e, T is set as a single-node tree, and the class Ck with the largest number of real cases in D is taken as the class mark of this node, and T is returned. 5. Otherwise, for every possible value of Ag ai; According to Ag=ai, D is divided into several non-empty subsets Di. The class with the largest number of real cases in Di is used as a marker to construct the child node. The tree T is formed by the node and its child nodes, and T is returned. 6. For the ith child node, with Di as the training set and A-{Ag} as the feature set, recursively call (1) ~ (5) to get the subtree Ti and return Ti.

To deepen your understanding, I will post an example from Statistical Learning Methods by Li Hang:

ID3 algorithm only generates trees, and does not control the fitting degree or cut branches, so the tree generated by this algorithm is easy to produce over-fitting.


C4.5 algorithm

The C4.5 algorithm is very similar to the ID3 algorithm and is an improvement on it. The only difference with ID3 is that C4.5 uses information gain ratio instead of information gain to select features.

Post C4.5 algorithm:

Input: training data set D, feature set A, threshold e; Output: decision tree T

  1. If all instances in D belong to the same class Ck, T is a single-node tree, Ck is taken as the class mark of this node, and T is returned.
  2. If A= empty set, T is A single-node tree, and the class Ck with the largest number of instances in D is taken as the class mark of this node, and T is returned.
  3. Otherwise, calculate the information gain of each feature in A to D (see my previous blog on the basis of decision trees for details). Select the feature Ag with the maximum information gain;
  4. If the information gain of Ag is less than the threshold e, T is set as a single-node tree, and the class Ck with the largest number of real cases in D is taken as the class mark of this node, and T is returned.
  5. Otherwise, ai for every possible value of Ag; According to Ag=ai, D is divided into several non-empty subsets Di. The class with the largest number of real cases in Di is used as a marker to construct the child node. The tree T is formed by the node and its child nodes, and T is returned.
  6. For the ith child node, Di is taken as the training set and A-{Ag} as the feature set. Recursively call (1) ~ (5) to get the subtree Ti and return Ti.

Why do we say that C4.5 is an improvement over ID3? So we start with the difference between the two which is the information gain versus the information gain ratio. Let’s paste the definition of information gain and information gain ratio again:

By comparison, we can see that information gain is the mutual information between features and training sets, or the difference between the uncertainty of the original data set and the uncertainty after determining one of the features, which is called information gain. That is, the amount of information that this feature introduces. The information gain ratio is the ratio of this mutual information to the uncertainty of D.

When we have a property that has a lot of values, like the variety of watermelon, there are dozens of varieties, so there are dozens of values. Through calculation, we can find that the uncertainty of such attributes is greater than those with only one or two values (because many values make the whole data set very chaotic), so the computer preferentially selects the attributes with many values when processing.

Therefore, ON the basis of ID3, C4.5 adopts the judgment method of information gain ratio to punish attributes with many values.