This article was first published on my personal public account: TechFlow. Original content is not easy to produce, so please consider following.


Today is the 23rd article in our machine learning series, and we are going to share the CART algorithm, one of the top ten data mining algorithms.

The full name of the CART algorithm is Classification And Regression Tree. Like ID3 and C4.5, CART is a classic implementation of the decision tree model. There are three classic ways to implement a decision tree; we have already covered ID3 and C4.5, and today we fill in the last one.


Algorithm characteristics


CART is called a classification and regression tree because, as the name suggests, it supports both classification and regression. Yes, decision trees can do regression, but in practice we rarely use them for it. There are many reasons for this: besides the limited fitting ability of a tree model, the regression result is strongly affected by the form of the features. We won't go into depth on this here; we will discuss it in detail in a later article on regression trees.

Because the regression tree does not perform particularly well, the CART decision tree is mostly used for classification problems. So, for classification, how does it differ from the ID3 and C4.5 algorithms we saw before?

The first difference is that CART uses the Gini index instead of information gain (or gain ratio) as the criterion for splitting a node. The second is that CART always splits the data into exactly two parts rather than several. Since CART splits the data in two each time, there is no limit on how many times a feature can be used, whereas C4.5 limits each feature to being used at most once. Because of this, CART also places higher demands on pruning: without pruning, the tree is likely to grow far too large and overfit.


Gini index


ID3 and C4.5 split the data using information gain and information gain ratio respectively, both of which are built on the information entropy model. There is nothing wrong with entropy itself; it is a very common model. The only problem is that computing entropy involves log operations, which take much longer than basic arithmetic.

The Gini index is essentially still based on the entropy model, but the calculation is transformed so that the log operation is avoided, which speeds up the computation; the underlying logic stays the same. So how is the speed-up achieved? With the Taylor expansion from calculus: the log is expanded by the Taylor formula and turned into a polynomial, which makes the entropy calculation much faster.

Let’s do a simple derivation:


By the Taylor formula, $\ln(1 + x) = x + o(x)$, where $o(x)$ is a higher-order infinitesimal with respect to $x$. Substituting $x = p_i - 1$, we get $\ln p_i \approx p_i - 1$. Plugging this into the formula for information entropy:

$$H(X) = -\sum_i p_i \ln p_i \approx -\sum_i p_i (p_i - 1) = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$$


This is the formula for the Gini index, where $p_i$ is the probability of category $i$, which in practice is the proportion of samples belonging to category $i$ out of all samples. You can also read $1 - \sum_i p_i^2$ as the probability that two samples drawn from the data set belong to different categories.

Therefore, the smaller the Gini index, the more concentrated, that is, the purer, the data set is. Conceptually this plays the same role as information entropy: the smaller the entropy, the more concentrated the information. The two are very similar. So when we use the Gini index as the splitting criterion, we choose the split with the smallest Gini index after splitting, not the largest.
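A quick worked check (the numbers here are my own, not from the article): a node that is an even 50/50 mix of two classes gives

$$\text{Gini} = 1 - \left(0.5^2 + 0.5^2\right) = 0.5,$$

while a pure node gives $\text{Gini} = 1 - 1^2 = 0$, matching the intuition that smaller means purer.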

From the formula above, we can see that compared with the log operation in entropy, the Gini index only needs proportions and basic arithmetic, which is obviously much faster. And because it is a Taylor approximation, the overall behaviour is very close; the classic comparison plot below gives a feel for this:

As the figure shows, the Gini index tracks information entropy very closely, so it reflects the purity of a split just as well.
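For anyone who wants to reproduce that comparison, here is a minimal sketch (my own code, assuming numpy and matplotlib are available) that plots the Gini index against halved entropy for a two-class problem; entropy is halved only so that the two curves peak at the same height:

import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)   # probability of the positive class
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini = 1 - (p ** 2 + (1 - p) ** 2)   # Gini index: 1 - sum of squared proportions

plt.plot(p, entropy / 2, label='entropy / 2')
plt.plot(p, gini, label='Gini index')
plt.xlabel('proportion of the positive class')
plt.legend()
plt.show()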


Split and prune


As mentioned when we introduced the characteristics of the CART algorithm, CART splits the data into two parts each time, much like the way C4.5 handles continuous features. The differences are that CART does this for both discrete and continuous features, and that a feature in CART can be used repeatedly.

For example, suppose the diameter of a watermelon is one of the features. In the earlier algorithms, once we split on "diameter less than 10 cm", the diameter feature is removed from the data and never used again. That is not the case in CART. Say we first split on the diameter and then on whether the watermelon has a vine: in ID3 and C4.5, the diameter can no longer be used as a splitting criterion afterwards, but in CART it still can. We are free to reuse features that have already been used.

Let’s show it with a picture that looks something like this:

Looking at the left-most subtree, the diameter feature shows up more than once, which actually makes sense. However, it also creates a problem: since there is no "each feature can only be used once" restriction, the tree can keep expanding, and with many continuous features it easily overfits. To prevent overfitting and improve the model's generalization ability, we need to prune the resulting tree.

There are two mainstream pruning schemes: pre-pruning and post-pruning. Pre-pruning restricts the growth of the tree while it is being built, to prevent overfitting. Post-pruning removes the overfitted parts of the tree after it has been built. Pre-pruning is the easier one to understand. For example, during training we can allow a node to split only if it contains at least a certain number of samples, otherwise it is kept as a leaf node; or we can stop growing a node once the proportion of some class in it exceeds a threshold.
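A minimal sketch of the two pre-pruning checks just described (the function name and default thresholds are my own, not from the article's code):

from collections import Counter

def should_stop(labels, min_samples=5, purity_threshold=0.95):
    # Pre-pruning: stop splitting if the node is too small
    # or is already dominated by a single class.
    if len(labels) < min_samples:
        return True
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels) >= purity_threshold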

Post-pruning is a bit more involved: after the tree has been generated, we use some mechanism to find the parts that can be cut off. The pruning strategy commonly used in the CART algorithm is CCP, whose full English name is Cost-Complexity Pruning. This strategy defines a measure of the cost of pruning a subtree, and we prune by comparing that cost against a threshold.

The essence of this strategy is the following:


$$c = \frac{R(t) - R(T_t)}{|T_t| - 1}$$

In this formula, $c$ is the cost of pruning, $t$ is the node left after pruning (the subtree collapsed into a single leaf), and $T_t$ is the subtree before pruning, with $|T_t|$ denoting its number of leaf nodes. $R(t)$ is the error cost after pruning and $R(T_t)$ is the error cost before pruning. The error cost of a node is defined as $R(t) = r(t)\,p(t)$, where $r(t)$ is the error rate of node $t$ and $p(t)$ is the proportion of the data that falls on $t$; for the subtree, $R(T_t)$ is the sum of the error costs of its leaves.
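As a small sketch (my own helper, working on plain numbers rather than the article's tree data structure), the cost could be computed like this:

def ccp_cost(node_error_cost, subtree_error_cost, num_subtree_leaves):
    # c = (R(t) - R(T_t)) / (|T_t| - 1)
    # node_error_cost:    R(t), the error cost if the subtree is collapsed into one leaf
    # subtree_error_cost: R(T_t), the summed error cost of the subtree's leaves
    # num_subtree_leaves: |T_t|, the number of leaves before pruning
    return (node_error_cost - subtree_error_cost) / (num_subtree_leaves - 1)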

Let's look at an example (the original figure shows a small subtree covering 100 samples): substituting each node's error rate and data proportion into the formula gives the error cost after pruning and the error cost of the subtree before pruning, and from these two values we obtain c.
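Purely to illustrate the arithmetic (these numbers are my own, not the article's figure): suppose there are 1,000 samples in total, the node covers 100 of them ($p(t) = 0.1$) and would misclassify 25 of those if collapsed into a leaf ($r(t) = 0.25$), while its subtree has three leaves covering 50, 30 and 20 samples with error rates 0.1, 0 and 0.2. Then:

$$R(t) = 0.25 \times 0.1 = 0.025, \qquad R(T_t) = 0.1 \times 0.05 + 0 \times 0.03 + 0.2 \times 0.02 = 0.009$$

$$c = \frac{0.025 - 0.009}{3 - 1} = 0.008$$

If the pruning threshold we set is larger than 0.008, this subtree gets pruned; otherwise it is kept.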

The larger c is, the more extra error pruning would introduce, so the less we should prune; conversely, the smaller c is, the less pruning hurts, so the more readily we can prune. We just need to set a threshold and compute c for each subtree to decide whether it can be pruned.


Code implementation


We have already implemented the C4.5 algorithm, so implementing CART is relatively easy: it no longer needs separate handling for discrete features, since everything can be treated as continuous.

We just need to change the previous information gain ratio to Gini index:

import numpy as np
from collections import Counter

def gini_index(dataset):
    dataset = np.array(dataset)
    n = dataset.shape[0]
    if n == 0:
        return 0
    # gini = sigma p(1-p) = 1 - sigma p^2
    counter = Counter(dataset[:, -1])
    ret = 1.0
    for k, v in counter.items():
        ret -= (v / n) ** 2
    return ret

def split_gini(dataset, idx, threshold):
    left, right = [], []
    dataset = np.array(dataset)
    n = dataset.shape[0]
    # Split on the threshold, then compute the Gini index of the result
    for data in dataset:
        if data[idx] < threshold:
            left.append(data)
        else:
            right.append(data)
    left, right = np.array(left), np.array(right)
    # Weight each half's Gini index by its proportion of the data
    return left.shape[0] / n * gini_index(left) + right.shape[0] / n * gini_index(right)
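A quick usage check (the toy data here is mine, just to illustrate the calls; the last column is the label):

toy = np.array([
    [1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]   # feature in column 0, label in column 1
])
print(gini_index(toy))           # 0.5: the labels are a 50/50 mix
print(split_gini(toy, 0, 2.5))   # 0.0: the threshold separates the two classes perfectly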

Then the split-selection function needs a slight adjustment, because the smaller the Gini index, the better, whereas with information gain and information gain ratio, bigger was better. The framework of the code is basically unchanged, with a few tweaks:

def choose_feature_to_split(dataset):
    n = len(dataset[0]) - 1
    # Track the best (smallest) Gini index, the feature and the threshold
    bestGini = 1.0
    feature = -1
    thred = None
    for i in range(n):
        # get_thresholds comes from the earlier C4.5 article:
        # it returns the candidate split points for feature i
        threds = get_thresholds(dataset, i)
        for t in threds:
            # Try every threshold and compute the Gini index after splitting on it
            gini = split_gini(dataset, i, t)
            if gini < bestGini:
                bestGini, feature, thred = gini, i, t
    return feature, thred

The tree-construction and prediction parts are basically the same as in the previous C4.5 article; we only need to remove the handling of discrete features. You can refer to the code in that article.
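For completeness, here is a minimal sketch of what the construction recursion might look like on top of the functions above. This is my own simplification, assuming get_thresholds from the previous article is available; the real version in that article stores more information for prediction:

import numpy as np
from collections import Counter

def build_tree(dataset, min_samples=5):
    # Simplified CART construction: every split is binary and features may be reused.
    dataset = np.array(dataset)
    labels = list(dataset[:, -1])
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping (pre-pruning) conditions: node too small or already pure
    if len(dataset) < min_samples or len(set(labels)) == 1:
        return majority
    feature, thred = choose_feature_to_split(dataset)
    if feature < 0 or thred is None:
        return majority
    left = dataset[dataset[:, feature] < thred]
    right = dataset[dataset[:, feature] >= thred]
    # Guard against a split that fails to separate the data
    if len(left) == 0 or len(right) == 0:
        return majority
    return {
        'feature': feature,
        'threshold': thred,
        'left': build_tree(left, min_samples),
        'right': build_tree(right, min_samples),
    }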


Conclusion


At this point, we have covered the decision tree model from basic decision tree principles to ID3, C4.5, and CART algorithms. This is enough knowledge to handle interview questions about the decision tree model.

In actual production, plain decision trees are hardly used anymore, almost never in fact. However, the idea behind them is very important: it is the foundation of many later models, such as random forest and GBDT, which are all built on top of decision trees. So a deep understanding of how decision trees work matters a great deal for the more advanced topics that follow.

Finally, I have uploaded the complete code to paste.ubuntu. If you need it, reply "Decision tree" in the background of the official account to get the link.

If you like this article, please give it a like or a follow to encourage me; it also makes it easier for you to find more of my articles.