• Github: github.com/yingzk/MyML
• Blog: www.yingjoy.cn

1. Introduction to the Decision Tree

1.1. Principle of decision tree

A decision tree is a relatively simple supervised-learning classification algorithm. A decision tree is a predictive model: it represents a mapping between object attributes and object values. Each node in the tree represents an object, each forked path represents a possible attribute value, and each leaf corresponds to the value of the object reached along the path from the root to that leaf. A decision tree has a single output; if more complex output is desired, an independent decision tree can be built for each output. In short, it is the process of classifying data by a set of rules.

Example: deciding whether to go out and play on the weekend.

A decision tree is the structure obtained by transforming such a data set into a tree.

1.2. The construction process of decision tree

1. Feature selection: feature selection means choosing one feature from the many features in the training data as the splitting criterion of the current node. Different quantitative criteria for evaluating features lead to different decision tree algorithms, such as CART, ID3 and C4.5.
2. Decision tree generation: child nodes are generated recursively from top to bottom according to the chosen feature-evaluation criterion, and the tree stops growing when the data set can no longer be split. Recursion is the most natural way to think about a tree structure.
3. Pruning: decision trees overfit easily, so pruning is generally needed to reduce the size of the tree and alleviate overfitting. There are two kinds of pruning techniques: pre-pruning and post-pruning.

Pseudo code

if the termination condition is met:
    return the class label
else:
    search for the optimal feature to split the data set
    create a branch node
    split the data set and build each branch, attaching the results to the branch node
    return the branch node

1.3. Advantages and disadvantages of decision trees

Decision trees handle both numerical and nominal features (discrete data whose values come from a finite set). They can read a data set and extract the rules contained in its columns. Decision tree models have many advantages for classification: low computational complexity, ease of use, high efficiency, the ability to handle data with irrelevant features, and easily constructed rules that are usually easy to explain and understand. They also have some disadvantages, such as difficulty handling missing data, a tendency to overfit, and ignoring correlations between attributes in the data set.

1.4. Introduction to three decision tree algorithms

The core question of any decision tree algorithm is: how do we select the attribute to split on?

The earliest decision tree algorithm originated from the CLS (Concept Learning System).

Quantitative criteria are needed to make this choice, and there are many of them. One family of criteria measures class impurity using information theory; decision tree algorithms based on information theory include ID3, CART, C4.5 and so on.

The ID3 algorithm was invented by Ross Quinlan and is based on Occam’s razor: a smaller decision tree is preferred over a larger one (“be simple”). In ID3, features are evaluated and selected by the information gain from information theory, and at each step the feature with the largest information gain is chosen as the decision node. ID3 can be used to partition nominal data sets and does not prune. To reduce over-fitting to the data, adjacent leaf nodes that do not produce much information gain can be merged by pruning (for example, by setting an information-gain threshold). Using information gain has a drawback: it favors attributes with many values; that is, the more distinct values an attribute takes in the training set, the more likely it is to be chosen as the splitting attribute, which is sometimes meaningless. ID3 also cannot handle continuously distributed features, hence the C4.5 algorithm (and now the commercial C5.0 algorithm). The CART algorithm also supports continuously distributed features.

C4.5 is an improved version of ID3 and inherits its advantages. C4.5 uses the information gain ratio to select the splitting attribute, which overcomes the bias toward attributes with many values that arises when plain information gain is used. It can discretize continuous attributes and handle incomplete data. The classification rules generated by C4.5 are easy to understand and accurate, but the algorithm is inefficient, because the data set must be scanned and sorted sequentially many times during tree construction. Because of these repeated scans, C4.5 is only suitable for data sets that can reside in memory.
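As a reminder of what “information gain ratio” means (this is the standard textbook definition, not code from this article), let $g(D, A)$ be the information gain of attribute $A$ on data set $D$, as defined in (3) below. C4.5’s splitting criterion is then:

$$\operatorname{GainRatio}(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$$

where $D_v$ is the subset of $D$ on which attribute $A$ takes its $v$-th value. The denominator $H_A(D)$ (the split information) grows with the number of values of $A$, which is what removes the bias toward many-valued attributes.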

The full name of the CART algorithm is Classification And Regression Tree. It uses the Gini index as the splitting criterion (choosing the feature with the smallest Gini index) and also includes post-pruning. Although ID3 and C4.5 can mine as much information as possible from the training set, the decision trees they generate have many branches and a large scale. To simplify the tree and improve the efficiency of tree generation, the CART algorithm, which selects test attributes according to the Gini coefficient, was introduced.

Entropy: measures the uncertainty of a random variable (and hence the purity of a set).

(1) Information entropy

In probability theory, information entropy gives us a measure of uncertainty: it measures the uncertainty of a random variable, and entropy is the expected value of information. If the items to be classified can fall into $N$ classes with probabilities $p_1, p_2, \ldots, p_N$ respectively, then the entropy of data set $D$ is defined as:

$$H(D) = -\sum_{i=1}^{N} p_i \log_2 p_i$$

From the definition, $0 \le H(D) \le \log_2 N$.

When the random variable takes only two values, for example $P(X=1) = p$ and $P(X=0) = 1 - p$ with $0 \le p \le 1$, the entropy is:

$$H(D) = -p \log_2 p - (1-p) \log_2 (1-p)$$

The higher the entropy, the more mixed the data is. The more possible values a variable can take (the entropy does not depend on the specific values themselves, only on how many distinct values there are and their probabilities), the more information it carries.
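As a quick sanity check of the formula, the minimal sketch below (illustrative only, separate from the ID3 implementation in Section 2) computes the entropy of a list of class labels; the 9-versus-5 split used here is an arbitrary example.

import numpy as np
from collections import Counter

def entropy(labels):
    """H(D) = sum over classes of -p_i * log2(p_i), with p_i the class proportion."""
    total = len(labels)
    return sum(-(c / total) * np.log2(c / total) for c in Counter(labels).values())

# A mixed set: 9 samples of one class and 5 of another.
print(entropy(['yes'] * 9 + ['no'] * 5))   # about 0.940
# A pure set has zero entropy.
print(entropy(['yes'] * 5))                # 0.0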

(2) Conditional entropy

Suppose we have random variables $(X, Y)$ whose joint probability distribution is:

$$P(X = x_i, Y = y_j) = p_{ij}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, m$$

The conditional entropy $H(Y \mid X)$ denotes the uncertainty of the random variable $Y$ given the random variable $X$. It is defined as the expectation, over $X$, of the entropy of the conditional distribution of $Y$:

$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i), \quad \text{where } p_i = P(X = x_i)$$

(3) Information gain

Information gain is the reduction in uncertainty about $Y$ once the information of feature $X$ is known. It is defined as:

$$g(Y, X) = H(Y) - H(Y \mid X)$$
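The two definitions above translate directly into code. The sketch below is again illustrative only; it repeats the small entropy helper from the previous sketch, and the toy weather data is invented for the example.

import numpy as np
from collections import Counter

def entropy(labels):
    """Same helper as in the entropy sketch above."""
    total = len(labels)
    return sum(-(c / total) * np.log2(c / total) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """H(Y|X): entropy of the labels within each feature value, weighted by P(X=x)."""
    total = len(labels)
    h = 0.0
    for x in set(feature_values):
        subset = [y for f, y in zip(feature_values, labels) if f == x]
        h += (len(subset) / total) * entropy(subset)
    return h

def information_gain(feature_values, labels):
    """g(Y, X) = H(Y) - H(Y|X)."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

# The feature separates the two classes perfectly,
# so the gain equals the full entropy of the labels (1 bit here).
x = ['sunny', 'sunny', 'rainy', 'rainy']
y = ['go out', 'go out', 'stay in', 'stay in']
print(information_gain(x, y))   # 1.0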

(4) Gini index

Gini index (Gini impurity): the probability that a sample randomly chosen from the sample set would be misclassified (if it were labeled randomly according to the class distribution of the set).

Note: the smaller the Gini index, the smaller the probability that a sample chosen from the set is misclassified, i.e. the higher the purity of the set; conversely, the larger the Gini index, the more impure the set.

So the Gini index equals the probability of a sample being chosen times the probability of it being misclassified:

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
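In code this is again only a few lines; the sketch below is illustrative (the 5-versus-5 example is made up) and is not used by the ID3 implementation that follows.

from collections import Counter

def gini(labels):
    """Gini(p) = 1 - sum over classes of p_k^2, with p_k the class proportion."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(['yes'] * 5 + ['no'] * 5))   # 0.5, the most impure two-class split
print(gini(['yes'] * 5))                # 0.0, a pure set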

2. Python implementation of ID3

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd
import operator

def loadDataSet():
    """Load the data set. @return dataSet: the data read from the CSV file @return labelSet: the list of column names"""
    # Read the CSV file; the last column is the class label
    dataSet = pd.read_csv('isFish.csv', delimiter=',')
    # dataSet = dataSet.replace('yes', 1).replace('no', 0)
    labelSet = list(dataSet.columns.values)
    dataSet = dataSet.values
    return dataSet, labelSet

def calcShannonEnt(dataSet):
    """Calculate the information entropy (Shannon entropy) of the given data set. @param dataSet: data set @return shannonEnt: Shannon entropy"""
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        # The class of the current sample (last column)
        currentLabel = featVec[-1]
        # If the current class is not yet in labelCounts, add it
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * np.log2(prob)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    """Split the data set on a given feature. @param dataSet: data set @param axis: index of the feature to split on @param value: feature value to keep @return retDataSet: samples whose feature equals value, with that feature column removed"""
    retDataSet = []
    for featVec in dataSet:
        # Keep the samples whose feature at `axis` equals `value`, dropping that column
        if featVec[axis] == value:
            reducedFeatVec = list(featVec[:axis])
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeature(dataSet):
    """Select the optimal splitting attribute. @param dataSet: data set @return bestFeature: index of the optimal splitting attribute"""
    # Number of features (the last column is the class label)
    numFeature = len(dataSet[0]) - 1
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeature):
        # Get all values taken by the i-th feature
        featureList = [example[i] for example in dataSet]
        # Remove duplicate values
        uniqueVals = set(featureList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # The proportion of samples with this value of feature i
            prob = len(subDataSet) / float(len(dataSet))
            # Conditional entropy: weight the subset entropy by its proportion
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy

        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    """Return the most frequent class. @param classList: list of classes @return sortedClassCount[0][0]: the class with the most occurrences"""
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort by count in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Return the class with the most occurrences
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    """Construct the decision tree. @param dataSet: data set @param labels: feature names @return myTree: decision tree"""
    classList = [example[-1] for example in dataSet]
    # Stop when every sample belongs to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # When all features have been used up, return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)

    # Get the best splitting attribute
    bestFeat = chooseBestFeature(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # Remove labels[bestFeat] from the remaining feature names
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        # Recursively build the subtree for this value
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
    

if __name__ == '__main__':
    dataSet, labelSet = loadDataSet()
    shannonEnt = calcShannonEnt(dataSet)
    tree = createTree(dataSet, labelSet)
    print(tree)
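The script above expects a file named isFish.csv in the working directory. If that file is not available, the hedged sketch below builds a tiny "is it a fish?" data set inline (the column names and values are assumptions based on the file name, not the contents of the original CSV) and passes it to the functions defined above:

# Assumes the functions above (createTree, etc.) are already defined in the session.
# Columns: no surfacing, flippers, class label (values are illustrative).
dataSet = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]
labelSet = ['no surfacing', 'flippers']
print(createTree(dataSet, labelSet))
# With this data the output should be (key order may vary):
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}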

Other algorithms need to be supplemented…