—  Illustrations by Cornelius Bierer —

The decision tree algorithm is very close to everyday life. Through the following example you will see that we all use decision trees every day, whether we realize it or not.

Xiao Ming goes to school every morning. He can walk, take the bus, or ride with Uncle Wang next door. So Xiao Ming makes a decision: first he looks at the weather. If it is not raining, he walks to school; if it is raining, he checks whether Uncle Wang next door is free. If Uncle Wang is free, Xiao Ming rides in his car; if not, he takes the bus. As shown in the figure.

[Figure: Xiao Ming's decision tree]

Decision tree definition

From the above example we can define a decision tree: the picture above is one. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes, which represent a feature or attribute (the weather, whether Uncle Wang is free), and leaf nodes, which represent a class (walking, taking the bus, riding in Uncle Wang's car).
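To make the structure concrete, the same tree can be written down as nested branches. Here is a minimal sketch as a Python dict, the same kind of representation the code later in this article builds; the key names are only illustrative:

# Xiao Ming's decision tree as nested dicts (key names are illustrative)
commute_tree = {
    'raining': {
        'no': 'walk',
        'yes': {
            'uncle Wang free': {
                'yes': "ride in Uncle Wang's car",
                'no': 'take the bus',
            }
        },
    }
}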

Principle of decision tree algorithm

So how does a decision tree algorithm construct this tree? (Is it the hand of God?) The case above is fairly simple, so consider a classic example instead, shown in the figure: should we split first on weather, on humidity, or on wind level? This is where entropy and information gain come in.

[Figure: the play-ball dataset (weather, humidity, wind level)]

Entropy

First, let's look at what entropy is. Simply put, entropy describes how chaotic (or uncertain) something is. For example, whether China will get into the World Cup carries almost no uncertainty, so its entropy is essentially 0. A six-sided die is less uncertain than a twelve-sided die, so its entropy is also lower. The formula is H = -∑ p(x_i) * log2 p(x_i), summed over i = 1, …, n. For the six-sided die this gives -6 * 1/6 * log2(1/6) = log2(6) ≈ 2.585, and likewise the twelve-sided die gives log2(12) ≈ 3.585. In the dataset in the figure, 5 of the 14 days are "don't play" and 9 are "play", so the entropy works out as:

from math import log

-(5/14 * log(5/14, 2) + 9/14 * log(9/14, 2))   # ≈ 0.940
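The same formula can be wrapped in a small helper to check the numbers above. This is just a sketch; the function name entropy is not part of the code developed later:

from math import log

def entropy(probs):
    # Shannon entropy H = -sum(p * log2(p)) over a list of probabilities
    return -sum(p * log(p, 2) for p in probs if p > 0)

print(entropy([1/6] * 6))      # ≈ 2.585, the six-sided die
print(entropy([1/12] * 12))    # ≈ 3.585, the twelve-sided die
print(entropy([5/14, 9/14]))   # ≈ 0.940, don't play vs. play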

Information gain

Which feature should we split the dataset on first? The guiding principle is to make disordered data more ordered, in other words, to make the entropy as small as possible. The information gain is the change in entropy before and after a split, so we want the split with the largest information gain. Taking weather as an example, let us calculate the information gain after splitting:

  • Sunny days: 2/5 play, 3/5 don't play, so the entropy is -(2/5 * log(2/5, 2) + 3/5 * log(3/5, 2)) ≈ 0.971
  • Cloudy days: entropy 0
  • Rainy days: entropy 0.971

[Figure: splitting on weather]

The probabilities of sunny, cloudy, and rainy days are 5/14, 4/14, and 5/14, so the entropy after the split is 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971 ≈ 0.693, and the information gain is 0.940 - 0.693 = 0.247. The information gain of the other features can be calculated in the same way.
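In code, the calculation for the weather feature looks roughly like this, using the numbers derived above:

# (weight, entropy) pairs for sunny, cloudy and rainy days
weather_splits = [(5/14, 0.971), (4/14, 0.0), (5/14, 0.971)]
after_split = sum(w * h for w, h in weather_splits)   # ≈ 0.693
info_gain = 0.940 - after_split                       # ≈ 0.247
print(after_split, info_gain)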

The information gain for weather is the largest here, so weather is chosen as the first feature to split on.

After weather is chosen as the first split, branches whose samples are all correctly classified become leaves and stop there; for the branches that are not, we compute the information gain of the remaining features and repeat the same procedure. The result is shown in the figure.

[Figure: the resulting decision tree]

Pseudo code

So building a decision tree is a recursive algorithm, with pseudo-code as follows:

def createBranch():
    check whether all class labels in the dataset are the same:
        if so: return the class label
        else:
            find the best feature to split the dataset on
            (the feature with the lowest entropy after splitting, i.e. the largest information gain)
            split the dataset
            create a branch node
            for each subset of the split:
                call createBranch() and add the result to the branch node
            return the branch node

Classifying marine organisms with a decision tree

Problem description and data

The task is to judge whether an animal is a fish. The data has two features:

  • Whether it can survive without coming to the surface ('no surfacing')
  • Whether it has flippers ('flippers')

[Figure: the marine-organism example]

Here we construct the data by hand:

def createDataSet():
    # each row is [no surfacing, flippers, class label]
    dataSet = [
        [1, 1, 'yes'],
        [1, 1, 'yes'],
        [1, 0, 'no'],
        [0, 1, 'no'],
        [0, 1, 'no'],
    ]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

The dataSet is the data, and labels are the names of the two features.

Calculate the entropy

Here we define a function to calculate the entropy of the dataset:

from math import log

def calcshannon(dataSet):
    num = len(dataSet)
    labelCounts = {}
    # count how many samples fall into each class label (the last column)
    for featVec in dataSet:
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannon = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / num
        shannon -= prob * log(prob, 2)
    return shannon

The code is straightforward: it computes the entropy of the last column of the incoming dataset (that is, the class labels).
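As a quick sanity check on the dataset constructed above (assuming both functions live in the same module):

dataSet, labels = createDataSet()
print(calcshannon(dataSet))   # ≈ 0.971, since the labels are 2 'yes' vs 3 'no'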

Partition data set

First, we write a function to split the dataset. Its parameters are the dataset to split, the index of the feature to split on, and the value of that feature to keep. This function is called inside the choose function below, which finds the best feature to split on.

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # keep every column except the one we split on
            reducedVec = featVec[:axis]
            reducedVec.extend(featVec[axis+1:])
            retDataSet.append(reducedVec)
    return retDataSet

def choose(dataSet):
    numfeature = len(dataSet[0]) - 1   # the last column is the class label
    baseEntropy = calcshannon(dataSet)
    bestinfogain = 0.0
    bestfeature = -1
    for i in range(numfeature):
        featlist = [example[i] for example in dataSet]
        vals = set(featlist)
        # weighted entropy of the subsets produced by splitting on feature i
        newEntropy = 0.0
        for value in vals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcshannon(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestinfogain:
            bestinfogain = infoGain
            bestfeature = i
    return bestfeature
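On the fish data this picks feature 0 ('no surfacing'), whose information gain (about 0.42) is larger than that of 'flippers' (about 0.17). For example:

dataSet, labels = createDataSet()
print(splitDataSet(dataSet, 0, 1))   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(choose(dataSet))               # 0, i.e. 'no surfacing'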

Create a tree

When all the features have been used up but the data still cannot be split cleanly, majority voting is used to decide the class of the leaf node. The code is as follows, similar to the sorting in the kNN code from the previous article.

import operator

def majority(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedcount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedcount[0][0]
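A quick check of the voting:

print(majority(['yes', 'no', 'no', 'yes', 'no']))   # 'no'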

Finally, the code to create the tree:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop if every sample in this branch has the same class label
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop if no features are left; fall back to majority voting
    if len(dataSet[0]) == 1:
        return majority(classList)
    bestFeat = choose(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    vals = [example[bestFeat] for example in dataSet]
    uvals = set(vals)
    for value in uvals:
        sublabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(
            splitDataSet(dataSet, bestFeat, value), sublabels)
    return myTree

There are two conditions that terminate the recursion: either all samples in a branch belong to the same class, or all the features have been used up.
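Putting it all together on the fish data (passing a copy of labels, since createTree deletes entries from the list it is given):

dataSet, labels = createDataSet()
print(createTree(dataSet, labels[:]))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}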

[Figure: the resulting tree]

Advantages and disadvantages of the algorithm

  • Advantages: Easy to understand
  • Disadvantages: easy to overfit


Luo Pan is a columnist for the Python Chinese community and author of Learning Web Crawlers from Scratch. The column address: www.jianshu.com/u/9104ebf5e…



The Python Chinese community is a decentralized global technical community whose vision is to become the spiritual home of 200,000 Python developers worldwide. It currently has a presence on the major mainstream media and collaboration platforms and has established broad ties with well-known companies and technical communities such as Alibaba, Tencent, Baidu, Microsoft, Amazon, Open Source China, and CSDN. It has tens of thousands of registered members from more than ten countries and regions, coming from government departments, research institutions, financial institutions, and well-known companies at home and abroad, including the Ministry of Public Security, the Ministry of Industry and Information Technology, Tsinghua University, Peking University, Beijing University of Posts and Telecommunications, the People's Bank of China, the Chinese Academy of Sciences, CICC, Huawei, BAT, Google, and Microsoft, with nearly 200,000 developers following the platform.
