The decision tree

1 Basic Concepts

Amount of information: measures the degree of uncertainty of a single event. The higher the uncertainty, the greater the amount of information. Uncertainty is defined through the probability of the event occurring, and the amount of information is the negative logarithm of that probability:


I(x) = -\log p(x)

Information entropy: measures the degree of uncertainty of an event set; it is the expectation of the amount of information over all events in the set:


H(X) = E_{x\sim X}[I(x)] = E_{x\sim X}[-\log p(x)] = -\sum_{x\in X} p(x)\log p(x)
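
As a quick numeric illustration (a minimal sketch in Python using numpy; the coin probabilities are made-up values for the example), a fair coin has the highest entropy, while a heavily biased coin is more predictable and therefore has lower entropy:

import numpy as np

def entropy(p):
    # H(X) = -sum p(x) * log2 p(x); zero-probability terms contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit, maximum uncertainty
print(entropy([0.9, 0.1]))  # biased coin: about 0.47 bits, less uncertainty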

Relative entropy (KL divergence): an asymmetric measure of the difference between two probability distributions. Viewed from information theory, KL divergence (also called relative entropy) describes the difference in information content between the two distributions:


KL(P\|Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}

Cross entropy: the cross entropy is the sum of the information entropy of the true distribution and the KL divergence. The entropy of the true distribution is fixed and independent of the model parameters θ, so when deriving gradient descent, optimizing the cross entropy is equivalent to optimizing the KL divergence (relative entropy):


H(P, Q) = -\sum_x P(x)\log Q(x)

Joint entropy: measures the information entropy of the larger event set formed by combining two event sets:


H(X, Y) = -\sum_{x, y} p(x, y)\log p(x, y)

Conditional entropy: the conditional entropy of event set Y equals the joint entropy minus the information entropy of event set X; it measures the uncertainty that remains in Y once X is known:


H(Y|X) = H(X, Y) - H(X)
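
These relations can be checked numerically. The following minimal sketch (numpy only; the 2x2 joint distribution and the P, Q pair are made-up examples) verifies H(Y|X) = H(X,Y) - H(X) and H(P,Q) = H(P) + KL(P||Q):

import numpy as np

def H(p):
    # entropy of a probability table, ignoring zero entries
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A made-up joint distribution p(x, y) over two binary variables
pxy = np.array([[0.4, 0.1],
                [0.2, 0.3]])
px = pxy.sum(axis=1)          # marginal p(x)
print(H(pxy) - H(px))         # conditional entropy H(Y|X) = H(X,Y) - H(X)

# Cross entropy = entropy of the true distribution + KL divergence
P = np.array([0.7, 0.3])      # "true" distribution
Q = np.array([0.5, 0.5])      # model distribution
KL = float(np.sum(P * np.log2(P / Q)))
cross = float(-np.sum(P * np.log2(Q)))
print(cross, H(P) + KL)       # the two values match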

2 Principle of decision tree

2.1 Principle

A decision tree is a non-parametric supervised learning method. It summarizes decision rules from data with features and labels, presents those rules as a tree structure, and can solve both classification and regression problems. The goal of a decision tree is to build a tree that can be used for decision making, consisting of a root node, several internal (non-leaf) nodes, and leaf nodes. Each leaf node corresponds to a class, and each non-leaf node corresponds to a test on a feature value.

2.2 Process

  1. Starting from the root node, compute the impurity of each candidate feature (split) and choose the split with the lowest impurity (highest purity) as the split of the current node
  2. Divide the data into two parts according to the chosen split value, and then build the left and right subtrees respectively
  3. If the information gain before and after a split does not exceed a threshold, or all samples at the current node belong to the same class, splitting stops (a simplified recursive sketch of this procedure follows this list)
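
To make the procedure concrete, here is a simplified sketch of the recursion (a hypothetical illustration, not any library's implementation: binary splits on numeric features, Gini impurity as the purity measure, majority-vote leaves):

import numpy as np

def gini(y):
    # Gini impurity of a label vector
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Search every feature/threshold pair for the lowest weighted child impurity
    best_j, best_t, best_score = None, None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t, best_score

def build(X, y, depth=0, max_depth=3, min_gain=1e-3):
    j, t, score = best_split(X, y)
    # Stop when no split reduces impurity enough, the node is pure, or max depth is reached
    if j is None or gini(y) - score < min_gain or depth >= max_depth:
        labels, counts = np.unique(y, return_counts=True)
        return {"leaf": labels[np.argmax(counts)]}   # predict the majority class
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build(X[left], y[left], depth + 1, max_depth, min_gain),
            "right": build(X[~left], y[~left], depth + 1, max_depth, min_gain)}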

2.3 How to determine the optimal split point

  • ID3 – Information gain

    The information gain before and after splitting is calculated and the maximum information gain is selected as the current best splitting feature.

    For a sample set D with K classes, the empirical entropy of D is expressed as

    H(D) = -\sum_{k=1}^{K}\frac{|C_k|}{|D|}\log_2\frac{|C_k|}{|D|}

Why take a logarithm in this formula, and why the negative sign? The logarithm comes from the definition and also makes computation convenient by turning multiplication into addition. The negative sign is needed because probabilities are at most 1, so log p(x) is non-positive; without it the entropy would be negative, which would contradict the intuition that higher uncertainty should mean greater information entropy.

Here C_k is the subset of samples in D that belong to class k, |C_k| is the number of elements in that subset, and |D| is the number of elements in the whole sample set.

Then compute the empirical conditional entropy H(D|A) of data set D given feature A:

H(D|A) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}H(D_i) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\sum_{k=1}^{K}\frac{|D_{ik}|}{|D_i|}\log_2\frac{|D_{ik}|}{|D_i|}

where D_i is the subset of samples in D for which feature A takes its i-th value (n being the number of distinct values of A), and D_{ik} is the subset of D_i belonging to class k.

The information gain g(D, A) is then the difference between the two:

g(D, A) = H(D) - H(D|A)

Conclusion: ID3 prefers features with many distinct values, can only handle discrete data, and is sensitive to missing values.

  • C4.5 – Information gain ratio

    The information gain ratio of feature A with respect to dataset D is defined as

    g_R(D, A) = \frac{g(D, A)}{H_A(D)}

    where

    H_A(D) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}

Conclusion: C4.5 uses the information gain ratio to overcome ID3's preference for features with many distinct values; it introduces the entropy of the feature itself as a denominator to penalize high-cardinality features. As a result, C4.5 tends to prefer features with fewer distinct values. It can handle both discrete and continuous data, and the resulting tree can be multi-way.

  • CART – Gini index

    The Gini index describes the purity of the data, playing a role similar to information entropy:

    Gini(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2

    In each iteration, CART selects the feature with the smallest Gini index and its corresponding split point. Unlike ID3 and C4.5, CART builds a binary tree: at each step the data is cut into two parts according to the value of feature A, which go into the left and right subtrees respectively. The Gini index of feature A, for a split of D into D_1 and D_2, is defined as

    Gini(D, A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)

Conclusion: CART supports both classification and regression, and can handle both discrete and continuous data. However, it only builds binary trees, and a feature that has already been split on can be split on again.
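
To compare how the three criteria score the same candidate split, here is a minimal numeric sketch (the tiny label/feature arrays are made-up data, and base-2 logarithms are used):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

# Made-up data: 10 samples, binary class labels y and one binary feature A
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])
A = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

H_D = entropy(y)
# Empirical conditional entropy H(D|A): weighted entropy of each subset D_i
H_D_A = sum((A == v).mean() * entropy(y[A == v]) for v in np.unique(A))
gain = H_D - H_D_A                    # ID3: information gain
gain_ratio = gain / entropy(A)        # C4.5: divide by the entropy of the feature itself
# CART: Gini index of the binary split on A
gini_index = sum((A == v).mean() * gini(y[A == v]) for v in np.unique(A))

print(H_D, H_D_A, gain, gain_ratio, gini_index)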

2.4 When does splitting stop?

  1. The information gain before and after splitting is less than the given threshold
  2. The number of samples of the current node is less than the given threshold
  3. The samples of the current node belong to the same category
  4. When the tree reaches a certain depth
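
In sklearn's DecisionTreeClassifier these stopping conditions map directly onto constructor parameters (a minimal sketch; the threshold values below are arbitrary examples):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_impurity_decrease=0.01,  # 1. stop splitting when the impurity decrease falls below this threshold
    min_samples_split=10,        # 2. stop splitting a node with fewer samples than this
    max_depth=5,                 # 4. stop when the tree reaches this depth
)
# Condition 3 (all samples at a node share one class) always ends splitting at that node.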

2.5 How to prune?

  • Pre-pruning: during training, if the prediction accuracy does not improve after a split, the split is abandoned
  • Post-pruning: first train a complete tree, then starting from the leaf nodes compute the accuracy; remove the current node and compute the accuracy again; if accuracy improves after removing the node, prune that branch
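
sklearn's decision trees do not expose the exact accuracy-based procedure described above, but they do support a related post-pruning strategy, minimal cost-complexity pruning. A sketch, assuming Xtrain, Xtest, Ytrain, Ytest already exist (for example from the wine split used later in this article):

from sklearn.tree import DecisionTreeClassifier

# Compute the effective alphas of cost-complexity pruning on the training data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(Xtrain, Ytrain)

# Refit one tree per alpha and keep the one that scores best on the test set
scores = []
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(Xtrain, Ytrain)
    scores.append((pruned.score(Xtest, Ytest), alpha))
print(max(scores))  # (best test accuracy, corresponding ccp_alpha)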

2.6 Advantages and disadvantages of decision tree

Advantages

  • The calculation is simple and easy to explain
  • There is no need to normalize data

Disadvantages

  • A single tree is easy to overfit
  • Unstable: small changes in the data can produce a very different tree
  • The greedy algorithm only guarantees a locally optimal split at each node, not a globally optimal tree

3 Decision tree case

3.1 Build a decision tree

Steps

  1. Instantiate, establish evaluation model object
  2. Train the model through model interface
  3. Obtain the required information through the model interface

Decision trees in sklearn

  1. Eight parameters: criterion, two randomness-related parameters (random_state, splitter), and five pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease)
  2. One attribute: feature_importances_
  3. Four interfaces: fit, score, apply and predict

Code

import pandas as pd
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split


def buildTree():
    # Load the wine data set
    wine = load_wine()
    print(wine.data.shape)

    print(wine.target)

    # Organize wine into a table
    pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)
    print(wine.feature_names)
    print(wine.target_names)

    # Divide training set and test set
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)
    print(Xtrain.shape)
    print(Xtest.shape)

    # Build a model
    # random_state sets the seed of the randomness used for branching; splitter controls how split candidates are chosen
    # max_depth limits the maximum depth of the tree; min_samples_leaf & min_samples_split set the minimum number of samples required in a leaf / to split a node
    clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=30)
    clf.fit(Xtrain, Ytrain)
    score = clf.score(Xtest, Ytest)
    print(score)

    # Draw a tree
    import graphviz
    feature_name = wine.feature_names
    dot_data = tree.export_graphviz(clf, feature_names=feature_name, class_names=['class_0', 'class_1', 'class_2'],
                                    filled=True, rounded=True)
    graph = graphviz.Source(dot_data)
    print(graph)

    # Explore the feature importance of decision trees
    print(clf.feature_importances_)
    feature_importance = [*zip(feature_name, clf.feature_importances_)]
    print(feature_importance)

    # apply () returns the index of the leaf node where each test sample resides
    print(clf.apply(Xtest))
    # predict () returns the classification result of each test sample: it returns a one-dimensional array of size N, and the ith value of the one-dimensional array is the label of the ith predicted sample
    print(clf.predict(Xtest))
    # predict_proba returns an array of n rows and K columns. The value on row I and column J is the probability that the model predicts the label J for the ith prediction sample. The sum of each row should be equal to 1
    print(clf.predict_proba(Xtest))

3.2 Determine the optimal pruning parameters through grid search

import matplotlib.pyplot as plt
from sklearn import tree


def max_parameter(Xtrain, Ytrain, Xtest, Ytest):
    test = []
    for i in range(10):
        clf = tree.DecisionTreeClassifier(max_depth=i + 1, criterion="entropy", random_state=30, splitter="random")
        clf = clf.fit(Xtrain, Ytrain)
        score = clf.score(Xtest, Ytest)
        test.append(score)

    plt.plot(range(1, 11), test, color="red", label="max_depth")
    plt.legend()
    plt.show()

3.3 K-fold cross validation

def cross_val():
    """K-fold cross-validation"""
    from sklearn.datasets import load_boston
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor
    # Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this snippet needs an older version
    boston = load_boston()
    regressor = DecisionTreeRegressor(random_state=0)
    val_score = cross_val_score(regressor, boston.data, boston.target, cv=10, scoring="neg_mean_squared_error")
    print(val_score)

3.4 Prediction of Titanic survivors

def titanic_example():
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import cross_val_score
    import matplotlib.pyplot as plt
    import numpy as np

    data = pd.read_csv(r"data.csv", index_col=0)
    print(data.head())
    print(data.info())

    # Delete columns with too many missing values, plus columns judged by inspection to be unrelated to the target y
    data.drop(["Cabin", "Name", "Ticket"], inplace=True, axis=1)

    # Handle missing values: fill the column with many missing values (Age) with its mean; rows where a feature is missing only one or two values can simply be dropped
    data["Age"] = data["Age"].fillna(data["Age"].mean())
    data = data.dropna()

    # Convert binary variables to numeric variables
    # Unlike apply(lambda x: int(x)), astype can convert text categories to numbers; this makes it easy to turn a binary feature into 0/1
    data["Sex"] = (data["Sex"] == "male").astype("int")

    # Convert tripartite variables to numeric variables
    labels = data["Embarked"].unique().tolist()
    data["Embarked"] = data["Embarked"].apply(lambda x: labels.index(x))

    # View the processed data set
    print(data.head())

    # Extract the label and feature matrix, then split into training and test sets
    X = data.iloc[:, data.columns != "Survived"]
    y = data.iloc[:, data.columns == "Survived"]

    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3)

    # Fix indexes for test sets and training sets
    for i in [Xtrain, Xtest, Ytrain, Ytest]:
        i.index = range(i.shape[0])

    # View the training set and test set
    print(Xtrain.head())

    # Build a model
    clf = DecisionTreeClassifier(random_state=25)
    clf = clf.fit(Xtrain, Ytrain)
    score_ = clf.score(Xtest, Ytest)
    print(score_)
    score = cross_val_score(clf, X, y, cv=10).mean()
    print(score)

    # Observe the model fitting status at different MAX_depths
    tr = []
    te = []
    for i in range(10):
        clf = DecisionTreeClassifier(random_state=25
                                     , max_depth=i + 1
                                     , criterion="entropy"
                                     )
        clf = clf.fit(Xtrain, Ytrain)
        score_tr = clf.score(Xtrain, Ytrain)
        score_te = cross_val_score(clf, X, y, cv=10).mean()
        tr.append(score_tr)
        te.append(score_te)
    print(max(te))
    plt.plot(range(1, 11), tr, color="red", label="train")
    plt.plot(range(1, 11), te, color="blue", label="test")
    plt.xticks(range(1, 11))
    plt.legend()
    plt.show()

    # Adjust parameters with grid search
    gini_thresholds = np.linspace(0, 0.5, 20)

    parameters = {'splitter': ('best', 'random'),
                  'criterion': ("gini", "entropy"),
                  "max_depth": [*range(1, 10)],
                  'min_samples_leaf': [*range(1, 50, 5)],
                  'min_impurity_decrease': [*gini_thresholds]}

    clf = DecisionTreeClassifier(random_state=25)
    GS = GridSearchCV(clf, parameters, cv=10)
    GS.fit(Xtrain, Ytrain)

    print(GS.best_params_)
    print(GS.best_score_)