So far, we have covered the support vector machine (SVM), decision tree, KNN, naive Bayes, linear regression, and logistic regression. For the other algorithms, please allow Taoye to leave an IOU here for now; they will be filled in when there is time and opportunity later.

The updates so far have also received some praise from readers. It is not much, but thank you very much for your support, and I hope everyone who reads this series finds it rewarding.

The entire content of this series was written by Taoye by hand, with reference to a number of books and open-source projects. The series totals around 150,000 words (including source code) and will continue to be filled in over time. More technical articles can be found on Taoye's official WeChat account: Cynical Coder. The documents may be circulated freely, but please do not modify their contents.

If there is anything in the article you do not understand, feel free to ask directly, and Taoye will reply as soon as he sees it. You are also welcome to privately nudge Taoye for updates on the WeChat account Cynical Coder, where his personal contact information is available. There are some things Taoye can only tell you privately there. (#‘O’)

To improve your reading experience, Taoye's series of hand-written machine learning articles has been compiled into PDF and HTML, available for free by replying [666] on the official account [Cynical Coder].

Previously, we have explained in detail the initial optimization process of linear SVM and SMO. For details, please refer to:

  • Machine Learning in Action — Dissecting support vector machines and optimizing SMO
  • Machine Learning in Action — Dissecting support vector machines, hand-coding a linear SVM

We will save non-linear SVM for next week.

In this article, we will first take a look at decision trees. Compared with SVM, the decision tree is much simpler and involves no complicated formulas. In my understanding, a decision tree simply means **training a tree-structured model from existing data, so that we can predict data whose classification result is unknown according to this model**. A simple decision tree is somewhat similar to the binary sort tree we learned about in data structures. Of course, besides classification, decision trees can also perform regression, but classification is what we mainly discuss here.

There will probably be two articles on decision trees. The first one, this article, mainly explains the basic background of decision trees and the training process, focusing on theory, derivation, and understanding. The second will be the actual coding of a decision tree; since the coding is carried out on the premise of being familiar with the theoretical basis, it is necessary to understand the theory here first. If possible, there will be a third article, mainly about pruning operations to prevent decision tree overfitting.

1. What is a decision tree

As for the definition of decision tree, Li Hang — Statistical Learning Methods (2nd Edition) describes it as follows:

A classification decision tree model is a tree structure that describes the classification of instances. The decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class.

The internal nodes mentioned above can be understood as the attribute features of a sample, the leaf nodes as the sample's label, and the directed edges as the different values an attribute feature can take. In general, when drawing decision trees, internal nodes are represented by boxes or rectangles, while leaf nodes are represented by circles or ellipses. (Note: this is not absolute and may differ across sources. For example, Statistical Learning Methods uses circles for internal nodes and squares for leaf nodes, while Machine Learning in Action uses the opposite convention.)

Boring text always reduces the reader’s desire to read, so let’s use a practical example to enhance the reader’s understanding of the decision tree.

For example, there are often multiple indicators when describing the beauty and ugliness of a girl, and this example feels a bit dangerous, so let’s change it.

Example source: Li Hang — Statistical Learning Methods (2nd edition)

For example, suppose we have no money and want to go to the bank for a loan. The bank will decide whether to lend to us according to our personal characteristics. Assume there are four attribute features, namely age, job, house, and credit status. These attribute features correspond to internal nodes, the label (the lending result) corresponds to leaf nodes, and the different values of each attribute feature, such as youth, middle age, and old age, or fair, good, and very good, correspond to directed edges.

For this, we can manually draw a simple decision tree model based on our feelings:
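The figure aside, such a hand-drawn tree can also be written down directly as a nested Python dict, the representation used in Machine Learning in Action. The sketch below is only an illustration of the idea; the node order (house first, then job) and the labels are assumptions for illustration, not the result of any training:

```python
# A minimal, hypothetical sketch of a hand-drawn loan decision tree.
# Internal nodes are attribute features, the keys under them are the
# directed edges (attribute values), and the leaves are labels
# ('Y' = grant the loan, 'N' = refuse).
hand_made_tree = {
    "house": {
        "yes": "Y",                      # owns a house            -> lend
        "no": {
            "job": {
                "yes": "Y",              # no house, but has a job -> lend
                "no": "N",               # neither house nor job   -> do not lend
            }
        },
    }
}

def classify(tree, sample):
    """Walk down the nested dict until a leaf (a label string) is reached."""
    while isinstance(tree, dict):
        feature = next(iter(tree))       # the attribute tested at this node
        tree = tree[feature][sample[feature]]
    return tree

print(classify(hand_made_tree, {"house": "no", "job": "yes"}))  # -> 'Y'
```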

We know that the above decision tree is only drawn manually by ourselves and may not be applicable to real problems. The task of decision tree learning is to obtain such a tree-structured model from an existing data sample set, so as to predict the classification of unknown data. In order to obtain as ideal a decision tree as possible, we have to consider the following questions:

  • In what order should the indicators (attribute features) be selected when building the decision tree? Which features should be given priority?
  • If the constructed decision tree overfits, that is, it performs well on the training data but poorly on the test data, how should we solve this problem?

In this article, the main content is the priority selection of attribute features.

2. Selection of attribute features

For the selection of attribute features, we should first choose the features that have obvious classification ability for the training data set, so as to improve the learning efficiency of our decision tree. If the classification result obtained from an attribute feature is basically no different from random classification, we can say that the attribute feature basically has no classification ability.

For example, suppose there are 12 data samples about girls' beauty and ugliness, six of them ugly and six of them beautiful. Among the six ugly girls, three are 165 cm tall and three are 185 cm tall, and the six beautiful girls have exactly the same height distribution. In that case, judging beauty by height alone basically has no classification ability. (This is just an example and does not represent an opinion.)

Therefore, in our ideal situation, the generation of decision tree should be like this: with the continuous progress of the decision process, we want the samples contained in the branch nodes of the decision tree to belong to the same category (with the same label) as much as possible, that is, the “purity” of nodes is getting higher and higher.

Therefore, **the key technique of a decision tree lies in the selection of attribute features**. For this selection, we usually have three criteria: information gain, information gain ratio (gain rate), and the Gini index.

2.1 Information gain

As mentioned above regarding sample "purity": for example, if I have 10 girls here and all 10 are beauties, then the purity of the sample is high; if beauties and ugly girls split fifty-fifty, then the purity is lower. That is what "purity" means here, and it refers mainly to the labels (results).

The most commonly used indicator of sample purity is **information entropy**. Assuming that the proportion of the $k$-th class (out of $n$ classes in total) in the current sample set $D$ is $p_k$, then the information entropy of $D$ is:


$$Ent(D) = -\sum_{k=1}^{n} p_k \log_2 p_k \tag{2-1}$$

Using the above formula, we can see that the information entropy of the 10 beauties (a single category) is:


$$Ent(D) = -1 \cdot \log_2 1 = 0$$

And the information entropy when beauties and ugly girls are split half and half is:


$$Ent(D) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$$

Through the above simple calculation of information entropy, we can know that the smaller the value of information entropy is, the higher the purity is.
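As a quick sanity check of these two hand calculations, here is a minimal sketch using numpy; the `entropy` helper below is only for illustration and is separate from the implementation that follows:

```python
import numpy as np

def entropy(proportions):
    """Information entropy of a label distribution given as class proportions."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([1.0]))                  # 10 beauties, a single class -> 0.0
print(entropy([0.5, 0.5]))             # half beauties, half not     -> 1.0
```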

Now that we know the formula for information entropy and what it means, how do we calculate it in code for a full data sample set? Let's try it.

The data samples used this time are from Statistical Learning Methods (second edition) by Li Hang. The data are about loan applications, as follows:

Now let’s calculate the information entropy of this data sample set in code form:

First, create an establish_data method that sets up a 2-D array storing the data sample set above (for the meaning of the attribute values 0, 1, 2, etc., please refer to the table above):

"" Author: Taoye wechat public number: Coder Explain: "" def establish_data(): Data = [[0, 0, 0, 0, 'N'], # The last item indicates whether or not the loan [0, 0, 0, 1, 'N'], [0, 1, 0, 1, 'Y'], [0, 1, 1, 0, 'Y'], [0, 0, 0, 0, 'N'], [1, 0, 0, 0, 'N'], [1, 0, 0, 1, 'N'], [1, 1, 1, 1, 'Y'], [1, 0, 1, 2, 'Y'], [1, 0, 1, 2, 'Y'], [2, 0, 1, 2, 'Y'], [2, 0, 1, 1, 'Y'], [2, 1, 0, 1, 'Y'], [2, 1, 0, 2, 'Y'], [2, 0, 0, 0, 'N']] return np.array(data)Copy the code

We then create a calc_information_entropy method to calculate the information entropy, which can be coded directly from the formula above. In addition, in order to group the samples by lending result, we use the pandas package. You may refer to the documentation for the use of pandas, which Taoye will also write up later; of course, it is also possible to do this without the pandas module. The calc_information_entropy code is as follows:

"" Author: Taoye wechat 下 载 : Coder Explain: Calculate information entropy "" def calc_information_entropy(data): data_number, _ = data.shape information_entropy = 0 for item in pd.DataFrame(data).groupby(_ - 1): print(item[1]) proportion = item[1].shape[0] / data_number information_entropy += - proportion * np.log2(proportion) return information_entropyCopy the code

Let’s run the above code to see the specific information entropy result:
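For reference, the loan data set contains 9 samples labelled 'Y' and 6 labelled 'N', so calc_information_entropy should return approximately:

$$Ent(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971$$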

Now that we know what information entropy is and can calculate it for a data sample set, let's move on to information gain.

Suppose an attribute feature $a$ has $V$ possible values $\{a^1, a^2, \dots, a^V\}$. If we use $a$ to split the sample set $D$, $V$ branch nodes are produced, where the $v$-th branch node contains all samples in $D$ that take the value $a^v$ on attribute $a$, denoted $D^v$. Thus, the **information gain** obtained by splitting the sample set $D$ on attribute feature $a$ can be calculated as:


$$Gain(D, a) = Ent(D) - \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) \tag{2-2}$$

The above is the explanation of information gain given in Zhou Zhihua's Machine Learning.

Boring text always reduces the reader's desire to read, and the symbols above are a bit much, so let's use the loan data above to understand information gain.

Let's calculate the information gain of age alone, which has three possible values: youth, middle age, and old age. That is, $a = \text{age}$, $\{a^1, a^2, a^3\} = \{\text{youth}, \text{middle age}, \text{old age}\}$, and the data sample set $D$ is the 15 samples above. Since age has three possible values, three branches are produced, i.e. $V = 3$. We have already calculated $Ent(D) = 0.971$, so what remains is to compute $\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$.

Through the data, we observed the following information:

  • In terms of age, there are five samples each of youth, middle age, and old age.
  • Of the five young people, two were granted loans and three were not; of the five middle-aged people, three were granted loans and two were not; of the five elderly people, four were granted loans and one was not.

For this, we can calculate:


$$\begin{aligned} \sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v) & = \frac{5}{15}\left(-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}\right) \\ & + \frac{5}{15}\left(-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}\right) \\ & + \frac{5}{15}\left(-\frac{4}{5}\log_2\frac{4}{5}-\frac{1}{5}\log_2\frac{1}{5}\right) = 0.888 \end{aligned}$$

From this, the information gain corresponding to the age attribute feature is:


$$Gain(D, \text{age}) = 0.971 - 0.888 = 0.083$$

We can think of the second term as a conditional entropy. In addition, the larger the information gain, the stronger the classification ability of the corresponding attribute feature, i.e. the more it should be selected first.

Similarly, we can calculate the corresponding information gain of job, house and credit attributes:


$$\begin{aligned} & Gain(D, \text{job}) = 0.324 \\ & Gain(D, \text{house}) = 0.420 \\ & Gain(D, \text{credit}) = 0.363 \end{aligned}$$

By comparing the information gain values of each attribute feature, we find that house has the largest, so it should be our preferred attribute feature.

Knowing how information gain is calculated and what it means, let's now compute it in code (see the comments for the meaning of each step):

```python
import numpy as np
import pandas as pd

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Create the loan data set
"""
def establish_data():
    data = [[0, 0, 0, 0, 'N'],         # The last item indicates whether or not the loan is granted
            [0, 0, 0, 1, 'N'],
            [0, 1, 0, 1, 'Y'],
            [0, 1, 1, 0, 'Y'],
            [0, 0, 0, 0, 'N'],
            [1, 0, 0, 0, 'N'],
            [1, 0, 0, 1, 'N'],
            [1, 1, 1, 1, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [2, 0, 1, 2, 'Y'],
            [2, 0, 1, 1, 'Y'],
            [2, 1, 0, 1, 'Y'],
            [2, 1, 0, 2, 'Y'],
            [2, 0, 0, 0, 'N']]
    return np.array(data)

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Calculate information entropy
"""
def calc_information_entropy(data):
    data_number, _ = data.shape
    information_entropy = 0
    for item in pd.DataFrame(data).groupby(_ - 1):          # group by the label column
        proportion = item[1].shape[0] / data_number
        information_entropy += - proportion * np.log2(proportion)
    return information_entropy

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Extract the samples whose attribute at `axis` equals `value`
"""
def handle_data(data, axis, value):
    result_data = list()
    for item in data:
        if item[axis] == value: result_data.append(item)
    return result_data

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Calculate the information gain of each attribute feature and return the index of the best one
"""
def calc_information_gain(data):
    feature_number = data.shape[1] - 1                      # Number of attribute features
    base_entropy = calc_information_entropy(data)           # Information entropy of the whole data set
    max_information_gain, best_feature = 0.0, -1            # Initialize the maximum information gain and the corresponding feature index
    for index in range(feature_number):
        feat_list = [item[index] for item in data]
        feat_set = set(feat_list)
        new_entropy = 0.0
        for set_item in feat_set:                           # Conditional entropy of the split on this feature
            sub_data = handle_data(data, index, set_item)
            proportion = len(sub_data) / float(data.shape[0])   # Proportion of the subset
            new_entropy += proportion * calc_information_entropy(np.array(sub_data))
        temp_information_gain = base_entropy - new_entropy  # Information gain of this feature
        print("The information gain of feature %d is %.3f" % (index + 1, temp_information_gain))   # Output the information gain of each feature
        if temp_information_gain > max_information_gain:
            max_information_gain, best_feature = temp_information_gain, index
    return best_feature

if __name__ == "__main__":
    data = establish_data()
    print("The index of the feature with the maximum information gain is: %d" % calc_information_gain(data))
```

Run the above code, you can see the information gain of each attribute feature output and the feature index corresponding to the maximum information gain:

As we can see, the result is the same as our manual calculation, so we should choose the attribute with index 2 as the decision criterion, namely the house.

2.2 Information gain ratio (gain rate)

When we use information gain as the selection criterion, it tends to favor features with more possible values.

For example, if we took each sample's ID as a classification attribute of the above data set, calculation would show that its information gain is the largest; but in fact the ID has no classification ability at all, so the resulting decision tree would not generalize and could not effectively predict new samples.
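To see why, here is a worked sketch under the assumption that each of the 15 samples is given its own unique ID: splitting on ID puts every sample into its own perfectly pure branch, so the conditional entropy term vanishes and the information gain is as large as it can possibly be, even though such a split tells us nothing about new samples:

$$Gain(D, \text{ID}) = Ent(D) - \sum_{v=1}^{15}\frac{1}{15}\cdot 0 = 0.971$$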

To fix this problem with information gain, we introduce the concept of the information gain ratio (gain rate). The famous C4.5 algorithm uses the gain rate to choose the optimal partition attribute. The gain rate is defined as:


$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)} \tag{2-3}$$

where

$$IV(a) = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|} \tag{2-4}$$

$IV(a)$ is referred to as the intrinsic value of $a$. In general, the more possible values $a$ has, the larger $IV(a)$ tends to be. For example, age has three values: youth, middle age, and old age.
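As a quick illustration of this tendency (a worked check on the data above, again using a hypothetical 15-value ID attribute as the extreme case):

$$IV(\text{age}) = -3\times\frac{5}{15}\log_2\frac{5}{15} = \log_2 3 \approx 1.585, \qquad IV(\text{ID}) = -15\times\frac{1}{15}\log_2\frac{1}{15} = \log_2 15 \approx 3.907$$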

For the above loan data set, the credit attribute takes the values fair, good, and very good, with proportions $\frac{5}{15}$, $\frac{6}{15}$, and $\frac{4}{15}$ respectively. From this, we can calculate the intrinsic value of the credit attribute:


$$\begin{aligned} IV(\text{credit}) & = -\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|} \\ & = -\frac{5}{15}\log_2\frac{5}{15} - \frac{6}{15}\log_2\frac{6}{15} - \frac{4}{15}\log_2\frac{4}{15} \\ & = 1.566 \end{aligned}$$

Therefore, for the credit attribute, its gain rate is:


$$\begin{aligned} Gain\_ratio(D, \text{credit}) & = \frac{Gain(D, \text{credit})}{IV(\text{credit})} \\ & = \frac{0.363}{1.566} = 0.232 \end{aligned}$$

Similarly, we can calculate the gain rates of other attributes:


$$\begin{aligned} Gain\_ratio(D, \text{age}) & = 0.052 \\ Gain\_ratio(D, \text{job}) & = 0.352 \\ Gain\_ratio(D, \text{house}) & = 0.433 \end{aligned}$$
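For example, the age entry can be checked by hand with the values computed earlier:

$$Gain\_ratio(D, \text{age}) = \frac{Gain(D, \text{age})}{IV(\text{age})} = \frac{0.083}{1.585} \approx 0.052$$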

The specific code to calculate the gain rate can be referred to as follows:

"" Author: Taoye wechat 下 载 : Coder Explain: Calculate gain rate "" def calc_gain_ratio(data): Feature_number = data.shape[1] - 1 # Number of attribute features base_entropy = calc_information_entropy(data) # Calculate the entropy of information for index in the total data set range(feature_number): Feat_list = [item[index] for item in data] Feat_set = set(feat_list) new_entropy, iv = 0.0, 0.0 for set_item in Feat_set: Sub_data = handle_data(data, index, Proportion = len(sub_data)/float(data.shape[0]) # new_entropy += proportion * calc_information_entropy(np.array(sub_data)) iv += -proportion * np.log2(proportion) temp_information_gain = Base_entropy - new_entropy # calculate information gain gain_ratio = temp_information_gain/iv print(" the gain rate corresponding to the attribute attribute is %.3f, The information gain is %.3f" % (index + 1, GAIN_ratio, temp_information_gain)) # Output the information gain for each featureCopy the code

The running results are as follows:

In the C4.5 algorithm, attribute features are not selected directly by the gain rate alone. Instead, the attributes whose information gain is above average are first taken as candidates, and then the one with the highest gain rate is selected from among the candidates.

We will talk about the C4.5 algorithm later.

2.3 Gini Index

The Gini index is another indicator for selecting attributes.

As mentioned above, information entropy is one standard for describing the purity of data samples; the Gini value in the Gini index can also reflect the purity of data samples. The Gini value of $D$ can be expressed as $Gini(D)$ (where $k$ denotes the $k$-th label class):


$$Gini(D) = 1 - \sum_{k=1}^{|y|}p_k^2 \tag{2-5}$$

The smaller $Gini(D)$ is, the higher the purity of the data sample. The Gini index of attribute $a$, $Gini\_index(D, a)$, can then be defined as:


$$Gini\_index(D, a) = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v) \tag{2-6}$$

Let’s also separately calculate the Gini index of the above loan data set.

For the credit attribute feature, the manual calculation of the Gini index is as follows:


$$\begin{aligned} Gini\_index(D, \text{credit}) & = \sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v) \\ & = \frac{5}{15}\left(1-\left(\frac{4}{5}\right)^2-\left(\frac{1}{5}\right)^2\right) + \frac{6}{15}\left(1-\left(\frac{2}{6}\right)^2-\left(\frac{4}{6}\right)^2\right) + \frac{4}{15}\left(1-\left(\frac{4}{4}\right)^2-\left(\frac{0}{4}\right)^2\right) \\ & = 0.284 \end{aligned}$$

For the other attributes, the Gini index can be calculated by the reader using the same procedure (it is really simple). The code for calculating the Gini index is as follows:

"" Def calc_gini_index(data): Feature_number = data.shape[1] - 1 # Number of attribute features base_entropy = calc_information_entropy(data) # Calculate the entropy of information for index in the total data set range(feature_number): Feat_list = [item[index] for item in data] Feat_set = set(Feat_list) new_entropy, iv, gini_index = 0.0, 0.0, 0.0 for set_item in Feat_set: Sub_data = handle_data(data, index, set_item) temp_df = pd.DataFrame(sub_data) yes_proportion = temp_df[temp_df[temp_df.shape[1]-1] == "Y"].shape[0] / len(sub_data) gini_value = 1 - (yes_proportion ** 2) - ((1- yes_proportion) ** 2) proportion = len(sub_data) / Float (data.shape[0]) # calculate the proportion of the subset new_entropy += PROPORTION * calc_information_entropy(np.array(sub_data)) iv += - PROPORTION * Np. log2(PROPORTION) gini_index += gini_value * PROPORTION # Calculate the Gini index temp_information_gain = base_entropy - New_entropy # calculate gain_ratio = temp_information_gain/iv print(" the information gain corresponding to the attribute %d is %.3f, the gain rate is %.3f, Gini index is %.3F "% (index + 1, temp_information_gain, GAIN_ratio, gini_index)Copy the code

Running results:

The above covers the calculation of the Gini index and the related code. Generally speaking, the attribute with the smaller Gini index should be split on first.

The complete code for the above is as follows:

```python
import numpy as np
import pandas as pd

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Create the loan data set
"""
def establish_data():
    data = [[0, 0, 0, 0, 'N'],         # The last item indicates whether or not the loan is granted
            [0, 0, 0, 1, 'N'],
            [0, 1, 0, 1, 'Y'],
            [0, 1, 1, 0, 'Y'],
            [0, 0, 0, 0, 'N'],
            [1, 0, 0, 0, 'N'],
            [1, 0, 0, 1, 'N'],
            [1, 1, 1, 1, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [1, 0, 1, 2, 'Y'],
            [2, 0, 1, 2, 'Y'],
            [2, 0, 1, 1, 'Y'],
            [2, 1, 0, 1, 'Y'],
            [2, 1, 0, 2, 'Y'],
            [2, 0, 0, 0, 'N']]
    return np.array(data)

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Calculate information entropy
"""
def calc_information_entropy(data):
    data_number, _ = data.shape
    information_entropy = 0
    for item in pd.DataFrame(data).groupby(_ - 1):          # group by the label column
        proportion = item[1].shape[0] / data_number
        information_entropy += - proportion * np.log2(proportion)
    return information_entropy

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Extract the samples whose attribute at `axis` equals `value`
"""
def handle_data(data, axis, value):
    result_data = list()
    for item in data:
        if item[axis] == value: result_data.append(item)
    return result_data

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Calculate the information gain of each attribute feature and return the index of the best one
"""
def calc_information_gain(data):
    feature_number = data.shape[1] - 1                      # Number of attribute features
    base_entropy = calc_information_entropy(data)           # Information entropy of the whole data set
    max_information_gain, best_feature = 0.0, -1            # Initialize the maximum information gain and the corresponding feature index
    for index in range(feature_number):
        feat_list = [item[index] for item in data]
        feat_set = set(feat_list)
        new_entropy = 0.0
        for set_item in feat_set:                           # Conditional entropy of the split on this feature
            sub_data = handle_data(data, index, set_item)
            proportion = len(sub_data) / float(data.shape[0])   # Proportion of the subset
            new_entropy += proportion * calc_information_entropy(np.array(sub_data))
        temp_information_gain = base_entropy - new_entropy  # Information gain of this feature
        print("The information gain of feature %d is %.3f" % (index + 1, temp_information_gain))
        if temp_information_gain > max_information_gain:
            max_information_gain, best_feature = temp_information_gain, index
    return best_feature

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Calculate the gain rate
"""
def calc_gain_ratio(data):
    feature_number = data.shape[1] - 1                      # Number of attribute features
    base_entropy = calc_information_entropy(data)           # Information entropy of the whole data set
    for index in range(feature_number):
        feat_list = [item[index] for item in data]
        feat_set = set(feat_list)
        new_entropy, iv = 0.0, 0.0
        for set_item in feat_set:
            sub_data = handle_data(data, index, set_item)
            proportion = len(sub_data) / float(data.shape[0])   # Proportion of the subset
            new_entropy += proportion * calc_information_entropy(np.array(sub_data))
            iv += -proportion * np.log2(proportion)
        temp_information_gain = base_entropy - new_entropy  # Information gain
        gain_ratio = temp_information_gain / iv             # Gain rate
        print("The gain rate of feature %d is %.3f, and its information gain is %.3f" % (index + 1, gain_ratio, temp_information_gain))

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: Calculate the Gini index
"""
def calc_gini_index(data):
    feature_number = data.shape[1] - 1                      # Number of attribute features
    base_entropy = calc_information_entropy(data)           # Information entropy of the whole data set
    for index in range(feature_number):
        feat_list = [item[index] for item in data]
        feat_set = set(feat_list)
        new_entropy, iv, gini_index = 0.0, 0.0, 0.0
        for set_item in feat_set:
            sub_data = handle_data(data, index, set_item)
            temp_df = pd.DataFrame(sub_data)
            yes_proportion = temp_df[temp_df[temp_df.shape[1]-1] == "Y"].shape[0] / len(sub_data)
            gini_value = 1 - (yes_proportion ** 2) - ((1 - yes_proportion) ** 2)
            proportion = len(sub_data) / float(data.shape[0])   # Proportion of the subset
            new_entropy += proportion * calc_information_entropy(np.array(sub_data))
            iv += -proportion * np.log2(proportion)
            gini_index += gini_value * proportion           # Accumulate the Gini index
        temp_information_gain = base_entropy - new_entropy  # Information gain
        gain_ratio = temp_information_gain / iv             # Gain rate
        print("The information gain of feature %d is %.3f, the gain rate is %.3f, and the Gini index is %.3f" % (index + 1, temp_information_gain, gain_ratio, gini_index))

if __name__ == "__main__":
    data = establish_data()
    calc_gini_index(data)
```

There are three criteria for preferentially selecting attribute features: information gain, gain rate, and Gini index. A quick summary of the above:

  • The higher the information gain of an attribute feature, the more it should be preferred; this criterion is used in the **ID3** algorithm.
  • The higher the gain rate of an attribute feature, the more it should be preferred; this criterion is used in the **C4.5** algorithm.
  • The lower the Gini index of an attribute feature, in principle the more it should be preferred; this criterion is used in the **CART** algorithm.
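As an aside, the information entropy and Gini criteria are also available in off-the-shelf libraries. The snippet below is a minimal sketch, assuming scikit-learn is installed; it is not part of the hand-written implementation above, and note that scikit-learn's DecisionTreeClassifier builds CART-style binary trees whichever impurity criterion you pick:

```python
# A minimal sketch (assumes scikit-learn is installed); reuses the loan data set
# from the establish_data() function defined earlier in this article.
from sklearn.tree import DecisionTreeClassifier

data = establish_data()                              # 15 loan samples, last column is the 'Y'/'N' label
X, y = data[:, :-1].astype(int), data[:, -1]

for criterion in ("entropy", "gini"):                # information entropy vs Gini impurity
    clf = DecisionTreeClassifier(criterion=criterion).fit(X, y)
    print(criterion, clf.predict([[1, 0, 1, 2]]))    # predict one of the training samples (row 9, labelled 'Y')
```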

This article mainly discussed the theory of decision trees: what a decision tree is, and the three criteria for preferentially selecting attributes when generating one. Those who have studied SVM, or read Taoye's earlier SVM articles, will find that decision trees are much simpler than SVM and involve no complicated formula derivations. Other aspects of decision trees, such as generation, visualization, and pruning, will be covered in the next few articles.

I am Taoye. I love learning and sharing, and I am keen on all kinds of technology. Outside of study I like anime, chess, music, and chatting. I hope to record my growth and everyday life here, and to meet more like-minded friends along the way. You are welcome to visit my WeChat official account: Cynical Coder.

References:

[1] Machine Learning in Action: Peter Harrington, Posts and Telecommunications Press
[2] Statistical Learning Methods (2nd edition): Li Hang, Tsinghua University Press
[3] Machine Learning: Zhou Zhihua, Tsinghua University Press
