
1. A brief introduction to the decision tree algorithm

The idea behind decision trees is very simple: the conditional branch structure in programming is the if-else structure, and the earliest decision trees were classification learning methods that used exactly this kind of structure to partition data.

Decision tree: a tree structure in which each internal node represents a judgment on an attribute, each branch represents the output of a judgment result, and finally each leaf node represents a classification result. Essentially, it is a tree composed of multiple judgment nodes.

To quantify the splitting process of a decision tree, we need some knowledge from information theory: information entropy and information gain.

2. The principle of decision tree classification

2.1 Entropy

2.1.1 concept

Entropy: A measure of disorder

The more ordered the system, the lower the entropy; The more chaotic or scattered the system, the higher the entropy.

The information entropy:

  1. From the perspective of information completeness:

    When the degree of order in the system is the same, the more concentrated the data, the lower the entropy; the more dispersed the data, the higher the entropy.

  2. From the perspective of information order:

    When the amount of data is the same, the more ordered the system, the lower the entropy; the more chaotic or dispersed the system, the higher the entropy.

The “information entropy” is the most commonly used measure of the purity of a sample set.

Assume that in the current sample set $D$, the proportion of samples belonging to the $k$-th class is $p_k\ (k=1,2,\cdots,|y|)$.

$p_k=\frac{|C^k|}{|D|}$, where $|D|$ is the total number of samples and $|C^k|$ is the number of samples of class $k$.

Then the information entropy of D is defined as:


$$Ent(D)=-\sum_{k=1}^{n}\frac{|C^k|}{|D|}\log_2\frac{|C^k|}{|D|}=-\sum_{k=1}^{n}p_k\log_2 p_k=-p_1\log_2 p_1-p_2\log_2 p_2-\cdots-p_n\log_2 p_n$$

The smaller the value of Ent(D), the higher the purity of D.

2.1.2 case

Suppose we did not watch the World Cup but want to know which team is the champion. We can only ask yes/no questions, and the other person answers "yes" or "no"; we want to ask as few questions as possible. Using bisection: number the 16 teams, first ask whether the champion is among teams 1–8; if so, ask whether it is among teams 1–4, and so on, until the champion is determined. With 16 teams we need 4 questions to get the answer, so the entropy of the message "who is the world champion" is 4. How do we compute that this entropy equals 4? $Ent(D)=-(p_1\log_2 p_1+p_2\log_2 p_2+\cdots+p_{16}\log_2 p_{16})$, where $p_1,\cdots,p_{16}$ are the probabilities of each of the 16 teams winning the championship. When every team is equally likely to win, each probability is $1/16$ and $Ent(D)=-(16\times\frac{1}{16}\times\log_2\frac{1}{16})=4$. When all outcomes are equally probable, the entropy is highest and the event is the most uncertain.
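To make this concrete, here is a minimal sketch in Python; the `entropy` helper is illustrative, not from any library.

```python
import math

def entropy(probabilities):
    """Ent(D) = -sum(p_k * log2(p_k)), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 16 teams, each with an equal 1/16 chance of winning: the entropy is 4 bits,
# matching the 4 yes/no questions needed in the bisection guessing game above.
print(entropy([1 / 16] * 16))  # 4.0
```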

2.2 The classification basis of decision tree — information gain

2.2.1 concept

Information gain: The difference in entropy before and after dividing a data set by a feature. Entropy can represent the uncertainty of the sample set, and the greater the entropy, the greater the uncertainty of the sample. Therefore, the difference of set entropy before and after partition can be used to measure the partition effect of current features on sample set D.

Information gain = Entropy(before the split) − Entropy(after the split)

Note: information gain measures how much the uncertainty (information entropy) of class Y is reduced once the value of feature X is known.

Definition and formula:

Assume that discrete attribute $a$ has $V$ possible values $\{a^1, a^2, \cdots, a^V\}$.

If a is used to divide sample set D, V branch nodes will be generated

The $v$-th branch node contains all samples in $D$ whose value on attribute $a$ is $a^v$, denoted $D^v$. We can compute the information entropy of $D^v$ using the entropy formula given earlier; then, since different branch nodes contain different numbers of samples, each branch node is given the weight $\frac{|D^v|}{|D|}$.

Therefore, the “information gain” obtained by dividing sample set D with attribute A can be calculated.

The information gain $Gain(D,a)$ of feature $a$ on training set $D$ is defined as the difference between the information entropy $Ent(D)$ of set $D$ and the conditional entropy $Ent(D|a)$ of $D$ given feature $a$:

$$Gain(D,a)=Ent(D)-Ent(D|a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$

The formula is explained in detail:

Calculation of information entropy:

$$Ent(D)=-\sum_{k=1}^{n}\frac{|C^k|}{|D|}\log_2\frac{|C^k|}{|D|}$$

Calculation of conditional entropy:

$$Ent(D|a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\sum_{k=1}^{K}\frac{|C^{kv}|}{|D^v|}\log_2\frac{|C^{kv}|}{|D^v|}$$

Where:

$|D^v|$ is the number of samples in the $v$-th branch of attribute $a$.

$|C^{kv}|$ is the number of samples of the $k$-th class among the samples in the $v$-th branch of attribute $a$.

In general, the greater the information gain, the greater the “purity boost” obtained by partitioning using attribute A. Therefore, we can use information gain to select partition attributes of decision tree. The famous ID3 decision tree learning algorithm selects partition attributes based on information gain.

The "ID" in ID3 is short for Iterative Dichotomiser.

2.2.2 case

In the picture below, the first column is forum number, the second column is gender, the third column is activity, and the last column is user loss.

We need to solve the problem of which of the two characteristics has a greater impact on churn: gender or activity

This can be answered by computing the information gain of each feature from the statistics summarized in the table on the right.

Positive refers to the Positive sample (lost), Negative refers to the Negative sample (not lost), and the following values refer to the number of people under different divisions.

Three entropies can be obtained:

The overall entropy:

$$Ent(D)=-\frac{5}{15}\log_2\frac{5}{15}-\frac{10}{15}\log_2\frac{10}{15}=0.9182$$

Gender entropy:

$$Ent(D|a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)=\frac{|D^1|}{|D|}Ent(D^1)+\frac{|D^2|}{|D|}Ent(D^2)$$

$$Ent(D^1)=-\frac{3}{8}\log_2\frac{3}{8}-\frac{5}{8}\log_2\frac{5}{8}=0.9543$$

$$Ent(D^2)=-\frac{2}{7}\log_2\frac{2}{7}-\frac{5}{7}\log_2\frac{5}{7}=0.8631$$

Gender information gain:

$$Gain(D,a)=Ent(D)-Ent(D|a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$

$$=Ent(D)-\frac{|D^1|}{|D|}Ent(D^1)-\frac{|D^2|}{|D|}Ent(D^2)$$

$$=0.9182-\frac{8}{15}\times 0.9543-\frac{7}{15}\times 0.8631$$

$$=0.0064$$

Entropy of activity:

$$Ent(D^1)=-\frac{0}{6}\log_2\frac{0}{6}-\frac{6}{6}\log_2\frac{6}{6}=0$$

$$Ent(D^2)=-\frac{1}{5}\log_2\frac{1}{5}-\frac{4}{5}\log_2\frac{4}{5}=0.7219$$

$$Ent(D^3)=-\frac{4}{4}\log_2\frac{4}{4}-\frac{0}{4}\log_2\frac{0}{4}=0$$

Activity information gain:

$$Gain(D,a)=Ent(D)-Ent(D|a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$$

$$=Ent(D)-\frac{|D^1|}{|D|}Ent(D^1)-\frac{|D^2|}{|D|}Ent(D^2)-\frac{|D^3|}{|D|}Ent(D^3)$$

$$=0.9182-\frac{6}{15}\times 0-\frac{5}{15}\times 0.7219-\frac{4}{15}\times 0$$

$$=0.6776$$

The information gain of activity is larger than that of gender, that is to say, activity has a greater impact on user churn than gender.

When making feature selection or data analysis, we should focus on the activity index.
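As a quick numeric check of this case, the sketch below recomputes both gains from the per-branch (churned, stayed) counts used in the formulas above; the helper names are illustrative.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(class_counts, branches):
    """Gain(D, a) = Ent(D) - sum(|D^v|/|D| * Ent(D^v))."""
    total = sum(class_counts)
    conditional = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(class_counts) - conditional

overall = [5, 10]  # 5 churned, 10 stayed out of 15 users

# Gender splits the 15 users into a group of 8 (3 churned) and a group of 7 (2 churned);
# activity splits them into groups of 6 (0 churned), 5 (1 churned) and 4 (4 churned).
print(round(information_gain(overall, [[3, 5], [2, 5]]), 4))          # ~0.0065 (0.0064 above, which rounds intermediate values)
print(round(information_gain(overall, [[0, 6], [1, 4], [4, 0]]), 4))  # ~0.6777 (0.6776 above)
```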

2.3 Splitting the decision tree by the information gain ratio

2.3.1 concept

In fact, the information gain criterion has a preference for attributes with a large number of values. To reduce the possible adverse effects of this preference, the famous C4.5 decision tree algorithm [Quinlan, 1993] does not use information gain directly, but uses the "gain ratio" to select the optimal splitting attribute.

Gain ratio: the gain ratio is defined as the ratio of the information gain Gain(D, a) to the "intrinsic value" IV(a) of attribute a [Quinlan, 1993].


$$Gain\_ratio(D,a)=\frac{Gain(D,a)}{IV(a)}$$

Where:

$$IV(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$

The greater the number of possible values of attribute A (that is, the greater v), the greater the value of IV(a) will generally be.
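A minimal sketch of IV(a) and the gain ratio over per-branch class counts (the helper names are illustrative); it is checked against the weather case in the next subsection.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def intrinsic_value(branch_sizes):
    """IV(a) = -sum(|D^v|/|D| * log2(|D^v|/|D|))."""
    total = sum(branch_sizes)
    return -sum(s / total * math.log2(s / total) for s in branch_sizes if s > 0)

def gain_ratio(class_counts, branches):
    """branches: one class-count list per attribute value, e.g. [[2, 3], [4, 0], [2, 3]]."""
    total = sum(class_counts)
    conditional = sum(sum(b) / total * entropy(b) for b in branches)
    gain = entropy(class_counts) - conditional
    return gain / intrinsic_value([sum(b) for b in branches])

# Weather case from 2.3.2: 9 "proceed" vs 5 "cancel"; the weather attribute splits the
# samples 5/4/5, with per-branch class counts [2, 3], [4, 0] and [2, 3].
print(round(gain_ratio([9, 5], [[2, 3], [4, 0], [2, 3]]), 3))  # ~0.156
```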

2.3.2 Case: deciding whether an activity will take place based on the weather

As shown below, the first column is weather, the second column is temperature, the third column is humidity, the fourth column is wind speed, and the last column is whether the activity is going on.

We need to determine, from the table below, whether the activity will take place under the given weather conditions.

This data set has four attributes, attribute set A={weather, temperature, humidity, wind speed}, and two category labels, category set L={proceed, cancel}. The statistics are as follows:

A. Calculate category information entropy

Category information entropy represents the sum of uncertainties of all categories in all samples. According to the concept of entropy, the more entropy there is, the more uncertainty there is, and the more information it takes to figure things out.

$$Ent(D)=-\frac{9}{14}\log_2\frac{9}{14}-\frac{5}{14}\log_2\frac{5}{14}=0.940$$

B. Calculate the information entropy of each attribute

The information entropy of each attribute corresponds to a kind of conditional entropy. It represents the sum of uncertainties of various categories under the condition of a certain attribute. The larger the information entropy of the attribute is, the less “pure” the sample category of the attribute is.

a = "weather" (5 "sunny", 4 "cloudy", 5 "rainy"; 14 in total)


$$Ent(D|a)=\frac{5}{14}\times\left[-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}\right]+\frac{4}{14}\times\left[-\frac{4}{4}\log_2\frac{4}{4}\right]+\frac{5}{14}\times\left[-\frac{2}{5}\log_2\frac{2}{5}-\frac{3}{5}\log_2\frac{3}{5}\right]$$

$$=0.694$$

a = "temperature" (4 "cold", 6 "moderate", 4 "hot")


$$Ent(D|a)=\frac{4}{14}\times\left[-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}\right]+\frac{6}{14}\times\left[-\frac{4}{6}\log_2\frac{4}{6}-\frac{2}{6}\log_2\frac{2}{6}\right]+\frac{4}{14}\times\left[-\frac{3}{4}\log_2\frac{3}{4}-\frac{1}{4}\log_2\frac{1}{4}\right]$$

$$=0.911$$

a = "humidity" (7 "normal", 7 "high")


$$Ent(D|a)=0.789$$

a = "wind speed" (8 "weak", 6 "strong")


$$Ent(D|a)=0.892$$

C. Calculate information gain

Information gain = entropy – conditional entropy, in this case, category information entropy – attribute information entropy, which represents the degree of information uncertainty reduction. If the information gain of an attribute is larger, it means that using this attribute for sample division can better reduce the uncertainty of the sample after classification. Of course, selecting this attribute can accomplish our classification goal faster and better.

Information gain is the feature selection index of ID3 algorithm.

Gain(D, weather)= 0.940-0.694 = 0.246

Gain(D, temperature)= 0.940-0.911 = 0.029

Gain(D, humidity)= 0.940-0.789 = 0.151

Gain(D, wind speed)= 0.940-0.892 = 0.048

Suppose we add a "number" column (1–14) in front of the data in the figure. If "number" is also treated as a candidate attribute, then when computing each attribute's conditional entropy as in the previous steps, we find that its conditional entropy is 0, i.e. its information gain is 0.940, the largest possible. But a split on "number" obviously has no generalization ability, so in this situation information gain cannot select an effective classification feature. This is why C4.5 uses the information gain ratio to improve on ID3.

D. Calculate the attribute splitting information measure

The split information metric takes into account the number and size of the branches produced by an attribute; it is also called the intrinsic information of the attribute. Using information gain / intrinsic information (the gain ratio) makes an attribute less important as its intrinsic information grows (that is, if the attribute itself is highly uncertain, we are less inclined to select it), which compensates for the bias of using information gain alone.


$$IV(\text{weather})=-\frac{5}{14}\log_2\frac{5}{14}-\frac{5}{14}\log_2\frac{5}{14}-\frac{4}{14}\log_2\frac{4}{14}=1.577$$


$$IV(\text{temperature})=-\frac{4}{14}\log_2\frac{4}{14}-\frac{6}{14}\log_2\frac{6}{14}-\frac{4}{14}\log_2\frac{4}{14}=1.556$$


$$IV(\text{humidity})=-\frac{7}{14}\log_2\frac{7}{14}-\frac{7}{14}\log_2\frac{7}{14}=1.0$$


$$IV(\text{wind speed})=-\frac{8}{14}\log_2\frac{8}{14}-\frac{6}{14}\log_2\frac{6}{14}=0.985$$

E. Calculate the information gain ratio


$$Gain\_ratio(D,\text{weather})=\frac{Gain(D,\text{weather})}{IV(\text{weather})}=\frac{0.246}{1.577}=0.156$$


$$Gain\_ratio(D,\text{temperature})=\frac{Gain(D,\text{temperature})}{IV(\text{temperature})}=\frac{0.029}{1.556}=0.0186$$


$$Gain\_ratio(D,\text{humidity})=\frac{Gain(D,\text{humidity})}{IV(\text{humidity})}=\frac{0.151}{1.0}=0.151$$


$$Gain\_ratio(D,\text{wind speed})=\frac{Gain(D,\text{wind speed})}{IV(\text{wind speed})}=\frac{0.048}{0.985}=0.049$$

Weather has the highest information gain ratio and is selected as the splitting attribute. After splitting, the class is already "pure" on the branch where the weather is "cloudy", so that branch becomes a leaf node; the branches that are not "pure" are selected for further splitting.

Repeat steps A through E on the child nodes until all leaves are "pure".

Now let’s summarize the algorithm flow of C4.5

while the current node is not "pure":
    1. Calculate the class entropy of the current node (over the class values)
    2. Calculate the attribute entropy of the current node (over attribute values and class values)
    3. Calculate the split information measure of each attribute
    4. Calculate the information gain ratio of each attribute
    5. Split on the attribute with the highest information gain ratio
end while
Set the current node as a leaf node
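For illustration, here is a simplified sketch of this flow for purely categorical features, selecting the split by gain ratio; it omits continuous attributes, missing values and pruning, and the helper names (`entropy`, `gain_ratio`, `build_tree`) are ours, not from C4.5 itself.

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Gain(D, a) / IV(a) for one categorical attribute."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    iv = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - conditional
    return gain / iv if iv > 0 else 0.0

def build_tree(rows, labels, attrs):
    """Recursively split on the attribute with the highest gain ratio."""
    if len(set(labels)) == 1:                       # node is "pure" -> leaf
        return labels[0]
    if not attrs:                                   # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        subset = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        tree[best][value] = build_tree(sub_rows, sub_labels, [a for a in attrs if a != best])
    return tree
```

Each row is assumed to be a dict mapping attribute names to values, e.g. {'weather': 'sunny', 'wind speed': 'weak'}.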

2.3.3 Advantages of C4.5

  1. Use the information gain rate to select attributes

    It overcomes the deficiency that the information gain is used to select the attribute with more selection value.

  2. A method of post pruning was used

    Avoid uncontrolled growth in the height of the tree and avoid over-fitting the data

  3. Handling of missing values

    In some cases, the available data may be missing values for certain attributes. Suppose < x, c(x) > is A training instance in sample set S, but the value of its attribute A, A(x), is unknown.

    One strategy to deal with the missing attribute value is to assign the most common value of the attribute in the training instance corresponding to its node N;

    Another, more complicated strategy is to assign A probability to each possible value of A.

    For example, given A Boolean attribute A, if node N contains six instances where A=1 and four A=0 are known, then the probability that A(x)=1 is 0.6 and that A(x)=0 is 0.4. Thus, 60% of instance X is assigned to the A=1 branch and 40% to the other branch.

    C4.5 uses this approach to handle missing attribute values.
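A tiny sketch of the probabilistic strategy described above (the helper is hypothetical, not C4.5 source code): an instance with a missing value is sent down every branch with a fractional weight proportional to the known value counts at the node.

```python
def missing_value_weights(known_counts):
    """known_counts such as {1: 6, 0: 4} -> branch weights {1: 0.6, 0: 0.4}."""
    total = sum(known_counts.values())
    return {value: count / total for value, count in known_counts.items()}

print(missing_value_weights({1: 6, 0: 4}))  # {1: 0.6, 0: 0.4}
```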

2.4 The third splitting criterion: Gini value and Gini index

2.4.1 Concept

The CART decision tree uses the gini index to select partition attributes

CART is short for Classification and Regression Tree, a well-known decision Tree learning algorithm that is available for both Classification and Regression tasks

Gini(D) : The probability that two samples randomly selected from dataset D have inconsistent category markers. Therefore, the smaller Gini (D) value is, the higher the purity of data set D is.

The purity of dataset D can be measured by the Gini value:

$$Gini(D)=\sum_{k=1}^{|y|}\sum_{k'\ne k}p_kp_{k'}=1-\sum_{k=1}^{|y|}p_k^2$$

where $|y|$ is the number of classes and $p_k$ is the proportion of samples belonging to class $k$.

Gini index Gini_index(D, a): in general, the attribute whose split yields the smallest Gini index is chosen as the optimal splitting attribute.

$$Gini\_index(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$$
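A minimal sketch of both quantities over per-class counts (illustrative helpers, not a library API); it is checked against the "owns a house" split in the case below.

```python
def gini(class_counts):
    """Gini(D) = 1 - sum(p_k^2)."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def gini_index(branches):
    """Gini_index(D, a) = sum(|D^v|/|D| * Gini(D^v)); branches are per-branch class counts."""
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)

# "Owns a house" in the loan table below: 3 owners (0 defaults) vs 7 non-owners (3 defaults).
print(round(gini_index([[0, 3], [3, 4]]), 3))  # ~0.343
```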

2.4.2 case

Construct a decision tree based on the Gini index from the following table.

| No. | Owns a house | Marital status | Annual income | Defaults on loan |
|-----|--------------|----------------|---------------|------------------|
| 1   | yes          | single         | 125k          | no               |
| 2   | no           | married        | 100k          | no               |
| 3   | no           | single         | 70k           | no               |
| 4   | yes          | married        | 120k          | no               |
| 5   | no           | divorced       | 95k           | yes              |
| 6   | no           | married        | 60k           | no               |
| 7   | yes          | divorced       | 220k          | no               |
| 8   | no           | single         | 85k           | yes              |
| 9   | no           | married        | 75k           | no               |
| 10  | no           | single         | 90k           | yes              |

1. For convenience, we compute each attribute in turn; annual income is a continuous value and is handled later.

2. Compute the Gini index of each attribute other than the serial number and the class label (owns a house, marital status, annual income), and take the attribute with the smallest Gini index as the root node of the decision tree.

3. When splitting on whether the person owns a house, the Gini index is calculated as follows:


$$Gini(\text{owns a house})=1-\left(\frac{0}{3}\right)^2-\left(\frac{3}{3}\right)^2=0$$


$$Gini(\text{no house})=1-\left(\frac{3}{7}\right)^2-\left(\frac{4}{7}\right)^2=0.4898$$


$$Gini\_index(D,\text{owns a house})=\frac{7}{10}\times 0.4898+\frac{3}{10}\times 0=0.343$$

4. If we split on the marital status attribute, which has three possible values {married, single, divorced}, we compute the Gini index for each of the possible two-way groupings:

{married} | {single,divorced}

{single} | {married,divorced}

{divorced} | {single,married}

When grouped as {married} | {single, divorced}:


$$Gini\_index(D,\text{marital status})=\frac{4}{10}\times 0+\frac{6}{10}\times\left[1-\left(\frac{3}{6}\right)^2-\left(\frac{3}{6}\right)^2\right]=0.3$$

When grouped as {single} | {married, divorced}:


$$Gini\_index(D,\text{marital status})=\frac{4}{10}\times 0.5+\frac{6}{10}\times\left[1-\left(\frac{1}{6}\right)^2-\left(\frac{5}{6}\right)^2\right]=0.367$$

When grouped as {divorced} | {single, married}:


$$Gini\_index(D,\text{marital status})=\frac{2}{10}\times 0.5+\frac{8}{10}\times\left[1-\left(\frac{2}{8}\right)^2-\left(\frac{6}{8}\right)^2\right]=0.4$$
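A quick numeric check of steps 3 and 4, computed directly from the table above (True marks a default; the helper names are ours):

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(value) / n) ** 2 for value in set(labels))

def gini_index(groups):
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)

# (owns_house, marital_status, defaults_on_loan) for the 10 rows of the table
rows = [("yes", "single", False), ("no", "married", False), ("no", "single", False),
        ("yes", "married", False), ("no", "divorced", True), ("no", "married", False),
        ("yes", "divorced", False), ("no", "single", True), ("no", "married", False),
        ("no", "single", True)]

house = gini_index([[d for h, m, d in rows if h == "yes"],
                    [d for h, m, d in rows if h == "no"]])
print(round(house, 3))  # 0.343

for group in ("married", "single", "divorced"):
    split = gini_index([[d for h, m, d in rows if m == group],
                        [d for h, m, d in rows if m != group]])
    print(group, round(split, 3))  # married 0.3, single 0.367, divorced 0.4
```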

5. Similarly, the annual income Gini can be obtained:

Since the annual income attribute is numerical, the data must first be sorted in ascending order; then, from smallest to largest, the samples are split into two groups at the midpoint of each pair of adjacent values. For example, between the annual incomes 60 and 70 the midpoint is 65, and the Gini index is computed with 65 as the split point.

When the split point is 65: Gini_index(D, annual income) $=\frac{1}{10}\times\left[1-\left(\frac{1}{1}\right)^2-\left(\frac{0}{1}\right)^2\right]+\frac{9}{10}\times\left[1-\left(\frac{6}{9}\right)^2-\left(\frac{3}{9}\right)^2\right]=0.4$

When the split point is 75: Gini_index(D, annual income) $=\frac{2}{10}\times\left[1-\left(\frac{2}{2}\right)^2-\left(\frac{0}{2}\right)^2\right]+\frac{8}{10}\times\left[1-\left(\frac{5}{8}\right)^2-\left(\frac{3}{8}\right)^2\right]=0.375$

When the split point is 80: Gini_index(D, annual income) $=\frac{3}{10}\times\left[1-\left(\frac{3}{3}\right)^2-\left(\frac{0}{3}\right)^2\right]+\frac{7}{10}\times\left[1-\left(\frac{4}{7}\right)^2-\left(\frac{3}{7}\right)^2\right]=0.343$

When the split point is 87.5: Gini_index(D, annual income) $=\frac{4}{10}\times\left[1-\left(\frac{3}{4}\right)^2-\left(\frac{1}{4}\right)^2\right]+\frac{6}{10}\times\left[1-\left(\frac{4}{6}\right)^2-\left(\frac{2}{6}\right)^2\right]=0.417$, and so on.
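The scan over candidate split points in step 5 can be sketched as follows (True marks a default). Note that the text labels some split points by the next attribute value (75, 80) while the midpoints are 72.5 and 77.5; both give the same partitions and the same Gini values.

```python
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaults = [False, False, False, False, True, False, False, True, False, True]

def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(value) / n) ** 2 for value in set(labels))

pairs = sorted(zip(incomes, defaults))
n = len(pairs)
for (a, _), (b, _) in zip(pairs, pairs[1:]):
    split = (a + b) / 2                       # midpoint between adjacent sorted values
    left = [d for income, d in pairs if income <= split]
    right = [d for income, d in pairs if income > split]
    score = len(left) / n * gini(left) + len(right) / n * gini(right)
    print(split, round(score, 3))
# 65 -> 0.4, 72.5 -> 0.375, 77.5 -> 0.343, 87.5 -> 0.417, 92.5 -> 0.4, 97.5 -> 0.3, ...
# The minimum, 0.3 at 97.5, matches the best marital-status grouping from step 4.
```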

According to the calculation, two of the three attributes tie for the smallest Gini index at the root node: annual income (at its best split point) and marital status (grouped as {married} | {single, divorced}), both with an index of 0.3. Here marital status is chosen. On the {married} branch no one defaults on the loan, so that branch is pure; the {single, divorced} branch is split further, with its statistics shown below.

6. Next, use the same method to calculate the remaining attributes separately. For whether there is a house attribute, the following can be obtained:


$$Gini\_index(D,\text{owns a house})=\frac{2}{6}\times 0+\frac{4}{6}\times\left[1-\left(\frac{3}{4}\right)^2-\left(\frac{1}{4}\right)^2\right]=0.25$$

7. For annual income attributes, there are:

8. After splitting on whether the person owns a house, the branch of those who own a house contains no defaults and is pure; in the branch without a house, only the annual income attribute remains.

After the above process, the decision tree is constructed as follows:

Now let’s summarize the CART algorithm flow

while the current node is not "pure":
    1. Traverse every possible split of every variable and find the best split point
    2. Split the node into two child nodes N1 and N2
end while (every node is pure enough)

2.5 Summary

2.5.1 Two types of decision tree variables

  1. Numeric: the variable is an integer or a floating-point number, such as "annual income" in the previous example. Use ">=", ">", "<" or "<=" as the split condition (after sorting, previously computed splits can be reused to reduce the time complexity of the splitting algorithm).
  2. Nominal: similar to an enum in programming languages, the variable can only take one of a limited set of values, such as "marital status" in the previous example, which can only be "single", "married" or "divorced"; "=" is used as the split condition.

2.5.2 How to evaluate the quality of segmentation points

If a segmentation point can divide all the current nodes into two categories, so that each category is “pure”, that is, there are many records in the same category, then it is a good segmentation point.

For example, in the above example, “owning a house”, records can be divided into two categories, “yes” nodes can all repay debts, very “pure”; The “no” node is not very “pure”, but the difference between the sum of the purity of the two nodes and the purity of the original node is the largest, so it is divided in this way.

The greedy algorithm was used to construct the decision tree, and only the case with the largest purity difference was considered as the segmentation point.

2.5.3 Comparison of heuristic functions of common decision trees

Information entropy: $Ent(D)=-\sum_{k=1}^{n}p_k\log_2 p_k$

Information gain (ID3 decision tree): $Gain(D,a)=Ent(D)-Ent(D|a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$

Information gain ratio (C4.5 decision tree): $Gain\_ratio(D,a)=\frac{Gain(D,a)}{IV(a)}$

Gini value: $Gini(D)=\sum_{k=1}^{|y|}\sum_{k'\ne k}p_kp_{k'}=1-\sum_{k=1}^{|y|}p_k^2$

Gini index (CART decision tree): $Gini\_index(D,a)=\sum_{v=1}^{V}\frac{|D^v|}{|D|}Gini(D^v)$

| Name | Year proposed | Splitting criterion | Notes |
|------|---------------|---------------------|-------|
| CART | 1984 | Gini index | Supports both classification and regression; handles both discrete and continuous attributes |
| ID3  | 1975 | Information gain | Can only build decision trees for datasets with discrete attributes |
| C4.5 | 1993 | Information gain ratio | Improves on ID3, whose splitting always prefers attributes with more possible values |
  1. ID3 algorithm

    Existing disadvantages

    (1) ID3 algorithm adopts information gain as evaluation criterion when selecting branch attributes in root nodes and internal nodes. The disadvantage of information gain is that it tends to select attributes with a large number of values, which may not provide much valuable information in some cases.

    (2) ID3 algorithm can only construct decision tree for the data set whose description attribute is discrete attribute.

  2. C4.5 algorithm

    Improvements made (why C4.5 is better)

    (1) Use information gain rate to select attributes

    (2) It can handle continuous numerical attributes

    (3) A post pruning method was used

    (4) Processing of missing values

    Advantages and disadvantages of C4.5 algorithm

    Advantages:

    The classification rules are easy to understand and accurate.

    Disadvantages:

    In the process of tree construction, sequential scanning and sorting of data sets are needed for many times, which results in low efficiency of the algorithm.

    In addition, C4.5 is only suitable for data sets that can reside in memory, and programs cannot run when the training set is too large to fit in memory.

  3. CART algorithm

    Compared with C4.5 algorithm, CART algorithm adopts a simplified binary tree model, and feature selection adopts approximate Gini coefficient to simplify calculation.

    C4.5 is not necessarily a binary tree, but CART is definitely a binary tree.

    Moreover, whether it is ID3, C4.5 or CART, feature selection always picks a single optimal feature for the classification decision. In many cases, however, the decision should not be determined by a single feature but by a group of features; a tree built this way is called a multivariate decision tree, and it tends to be more accurate. When selecting the optimal feature, a multivariate decision tree does not choose one best feature but an optimal linear combination of features. A representative algorithm is OC1, which is not covered here.

    If the samples change even a little, the tree structure can change drastically. This can be mitigated with ensemble methods such as random forests.

3. CART pruning

3.1 Why pruning

  • The horizontal axis represents the total number of nodes in the process of creating the decision tree, and the vertical axis represents the prediction accuracy of the decision tree.

  • The solid line shows the precision of the decision tree on the training set, while the dotted line shows the precision measured on an independent test set.

  • As the tree grows, the accuracy on the training sample increases monotonically, whereas the accuracy measured on the independent test sample increases first and then decreases.

    Reasons for this:

    Cause 1: noise, sample conflict, that is, wrong sample data.

    Reason 2: Features, or attributes, cannot be completely used as classification criteria.

    Reason 3: coincidental regularities, because the amount of data is not large enough to rule them out.

3.2 Common pruning methods

3.2.1 Pre-pruning

  1. The minimum number of samples contained in each node, for example, 10. If the total number of samples of this node is less than 10, the node will not be divided.

  2. Specify the height or depth of the tree, for example, the maximum depth of the tree is 4.

  3. Specify a threshold on node entropy: if a node's entropy falls below the threshold, it is no longer split. (A minimal sketch of how these pre-pruning knobs look in scikit-learn follows below.)
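A minimal sketch of these pre-pruning knobs as they appear in scikit-learn; the parameter values are arbitrary examples, and min_impurity_decrease is the closest scikit-learn analogue of an entropy threshold, not an exact equivalent.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=10,        # 1. do not split a node with fewer than 10 samples
    max_depth=4,                 # 2. limit the depth of the tree
    min_impurity_decrease=0.01,  # 3. stop splitting when the impurity decrease is too small
)
```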

3.2.2 Post-pruning

Post-pruning prunes an already grown, over-fitted decision tree to obtain a simplified tree.

4. Feature engineering: feature extraction

4.1 Feature Extraction

4.1.1 Definition

Converts arbitrary data, such as text or images, into digital features that can be used for machine learning

Note: features are converted into numerical values so that computers can better understand the data

  • Feature extraction classification:
    • Dictionary feature extraction (Feature discretization)
    • Text feature extraction
    • Image feature extraction

4.1.2 Feature extraction API

sklearn.feature_extraction

4.2 Dictionary feature extraction

Purpose: convert dictionary data into numerical features

  • sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
    • DictVectorizer.fit_transform(X)
      • X: a dictionary or an iterator of dictionaries
      • Return value: a sparse matrix
    • DictVectorizer.get_feature_names(): returns the feature names

4.2.1 Application

We performed feature extraction on the following data

[{'city': 'Beijing', 'temperature': 100},
 {'city': 'Shanghai', 'temperature': 60},
 {'city': 'shenzhen', 'temperature': 30}]

4.2.2 Process analysis

  • Instantiate DictVectorizer
  • Call the fit_transform method, passing in the data to be transformed (note the return format)
from sklearn.feature_extraction import DictVectorizer


def dict_demo():
    data = [{'city': 'Beijing', 'temperature': 100},
            {'city': 'Shanghai', 'temperature': 60},
            {'city': 'shenzhen', 'temperature': 30}]

    # 1. Instantiate a converter class
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform
    data = transfer.fit_transform(data)
    print("Result returned:\n", data)
    # Print feature names
    print("Feature names:\n", transfer.get_feature_names())

    return None


dict_demo()

Note the result when the sparse=False parameter is not passed:

Result returned:
  (0, 1)    1.0
  (0, 3)    100.0
  (1, 0)    1.0
  (1, 3)    60.0
  (2, 2)    1.0
  (2, 3)    30.0
Feature names:
 ['city=Shanghai', 'city=Beijing', 'city=shenzhen', 'temperature']

This result is not what we want to see, so add parameters to get the desired result:

Result returned:
 [[   0.    1.    0.  100.]
  [   1.    0.    0.   60.]
  [   0.    0.    1.   30.]]
Feature names:
 ['city=Shanghai', 'city=Beijing', 'city=shenzhen', 'temperature']

A similar effect was achieved in learning discretization in PANDAS.

We call this data-processing technique “one-hot” coding.

4.3 Text feature extraction

Purpose: convert text data into numerical features

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
    • Returns a word-frequency matrix
    • CountVectorizer.fit_transform(X)
      • X: text or an iterable containing text strings
      • Return value: a sparse matrix
    • CountVectorizer.get_feature_names(): returns a list of words
  • sklearn.feature_extraction.text.TfidfVectorizer

4.3.1 Application

Feature extraction was performed on the following data

["life is short,i like python"."life is too long,i dislike python"]
Copy the code

4.3.2 Process analysis

  • Instantiate the class CountVectorizer
  • Call the fit_transform method, passing in the data to transform (note the return format; use toarray() to convert the sparse matrix)
from sklearn.feature_extraction.text import CountVectorizer

def text_count_demo():
    """Feature extraction of text with CountVectorizer :return: None"""

    data = ["life is short,i like like python", "life is too long,i dislike python"]
    # 1. Instantiate a converter class
    # transfer = CountVectorizer(sparse=False)  # note: CountVectorizer has no sparse parameter
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data = transfer.fit_transform(data)
    print("Result of text feature extraction:\n", data.toarray())
    print("Feature names:\n", transfer.get_feature_names())

    return None

Return result:

Result of text feature extraction:
 [[0 1 1 2 0 1 1 0]
  [1 1 1 0 1 1 0 1]]
Feature names:
 ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']

Q: What if we replace the data with Chinese?

"Life is short. I like Python."."Life is too long. I don't like Python."
Copy the code

So what you end up with is this

On closer inspection, English words are separated by spaces by default, which already gives the effect of word segmentation; Chinese text therefore needs to be segmented into words first.

4.3.3 Jieba word processing

  • jieba.cut()
    • Returns a generator made up of words

Jieba library needs to be installed

pip install jieba

4.3.4 Case Analysis

Extract features from the following three sentences:

1. Today is cruel, tomorrow is crueler, the day after tomorrow is beautiful, but absolutely most people die tomorrow night, so don't give up on today.
2. The light we see coming from distant galaxies was emitted millions of years ago, so that when we look at the universe, we are looking into its past.
3. If you only know something one way, you don't really know it. The secret to knowing what things really mean depends on how we relate them to what we already know.
  • Analysis
    • 1. Segment the sentences with jieba.cut
    • 2. Instantiate CountVectorizer
    • 3. Pass the segmented strings to fit_transform as the input value

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    """Segment Chinese text into space-separated words using jieba.
    :param text:
    :return: text
    """
    # Use jieba to segment the string
    text = " ".join(list(jieba.cut(text)))

    return text

def text_chinese_count_demo2():
    """Feature extraction of Chinese text, after word segmentation
    :return: None
    """
    data = ["Today is cruel, tomorrow is crueler, the day after tomorrow is beautiful, but absolutely most people die tomorrow night, so don't give up on today.",
            "The light we see coming from distant galaxies was emitted millions of years ago, so that when we look at the universe, we are looking into its past.",
            "If you only know something one way, you don't really know it. The secret to knowing what things really mean depends on how we relate them to what we already know."]
    # Convert the raw sentences into segmented strings
    text_list = []
    for sent in data:
        text_list.append(cut_word(sent))
    print(text_list)

    # 1. Instantiate a converter class
    # transfer = CountVectorizer(sparse=False)
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data = transfer.fit_transform(text_list)
    print("Result of text feature extraction:\n", data.toarray())
    print("Feature names:\n", transfer.get_feature_names())

    return None

Return result:

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/mz/tzf2l3sx4rgg6qpglfb035_r0000gn/T/jieba.cache
Loading model cost 1.032 seconds.
["Today is cruel, tomorrow is crueler, the day after tomorrow is beautiful, but absolutely most people die tomorrow night, so don't give up on today.", 'The light we see coming from distant galaxies was emitted millions of years ago, so that when we look at the universe, we are looking into its past.', "If you only know something one way, you don't really know it. The secret to knowing what things really mean depends on how we relate them to what we already know."]
Prefix dict has been built succesfully.
Result of text feature extraction:
 [[2 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 1 0]
 [0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 1]
 [1 1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0 0]]
Feature names:
 ['a', 'not', "don't", 'before', 'know', 'things', 'today', 'The light is in', 'Millions of years', 'a', 'depends on', 'only', 'the day after tomorrow', 'meaning', 'Most', 'how to', '如果', 'the universe', '我们', '所以', 'give up', 'way', 'tomorrow', 'the galaxies', 'the night', 'something', 'Cruel', 'each', 'see', 'real', 'secret', 'absolute', 'good', 'contact', 'the past', '还是', '这样']

But what goes wrong when such word features are used for classification?

How should we deal with a word or phrase that appears with high frequency in many documents?

4.3.5 TF-IDF text feature extraction

  • The main idea of TF-IDF is: if a certain word or phrase has a high probability of occurrence in one article and rarely appears in other articles, it is considered that the word or phrase has good classification ability and is suitable for classification.
  • Tf-idf function: It is used to evaluate the importance of a word to one of the documents in a document set or a corpus.
  1. The formula

    • Term frequency (TF) refers to the frequency with which a given word appears in the document

    • Inverse Document Frequency (IDF) is a measure of the universal importance of a word. The IDF of a particular term can be obtained by dividing the total number of files by the number of files containing the term and taking the logarithm base 10 of the resulting quotient

The final result can be understood as importance:

For example, suppose a document contains 100 words in total and the word "very" appears 5 times; then the term frequency of "very" in this document is 5/100 = 0.05. The document frequency is computed by dividing the total number of documents by the number of documents that contain "very": if "very" appears in 10,000 documents and there are 10,000,000 documents in total, the inverse document frequency is lg(10,000,000 / 10,000) = 3. The final TF-IDF score of "very" for this document is 0.05 × 3 = 0.15.
  2. Case

    from sklearn.feature_extraction.text import TfidfVectorizer
    import jieba

    def cut_word(text):
        """Segment Chinese text into space-separated words using jieba.
        :param text:
        :return: text
        """
        # Use jieba to segment the string
        text = " ".join(list(jieba.cut(text)))

        return text

    def text_chinese_tfidf_demo():
        """TF-IDF feature extraction of Chinese text, after word segmentation
        :return: None
        """
        data = ["Today is cruel, tomorrow is crueler, the day after tomorrow is beautiful, but absolutely most people die tomorrow night, so don't give up on today.",
                "The light we see coming from distant galaxies was emitted millions of years ago, so that when we look at the universe, we are looking into its past.",
                "If you only know something one way, you don't really know it. The secret to knowing what things really mean depends on how we relate them to what we already know."]
        # Convert the raw sentences into segmented strings
        text_list = []
        for sent in data:
            text_list.append(cut_word(sent))
        print(text_list)

        # 1. Instantiate a converter class
        # transfer = CountVectorizer(sparse=False)
        transfer = TfidfVectorizer(stop_words=['a', 'not', "don't"])
        # 2. Call fit_transform
        data = transfer.fit_transform(text_list)
        print("Result of text feature extraction:\n", data.toarray())
        print("Feature names:\n", transfer.get_feature_names())

        return None

    Return result:

    Building prefix dict from the default dictionary ...
    Loading model from cache /var/folders/mz/tzf2l3sx4rgg6qpglfb035_r0000gn/T/jieba.cache
    Loading model cost 0.856 seconds.
    Prefix dict has been built succesfully.
    ["Today is cruel, tomorrow is crueler, the day after tomorrow is beautiful, but absolutely most people die tomorrow night, so don't give up on today.", 'The light we see coming from distant galaxies was emitted millions of years ago, so that when we look at the universe, we are looking into its past.', "If you only know something one way, you don't really know it. The secret to knowing what things really mean depends on how we relate them to what we already know."]
    Result of text feature extraction:
    [[ 0.          0.          0.          0.43643578  0.          0.          0.
       0.          0.          0.21821789  0.          0.21821789  0.          0.
       0.          0.          0.21821789  0.21821789  0.          0.43643578
       0.          0.21821789  0.          0.43643578  0.21821789  0.          0.
       0.          0.21821789  0.21821789  0.          0.          0.21821789
       0.        ]
     [ 0.2410822   0.          0.          0.          0.2410822   0.2410822
       0.2410822   0.          0.          0.          0.          0.          0.
       0.          0.2410822   0.55004769  0.          0.          0.          0.
       0.2410822   0.          0.          0.          0.          0.48216441
       0.          0.          0.          0.          0.          0.2410822
       0.          0.2410822 ]
     [ 0.          0.644003    0.48300225  0.          0.          0.          0.
       0.16100075  0.16100075  0.          0.16100075  0.          0.16100075
       0.16100075  0.          0.12244522  0.          0.          0.16100075
       0.          0.          0.          0.16100075  0.          0.          0.
       0.3220015   0.16100075  0.          0.          0.16100075  0.          0.
       0.        ]]
    Feature names:
     ['before', 'know', 'things', 'today', 'The light is in', 'Millions of years', 'a', 'depends on', 'only', 'the day after tomorrow', 'meaning', 'Most', 'how to', '如果', 'the universe', '我们', '所以', 'give up', 'way', 'tomorrow', 'the galaxies', 'the night', 'something', 'Cruel', 'each', 'see', 'real', 'secret', 'absolute', 'good', 'contact', 'the past', '还是', '这样']

5. Decision tree algorithm API

  • class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
    • criterion
      • Feature selection criteria
      • “Gini” or “entropy”, where the former stands for gini coefficient and the latter for information gain. Default “gini”, i.e. CART algorithm.
    • min_samples_split
      • Minimum number of samples required for internal node subdivision
      • This value limits the conditions under which subtrees can continue to be divided. If the number of samples of a node is less than min_samples_split, no further attempts will be made to select the optimal feature for partitioning. The default is 2. If the sample size is small, ignore this value. If the sample size is very large, it is recommended to increase this value. In one of my previous project examples, which had about 100,000 samples, I chose min_samples_split=10 to create a decision tree. Can serve as a reference.
    • min_samples_leaf
      • Minimum number of samples required at a leaf node
      • This value limits the minimum number of samples at a leaf node. If a leaf would contain fewer samples than this, it is pruned together with its sibling nodes. The default is 1; you can pass an integer for the minimum count or a float for the minimum fraction of the total sample size. If the sample size is small, you don't need to worry about this value; if it is very large, it is recommended to increase it. The 100,000-sample project mentioned above used min_samples_leaf=5, for reference only.
    • max_depth
      • Maximum depth of decision tree
      • The maximum depth of the decision tree is optional by default. If this parameter is not specified, the decision tree does not limit the depth of the subtree. In general, you can leave this value alone when there is little data or few features. If the model sample size is large and features are large, it is recommended to limit the maximum depth. The specific value depends on the distribution of data. The value ranges from 10 to 100
    • random_state
      • Random number seed

6. Case: Titanic passenger survival prediction

6.1 Titanic Data

The titanic and titanic2 data frames describe the survival status of individual passengers on the Titanic. The dataset used here was compiled by various researchers; it includes passenger lists created by many researchers and edited by Michael A. Findlay. The features we extract are ticket class, survival, passenger class, age, port of embarkation, home.dest, cabin, ticket, boat and sex.

Data: biostat.mc.vanderbilt.edu/wiki/pub/Ma…

According to the observed data:

  • 1. pclass is the passenger class (1, 2, 3) and serves as a proxy for socioeconomic status.
  • 2. The age column has missing values.

6.2 Procedure Analysis

  • To get the data
  • Basic data processing
    • Determine the eigenvalue and target value
    • Missing value handling
    • Data set partitioning
  • Feature engineering: dictionary feature extraction
  • Machine learning: Decision trees
  • Model to evaluate

6.3 code

  • The import module

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_graphviz
  • To get the data

    # Fetch data
    titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")
  • Basic data processing

    • Determine the target value and eigenvalue

      # Determine the eigenvalue and target value
      x = titan[["pclass"."age"."sex"]]
      y = titan["survived"]
      Copy the code
    • Missing value handling

      # Missing values need to be processed, and these features with categories among the features are extracted by dictionary features
      x['age'].fillna(x['age'].mean(), inplace=True)
    • Data set partitioning

      # Data set partitioning
      x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
  • Feature engineering: dictionary feature extraction

    DictVectorizer One-hot encoding is required when category symbols appear in the features.

    x.to_dict(orient="records") is needed to convert the DataFrame rows into dictionary data.

    # Feature engineering: dictionary feature extraction
    # Convert x to dictionary data: x.to_dict(orient="records")
    # [{"pclass": "1st", "age": 29.00, "sex": "female"}, {...}]
    transfer = DictVectorizer(sparse=False)
    x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
    x_test = transfer.transform(x_test.to_dict(orient="records"))  # use transform (not fit_transform) on the test set
  • Decision tree model training and evaluation

    In the decision tree API, if max_depth is not specified, the tree keeps splitting on the chosen criterion until it can split no further. Here we limit the size of the tree by specifying its depth.

    # Machine learning: Decision trees
    estimator = DecisionTreeClassifier(criterion="entropy", max_depth=5)
    estimator.fit(x_train, y_train)
    
    # Model evaluation
    estimator.score(x_test, y_test)
    estimator.predict(x_test)

6.4 Visualization of decision tree

6.4.1 Save the tree structure to the DOT file

  • The sklearn.tree.export_graphviz() function exports the DOT format
    • tree.export_graphviz(estimator, out_file='tree.dot', feature_names=['', ''])
export_graphviz(estimator, out_file="./data/tree.dot", feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'women', 'male'])

The content of the DOT file is as follows:

digraph Tree {
node [shape=box] ;
0 [label="Male <= 0.5\nentropy = 0.919\nsamples = 984\nvalue = [655, 329]"];1 [label="pclass=3rd <= 0.5\nentropy = 0.912\nsamples = 336\nvalue = [110, 226]"];0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"];2 [label="age <= 38.5\nentropy = 0.346\nsamples = 185\nvalue = [12, 173]"];1 -> 2 ;
3 [label="age <= 17.5\nentropy = 0.399\nsamples = 139\nvalue = [11, 128]"];2 -> 3 ;
4 [label="entropy = 0.0\nsamples = 15\nvalue = [0, 15]"];3 -> 4 ;
5 [label="age <= 37.5\nentropy = 0.432\nsamples = 124\nvalue = [11, 113]"];3 -> 5 ;
6 [label="entropy = 0.409\nsamples = 122\nvalue = [10, 112]"];5 -> 6 ;
7 [label="entropy = 1.0\nsamples = 2\nvalue = [1, 1]"];5 -> 7 ;
8 [label="age <= 49.5\nentropy = 0.151\nsamples = 46\nvalue = [1, 45]"];2 -> 8 ;
9 [label="entropy = 0.0\nsamples = 26\nvalue = [0, 26]"];8 -> 9 ;
10 [label="age <= 50.5\nentropy = 0.286\nsamples = 20\nvalue = [1, 19]"];8 -> 10 ;
11 [label="entropy = 0.722\nsamples = 5\nvalue = [1, 4]"];10 -> 11 ;
12 [label="entropy = 0.0\nsamples = 15\nvalue = [0, 15]"];10 -> 12 ;
13 [label="age <= 38.5\nentropy = 0.935\nsamples = 151\nvalue = [98, 53]"];1 -> 13 ;
14 [label="age <= 32.097\nentropy = 0.945\nsamples = 146\nvalue = [93, 53]"];13 -> 14 ;
15 [label="age <= 19.5\nentropy = 0.93\nsamples = 139\nvalue = [91, 48]"];14 -> 15 ;
16 [label="entropy = 0.998\nsamples = 21\nvalue = [10, 11]"];15 -> 16 ;
17 [label="entropy = 0.897\nsamples = 118\nvalue = [81, 37]"];15 -> 17 ;
18 [label="age <= 36.5\nentropy = 0.863\nsamples = 7\nvalue = [2, 5]"];14 -> 18 ;
19 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 4]"];18 -> 19 ;
20 [label="entropy = 0.918\nsamples = 3\nvalue = [2, 1]"];18 -> 20 ;
21 [label="entropy = 0.0\nsamples = 5\nvalue = [5, 0]"];13 -> 21 ;
22 [label="age <= 12.5\nentropy = 0.632\nsamples = 648\nvalue = [545, 103]"];0 -> 22 [labeldistance=2.5, labelangle=-45, headlabel="False"];23 [label="Pclass =3rd <= 0.5\nentropy = 0.702\nsamples = 21\nvalue = [4, 17]"];22 -> 23 ;
24 [label="entropy = 0.0\nsamples = 13\nvalue = [0, 13]"];23 -> 24 ;
25 [label="age <= 3.5\nentropy = 1.0\nsamples = 8\nvalue = [4, 4]"];23 -> 25 ;
26 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 2]"];25 -> 26 ;
27 [label="age <= 7.5\nentropy = 0.918\nsamples = 6\nvalue = [4, 2]"];25 -> 27 ;
28 [label="entropy = 0.0\nsamples = 2\nvalue = [2, 0]"];27 -> 28 ;
29 [label="entropy = 1.0\nsamples = 4\nvalue = [2, 2]"];27 -> 29 ;
30 [label="Pclass =1st <= 0.5\nentropy = 0.577\nsamples = 627\nvalue = [541, 86]"];22 -> 30 ;
31 [label="age <= 18.5\nentropy = 0.462\nsamples = 501\nvalue = [452, 49]"];30 -> 31 ;
32 [label="entropy = 0.0\nsamples = 17\nvalue = [17, 0]"];31 -> 32 ;
33 [label="age <= 45.5\nentropy = 0.473\nsamples = 484\nvalue = [435, 49]"];31 -> 33 ;
34 [label="entropy = 0.484\nsamples = 468\nvalue = [419, 49]"];33 -> 34 ;
35 [label="entropy = 0.0\nsamples = 16\nvalue = [16, 0]"];33 -> 35 ;
36 [label="age <= 60.5\nentropy = 0.873\nsamples = 126\nvalue = [89, 37]"];30 -> 36 ;
37 [label="age <= 28.5\nentropy = 0.9\nsamples = 117\nvalue = [80, 37]"];36 -> 37 ;
38 [label="entropy = 1.0\nsamples = 14\nvalue = [7, 7]"];37 -> 38 ;
39 [label="entropy = 0.87\nsamples = 103\nvalue = [73, 30]"];37 -> 39 ;
40 [label="entropy = 0.0\nsamples = 9\nvalue = [9, 0]"];36 -> 40 ;
}

6.4.2 Displaying decision tree structure on the website

webgraphviz.com

6.5 Decision tree summary

  • Advantages:
    • Simple understanding and interpretation, tree visualization.
  • Disadvantages:
    • Decision tree learners can create overly complex trees that do not generalize data well and are prone to overfitting.
  • Improvement:
    • Pruning CART algorithm
    • Random forest (a type of integrated learning)

Note: for important business decisions, decision trees are widely used in the decision-making process because of their strong interpretability, and they can also be used to select important features.