0x00 Abstract

In this article, an example is used to explain the concept of entropy and the ID3 decision-tree algorithm.

0x01 IT concepts

1. Information and information entropy

1.1 The information of things (the more information, the greater the certainty)

Information changes your level of uncertainty and curiosity. The more information you have about something, the more you know about it and the less curious you are, because you are more certain about it. If you judge that an event will happen with 100% probability, then the event carries no information for you: since the outcome is already determined, there is nothing left to learn. Conversely, if you are unsure about something, you need to find out about it in various ways, and what you learn is meaningful and informative.

1.2 the information entropy

To abstract the above model, the clever Shannon summed it up in the concept of information entropy. Information entropy indicates the degree of uncertainty of an event: the more uncertain the event, the stronger your curiosity, and the higher its information entropy. Intuitively, the outcome of a close game is far more uncertain than that of a game whose result has already been seen.

In other words, the amount of information is inversely related to the probability of the event. For example, the sun rising in the east has a very high probability, so it carries very little information. If instead the sun rose in the west, the probability would be tiny and the amount of information enormous (the world would have changed dramatically). Therefore, information entropy is often used as a quantitative index of the information content of a system, and can further serve as an optimization objective or a criterion for parameter selection. In other words, information entropy measures the amount of information an event with multiple possible states may carry; it can be viewed as the expected value of the information content over the event's probability distribution:

```
H(X) = -Σ p_i * log_b(p_i)
```

Event X has n states, and i denotes the i-th state. The base b is usually set to 2, but 10 or e can also be used. H(X) is the amount of information required to eliminate the uncertainty of this event, i.e. the information entropy.

Note also that when an event is certain to happen, its probability is 1, so it carries zero information and its information entropy is zero.
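Plugging a few simple distributions into the formula makes this behaviour concrete (a worked example added here for illustration, not from the original text):

```latex
\begin{aligned}
\text{certain event } (p = 1):\qquad & H = -1 \cdot \log_2 1 = 0\\
\text{fair coin } (p = \tfrac12, \tfrac12):\qquad & H = -\tfrac12\log_2\tfrac12 - \tfrac12\log_2\tfrac12 = 1 \text{ bit}\\
\text{biased coin } (p = 0.9,\, 0.1):\qquad & H = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.469 \text{ bits}
\end{aligned}
```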

2. Why is information entropy defined as -Σ p*log(p)?

Here are several reliable and easy-to-understand explanations found online:

2.1 Explanation 1

With binary bits, 1 bit has 2 states, 2 bits have 4 states, and N bits have 2^N possible states, so the number of possibilities a message can represent grows exponentially with its length. Taking a logarithm turns that exponential growth back into a linear quantity; whether the base is e or 2 does not matter, since it is only a scaling factor. One piece of information contributes one log term, and N pieces of information contribute N of them. Finally, entropy measures disorder, so a minus sign is added, which also matches the physical intuition.

2.2 Explanation 2

Information entropy can be read as a coding length: the longer the code, the more information, and the greater the information entropy. If an outcome is encoded with n digits drawn from an alphabet of size a, there are a^n equally likely outcomes, so each one has probability p = 1/a^n. Inverting this gives the coding length, which is the information content: n = -log_a(p).
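In symbols, the coding-length argument is the following sketch; averaging the code length over all outcomes recovers the entropy formula from section 1.2:

```latex
p = \frac{1}{a^{\,n}} = a^{-n}
\;\Longrightarrow\;
n = -\log_a p,
\qquad
H = \mathbb{E}[n] = -\sum_i p_i \log_a p_i .
```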

2.3 Explanation 3

First, the amount of information should depend on the probability distribution, so entropy should be defined as a monotone function of the probability. Second, if two random variables are independent, the amount of information obtained by observing them separately should equal the amount obtained by observing them together, i.e. h(x, y) = h(x) + h(y). Third, the amount of information should be non-negative, and fourth, it should depend continuously on changes in the probability. It can be shown that the only family of functions satisfying these conditions is -log(x), up to multiplication by a constant.

2.4 Explanation 4

From the example above of the sun rising in the east versus the west, it is easy to see that the amount of information decreases as the probability of occurrence increases, and that it cannot be negative. In addition, if A and B are two unrelated (independent) events, the information gained from observing both of them should equal the sum of the information from each: h(A, B) = h(A) + h(B). For independent events, p(A, B) = p(A) * p(B). Combined with the requirement that entropy be a monotone function of probability, it follows that h should be a logarithmic function of the probability p(x), so the information content of an outcome can be defined as h(x) = -log_2 p(x). The minus sign simply ensures that the value is non-negative, and the base of the logarithm could be any number; by convention we use base 2. To measure the average amount of information carried by the whole random variable X, the natural average to use is the expectation.
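As a quick check (added for illustration), h(x) = -log_2 p(x) does satisfy the additivity requirement, and taking its expectation yields exactly the entropy formula from section 1.2:

```latex
\begin{aligned}
h(A, B) &= -\log_2 p(A, B) = -\log_2\bigl(p(A)\,p(B)\bigr) = -\log_2 p(A) - \log_2 p(B) = h(A) + h(B)\\[4pt]
H(X) &= \mathbb{E}\bigl[h(X)\bigr] = -\sum_{i=1}^{n} p_i \log_2 p_i
\end{aligned}
```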

2.5 Explanation 5

Suppose there are 32 teams in the World Cup; some teams are stronger and more likely to win the championship, while others have a lower chance. Then H = -(p1*log(p1) + p2*log(p2) + ... + p32*log(p32)), where p1, ..., p32 are the probabilities of each of the 32 teams winning the championship. When every team has an equal probability of winning, namely 1/32, H = -(32 * 1/32 * log(1/32)) = 5. By the maximum entropy principle, entropy is highest when every outcome is equally likely, which is exactly when the event is most uncertain. So far we have only said how to calculate it; why is the total amount of information the sum of all the -p*log(p) terms? Wikipedia makes it clear: -log(p) is the amount of information of one possible outcome, and the total amount of information of an event is the information of each possible outcome weighted by its probability, which is precisely the mathematical expectation of the information content.
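As a sanity check (a minimal sketch, not from the original article), 32 equally likely teams give an entropy of 5 bits, and any uneven distribution gives less:

```python
import math

def entropy(probs, base=2):
    # Shannon entropy of a discrete distribution; zero-probability terms contribute nothing.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

uniform = [1 / 32] * 32                  # every team equally likely to win
skewed = [0.5, 0.25] + [0.25 / 30] * 30  # two favourites, thirty long shots

print(entropy(uniform))  # 5.0 (up to floating-point rounding)
print(entropy(skewed))   # about 2.73 -- less uncertainty than the uniform case
```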

3. Entropy in the decision-tree ID3 algorithm

3.1 Entropy

Entropy characterizes the purity of an arbitrary sample set: it quantifies how mixed the classes in the set are, and the purer the set, the lower the entropy.

Suppose we have a set S. If all members of S belong to the same class, then Entropy(S) = 0; if S contains equal numbers of positive and negative samples, then Entropy(S) = 1; if the numbers of positive and negative samples differ, then Entropy(S) lies between 0 and 1.

That is: when all members belong to the same class, the entropy of the set is 0, and the value 0 indicates that the set contains no impurity. When half the members are positive and half are negative, entropy reaches its maximum value of 1. In binary classification, the entropy of a set therefore ranges from 0 to 1.
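A small illustration of these three cases, using a hypothetical binary_entropy helper (not from the original text):

```python
import math

def binary_entropy(p_positive):
    # Entropy of a two-class set whose proportion of positive samples is p_positive.
    return -sum(p * math.log(p, 2) for p in (p_positive, 1 - p_positive) if p > 0)

print(binary_entropy(1.0))  # all members in one class -> 0.0
print(binary_entropy(0.5))  # half positive, half negative -> 1.0
print(binary_entropy(0.8))  # unequal mix -> about 0.72, strictly between 0 and 1
```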

3.2 The measurement standard of information gain: entropy

The information gain of an attribute is the expected reduction in entropy obtained by partitioning the samples on that attribute (that is, how much the entropy is expected to drop when the samples are split according to the attribute). In other words, it measures how much more ordered the data set becomes.
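Written as a formula (the standard definition used by ID3, where S_v denotes the subset of S whose value of attribute A is v):

```latex
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{\lvert S_v \rvert}{\lvert S \rvert}\, Entropy(S_v)
```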

3.3 The soul of the decision tree

Whether the tree is used for classification or regression, it is split according to some indicator, and each split is expected to leave the resulting subsets as pure as possible.

3.4 The core idea of the ID3 algorithm

The core idea of the ID3 algorithm is to use information gain as the criterion for attribute selection: at each split, choose the attribute that yields the largest information gain. ID3 applies information gain greedily when partitioning the data; it always selects the feature with the largest information gain, so each split makes the data more ordered. Equivalently, the attribute that achieves the largest expected reduction in entropy ends up closer to the root of the tree.

```
Gain (information gain of a split) = entropy now - entropy if the data are split by this feature
                                   = the reduction in entropy
```

4. The inference

```
The greater the information gain (the larger the entropy reduction), the better
----> the smaller the entropy of the subtrees, the better
----> the more lopsided the class distribution within each subtree, the better
      (more lopsided means higher purity and a better split; a more even distribution means
       a sample could belong to many different classes, so the entropy is larger)
----> the more strongly this feature distinguishes the objects, the better
```

For example, if you use chromosomes to distinguish men from women and split people into two groups, the "female" group will contain only women and the "male" group only men; that split works best. If instead you judge men and women by whether they wear skirts, both groups will contain men and women, so that is not a good criterion.

5. How to calculate entropy?

```python
import math
from collections import Counter

def calc_shannon_ent(dataset):
    # The last element of each record is its class label.
    counts = Counter(record[-1] for record in dataset)
    entropy = 0.0
    for count in counts.values():
        p = count / len(dataset)       # proportion of this class in all data
        entropy -= p * math.log(p, 2)
    return entropy
```

For example, if the data set is [[1, "yes"], [1, "yes"]], then the total number of records is 2 and there is only one class, "yes", so p = 2/2 = 1 and the entropy is -1 * log2(1) = 0. If the data set is [[1, "yes"], [1, "yes"], [1, "no"]], then the total number of records is 3 and there are two classes, "yes" and "no", with p equal to 2/3 and 1/3 respectively. Entropy = -(2/3) * log2(2/3) - (1/3) * log2(1/3) = 0.918.

The point of this formula is that 2/3 is the proportion of records whose class is "yes" out of all records, and 1/3 is the proportion whose class is "no".
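Running the calc_shannon_ent function above on these two toy data sets reproduces the numbers:

```python
print(calc_shannon_ent([[1, "yes"], [1, "yes"]]))             # 0.0
print(calc_shannon_ent([[1, "yes"], [1, "yes"], [1, "no"]]))  # about 0.918
```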

0x02 How to select attributes, using the Liangshan heroes as an example

The Liangshan heroes are having dinner and must decide whether to eat pancakes rolled with green onion. The following leaders are present: Lu Zhishen, Wu Song, Gong Sunsheng, Song Jiang, Wu Yong, Zhu Tong, Li Ying, Yan Shun, Le He, Ruan Xiaoqi.

The corresponding code is

```python
_data_set = [
    [0, 1, "yes"],  # Song Jiang: not a monk, from Shandong, eats green onions
    [0, 1, "yes"],  # Wu Yong: not a monk, from Shandong, eats green onions
    [0, 1, "yes"],  # Zhu Tong
    [0, 1, "yes"],  # Li Ying
    [0, 1, "yes"],  # Yan Shun
    [0, 1, "yes"],  # Le He
    [0, 1, "yes"],  # Ruan Xiaoqi
    [1, 1, "yes"],  # Wu Song: a monk, from Shandong, eats green onions
    [1, 0, "no"],   # Lu Zhishen: a monk, not from Shandong, does not eat green onions
    [1, 0, "no"],   # Gong Sunsheng: a monk, not from Shandong, does not eat green onions
]
labels = ["Monk", "Shandong"]  # "yes" means eats green onions, "no" means does not
```

1. Calculate the “original entropy”

The original entropy is calculated from the proportion of heroes who eat green onions versus those who do not.

```
Entropy(original) = contribution of "yes" + contribution of "no"
                  = -(8/10) * log2(8/10) - (2/10) * log2(2/10)
                  = -0.8 * log2(0.8) - 0.2 * log2(0.2)
                  = 0.721928
```
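As a quick check, calc_shannon_ent applied to the full _data_set defined above gives the same value:

```python
print(calc_shannon_ent(_data_set))  # about 0.721928
```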

2. Calculate “new entropy” & Gain

2.1 Splitting by whether or not they are monks

If they are divided according to whether they are monks or not, then

```
Monks:  Lu Zhishen, Gong Sunsheng --> do not eat green onion;  Wu Song --> eats green onion
Laymen: Song Jiang, Wu Yong, Zhu Tong, Li Ying, Yan Shun, Le He, Ruan Xiaoqi --> all eat green onion
```

We find that among the monks only one eats green onions while two do not, whereas all seven laymen eat them. So splitting by whether someone is a monk already looks fairly reasonable.

How are the entropy and the gain calculated?

```
Laymen: 7 people, 7 yes.   Monks: 3 people, 1 yes and 2 no.

Entropy(split by monk) = (7/10) * Entropy(7 yes) + (3/10) * Entropy(1 yes, 2 no)
Entropy(7 yes)         = -(7/7) * log2(7/7) = 0
Entropy(1 yes, 2 no)   = -(1/3) * log2(1/3) - (2/3) * log2(2/3) = 0.918
Entropy(split by monk) = 0.7 * 0 + 0.3 * 0.918 = 0.2754
Gain(split by monk)    = Entropy(original) - Entropy(split by monk) = 0.721928 - 0.2754 = 0.446528
```
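The same numbers can be reproduced in code. The split_dataset and conditional_entropy helpers below are illustrative names introduced here (not from the original article); they group the records by one feature column and take the weighted average of the group entropies:

```python
def split_dataset(dataset, feature_index):
    # Group records by the value found in the given feature column.
    groups = {}
    for record in dataset:
        groups.setdefault(record[feature_index], []).append(record)
    return groups

def conditional_entropy(dataset, feature_index):
    # Weighted entropy of the subsets obtained by splitting on one feature.
    total = len(dataset)
    return sum(len(subset) / total * calc_shannon_ent(subset)
               for subset in split_dataset(dataset, feature_index).values())

base = calc_shannon_ent(_data_set)
print(conditional_entropy(_data_set, 0))         # split by "Monk"  -> about 0.2755
print(base - conditional_entropy(_data_set, 0))  # gain             -> about 0.4464
```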

2.2 Splitting by native place

```
Lu Zhishen (Pingliang, Gansu), Gong Sunsheng (Jixian, Tianjin)                                    --> do not eat green onion
Song Jiang, Wu Yong, Zhu Tong, Li Ying, Yan Shun, Le He, Wu Song, Ruan Xiaoqi (all from Shandong) --> eat green onion
```

This split is completely clean, so native place looks like the better attribute to judge by.

How are the entropy and the gain calculated?

```
Not from Shandong: 2 people, 2 no.   From Shandong: 8 people, 8 yes.

Entropy(split by native place) = (2/10) * Entropy(2 no) + (8/10) * Entropy(8 yes)
Entropy(2 no)                  = -(2/2) * log2(2/2) = 0
Entropy(8 yes)                 = -(8/8) * log2(8/8) = 0
Entropy(split by native place) = 0 + 0 = 0
Gain(split by native place)    = Entropy(original) - Entropy(split by native place) = 0.721928 - 0 = 0.721928
```

The numbers also confirm that the split should be made by native place, since it yields the larger gain.

In general, once you understand how the branches are generated, you understand the ID3 algorithm. The method computes the information gain of every attribute and selects the attribute with the highest information gain for the next split. Under this algorithm, the importance of an attribute is reflected in its influence on the entropy of the system: if it can remove a lot of uncertainty, it is an important attribute.
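Putting the pieces together, here is a minimal sketch of ID3's attribute selection for this example, reusing calc_shannon_ent and the conditional_entropy helper sketched earlier (choose_best_feature is an illustrative name, not from the original article):

```python
def choose_best_feature(dataset):
    # Return the index of the feature with the highest information gain.
    base_entropy = calc_shannon_ent(dataset)
    gains = []
    for feature_index in range(len(dataset[0]) - 1):   # the last column is the label
        gain = base_entropy - conditional_entropy(dataset, feature_index)
        gains.append((gain, feature_index))
    return max(gains)[1]

best = choose_best_feature(_data_set)
print(labels[best])  # "Shandong" -- native place wins, matching the hand calculation
```

A full ID3 implementation would then recurse on each subset with the chosen attribute removed, but that step is beyond what this example needs.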

0xEE Personal information

Thoughts on life and technology

Wechat public account: Rosie’s Thinking

If you want to be notified when new articles are published, or to browse the technical material I recommend, please follow the account.
