0x00 preface
This is part 2 of “The Entropy of Chaos in Machine Learning Interviews”. In this part we cover the remaining entropies mentioned in part 1: joint entropy, conditional entropy, relative entropy, and cross entropy.
0x01 joint entropy
Although many entropies were mentioned above, they are intrinsically related. So we will study them according to that intrinsic relationship, starting with joint entropy.
Joint entropy is tied to the joint probability distribution. For random variables X and Y with joint probability distribution P(X, Y), the entropy of that joint distribution is called the joint entropy:
H(X, Y) = −Σ p(x, y) log p(x, y)
Let’s assume that X and Y are independent of each other; you can think of each of them as a coin flip, so the setting is not too abstract. The probability that X comes up heads is p1, and the probability that Y comes up heads is p2. What, then, is their joint entropy?
Obviously, we first need the joint probability distribution, shown below:
Value of (X, Y) | Probability |
---|---|
(heads, tails) | p1 * (1 - p2) |
(heads, heads) | p1 * p2 |
(tails, heads) | (1 - p1) * p2 |
(tails, tails) | (1 - p1) * (1 - p2) |
This is the joint distribution, and from it you can work out the joint entropy. I won’t write it out here because it is rather tedious, but if you are interested you can do it yourself.
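To make it concrete, here is a minimal sketch in Python that computes the joint entropy of the two independent coins; the values p1 = 0.7 and p2 = 0.4 are purely illustrative, and the entropy is measured in bits.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# illustrative values for the two coins; any p1, p2 in (0, 1) work
p1, p2 = 0.7, 0.4

# the four joint probabilities from the table above
joint = [p1 * (1 - p2),        # (heads, tails)
         p1 * p2,              # (heads, heads)
         (1 - p1) * p2,        # (tails, heads)
         (1 - p1) * (1 - p2)]  # (tails, tails)

H_x = entropy([p1, 1 - p1])
H_y = entropy([p2, 1 - p2])
H_xy = entropy(joint)

print(H_x, H_y, H_xy)                # ~0.881, ~0.971, ~1.852
print(np.isclose(H_xy, H_x + H_y))   # True: for independent X, Y the entropies add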
The question we really care about is: what is the relationship between H(X, Y) and H(X), H(Y)?
I won’t do a lot of math here; instead, let’s keep using intuition. Look at the table above and compare it with the probability distribution of Y alone.
Whereas Y originally had only two probabilities, p2 and (1 - p2), the joint distribution has four, which can be viewed as splitting each probability of Y: p2 is split into p2 * p1 and p2 * (1 - p1). In other words, each value of Y already carries some uncertainty; combining with X injects X’s uncertainty into each of those values, so the overall uncertainty clearly increases.
If you understood the meaning of entropy above, it is not hard to see that H(X, Y) must be greater than or equal to both H(X) and H(Y). Only when X has no uncertainty at all, for example when it always comes up heads, does combining X with Y introduce no new uncertainty, and then H(X, Y) = H(Y).
So far we have established some properties of joint entropy by intuition alone, without any math. Being good at this kind of intuition is very valuable.
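For the independent case used in our coin example, the intuition can also be backed up with a short calculation (a sketch for the independent case only, not a general proof):

```latex
\begin{aligned}
H(X, Y) &= -\sum_{x,y} p(x)\,p(y)\,\log\bigl(p(x)\,p(y)\bigr) \\
        &= -\sum_{x,y} p(x)\,p(y)\,\bigl(\log p(x) + \log p(y)\bigr) \\
        &= H(X) + H(Y) \;\ge\; \max\bigl(H(X),\, H(Y)\bigr)
\end{aligned}
```

with H(X, Y) = H(Y) exactly when H(X) = 0, i.e. when X has no uncertainty.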
0x02 conditional entropy
We now know that the joint entropy of X and Y is greater than or equal to the entropy of either variable alone, and that, from Y’s point of view, introducing X increases the entropy. So by how much does introducing X increase the entropy? That amount is the conditional entropy:
H(X | Y) = H(X, Y) − H(Y)
There is an easy mistake to make here: although H(X | Y) is called conditional entropy, it is not the entropy of the conditional probability p(x | y). Why? Because p(x | y), taken over all pairs (x, y), is not a probability distribution at all! Using the two coins above as an example, let’s compute p(x | y); remember that our example assumes X and Y are independent.
P(X | Y) | Probability |
---|---|
P(heads | tails) | p1 |
P(heads | heads) | p1 |
P(tails | heads) | 1 - p1 |
P(tails | tails) | 1 - p1 |
As you can see, these values of p(x | y) add up to 2, so together they are not a probability distribution. You might ask: then why call it conditional entropy at all? Isn’t that deliberately misleading?
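A tiny sketch to check this numerically (p1 = 0.7 is just an illustrative value; with X and Y independent, p(x | y) = p(x)):

```python
p1 = 0.7  # illustrative P(X = heads); since X and Y are independent, p(x|y) = p(x)

# the four values of p(x|y) from the table above
p_x_given_y = [p1, p1, 1 - p1, 1 - p1]

# they sum to 2, not 1, so taken together they are not a probability distribution
print(sum(p_x_given_y))  # 2.0
```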
The reason it is nonetheless called conditional entropy is that its calculation does involve the conditional probability, as follows:
H(X | Y) = −Σ p(x, y) log p(x | y)
This formula can be derived from the definitions of entropy and conditional entropy above, and the derivation is not hard.
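For readers who want the missing step, here is a sketch of the derivation, using p(x, y) = p(x | y) p(y) and the fact that summing p(x, y) over x gives p(y):

```latex
\begin{aligned}
H(X \mid Y) &= H(X, Y) - H(Y) \\
            &= -\sum_{x,y} p(x,y)\,\log p(x,y) \;+\; \sum_{y} p(y)\,\log p(y) \\
            &= -\sum_{x,y} p(x,y)\,\bigl(\log p(x \mid y) + \log p(y)\bigr) \;+\; \sum_{x,y} p(x,y)\,\log p(y) \\
            &= -\sum_{x,y} p(x,y)\,\log p(x \mid y)
\end{aligned}
```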
Let us now analyse the relationship between the conditional entropy H(X | Y) and H(X), again using intuition. The conditional entropy is the extra uncertainty added when X is brought in alongside Y. Intuitively, that extra uncertainty cannot exceed X’s own uncertainty, that is:
H(X | Y) ≤ H(X)
Equality holds only when X and Y are independent of each other.
This is something that we know intuitively and, in fact, can prove.
Those of you who have studied Statistical Learning Methods will be familiar with the concept of information gain; here too we can understand it through intuition.
However, we now need to read H(X | Y) ≤ H(X) in a different way.
Earlier, we reached this conclusion by arguing that combining with Y cannot add more uncertainty than X itself carries. Read the other way around: the original uncertainty of X is H(X). After introducing Y, we get the combined uncertainty H(X, Y); subtracting Y’s own uncertainty H(Y) leaves H(X | Y), which is less than or equal to H(X). What does that mean? It means that introducing Y reduces the uncertainty about X, which is to say it adds information. The amount by which the uncertainty shrinks is the information gain:
Gain(X) = H(X) − H(X | Y)
Information gain is also called mutual information. They are exactly the same.
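As a sketch of how one might compute this numerically, the snippet below builds the information gain from a small joint table; the joint distribution is made up purely for illustration.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability entries."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# a made-up joint distribution p(x, y); rows are values of X, columns values of Y
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)   # marginal distribution of X
p_y = joint.sum(axis=0)   # marginal distribution of Y

H_x = entropy(p_x)
H_y = entropy(p_y)
H_xy = entropy(joint.ravel())

H_x_given_y = H_xy - H_y   # conditional entropy H(X|Y)
gain = H_x - H_x_given_y   # information gain = mutual information

print(H_x, H_x_given_y, gain)   # ~1.0, ~0.722, ~0.278
```

If X and Y were independent, the joint table would factor into the product of its marginals and the gain would come out as 0.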
Note: let me briefly state my view of Li Hang’s book in one sentence: it is not suitable for beginners! It is only suitable for improvement and interview preparation. My biggest complaint is its derivation of logistic regression, which is not elegant. Later I will write a dedicated article on the derivation of logistic regression.
0x03 relative entropy (also called mutual entropy, or KL divergence)
Relative entropy is saved for near the end because it is not closely related to the previous concepts.
Suppose we have the following five samples:
Sample | Label |
---|---|
1 | 1 |
2 | 1 |
3 | -1 |
4 | -1 |
5 | -1 |
We want to infer the true distribution of the labels from these samples. Suppose the true label distribution is q(x), where x takes the values 1 and -1.
To determine the values of q(x) at x = 1 and x = -1, we naturally use maximum likelihood. The likelihood function is:
q(x=1) * q(x=1) * q(x=−1) * q(x=−1) * q(x=−1)
If you have difficulty understanding the likelihood function, I suggest you brush up on it, because it is used so much in machine learning that you really cannot get around it; logistic regression is also essentially maximum likelihood.
Collecting the factors above, we get:
q(x=1)^2 * q(x=−1)^3
Next, raise this expression to the power 1/5 (a monotone transformation, so it does not change where the maximum is) and get:
q(x=1)^(2/5) * q(x=−1)^(3/5)
Now let p(x) be the label distribution observed in the samples: p(x=1) = 2/5 and p(x=-1) = 3/5. With this normalization, the likelihood can be written compactly as:
Π q(x)^p(x)
Then take the logarithm:
Σ p(x) log q(x)
and flip the sign:
D = −Σ p(x) log q(x)
Maximizing the likelihood function is therefore equivalent to minimizing D.
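A quick numeric sketch of this equivalence: sweep the candidate value q(x=1) over a grid and check that D is smallest when q matches the sample distribution p (2/5 vs. 3/5).

```python
import numpy as np

p = np.array([2/5, 3/5])            # sample distribution: p(x=1), p(x=-1)

qs = np.linspace(0.01, 0.99, 99)    # candidate values for q(x=1)
D = [-(p[0] * np.log(q) + p[1] * np.log(1 - q)) for q in qs]

print(qs[int(np.argmin(D))])        # 0.4 -- D is minimized when q(x=1) = p(x=1)
```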
Impatient friends may be wondering: what exactly is mutual entropy? Here is the definition:
D(q || p) = D − H(p)
The D here is the D we obtained above.
What does mutual entropy describe? How can we understand it intuitively? After all, deriving mutual entropy from maximum likelihood as above is a bit too mathematical.
Look at D again: if q and p are exactly the same, then D equals H(p), and D(q || p) equals 0.
Similarly, the closer q and p are, the smaller D(q || p) is. Mutual entropy D(q || p) in fact measures how close the hypothesized true distribution q is to the distribution p obtained from the samples, and it expresses this notion of closeness in the form of an entropy.
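Expanding the definition with the formulas above makes this explicit (just algebra, no new assumptions):

```latex
D(q \,\|\, p) \;=\; D - H(p)
            \;=\; -\sum_x p(x)\log q(x) \;+\; \sum_x p(x)\log p(x)
            \;=\; \sum_x p(x)\log\frac{p(x)}{q(x)}
```

which is 0 when q = p and grows as q moves away from p.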
0x04 cross entropy
Cross entropy comes last because it is the easiest: it is simply the D above.
Let’s see why it is called cross entropy. What does “cross” refer to?
D = −Σ p(x) log q(x)
Originally, the entropy of p is −Σ p(x) log p(x), and the entropy of q is −Σ q(x) log q(x).
Now cross them: replace the term inside the log in p’s entropy with q(x), and the term inside the log in q’s entropy with p(x). That gives:
D = −Σ p(x) log q(x)
D2 = −Σ q(x) log p(x)
Here D is the cross entropy of p and q, and D2 is the cross entropy of q and p.
We can also see that cross entropy is not commutative: the cross entropy of p and q is, in general, different from the cross entropy of q and p.
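A short sketch verifying the asymmetry numerically; the two distributions are made up for illustration (p is the 2/5–3/5 sample distribution from before, q is arbitrary).

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy: -sum_x p(x) * log q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = [0.4, 0.6]   # the sample distribution from the previous section
q = [0.7, 0.3]   # some other distribution, purely illustrative

print(cross_entropy(p, q))   # ~0.865
print(cross_entropy(q, p))   # ~0.795 -- different, so cross entropy is not symmetric
```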
0xFF No summary, no progress
The above is my understanding of the various concepts of entropy, shared in the hope that it helps friends who need it. Of course, given the limits of my own understanding, there are inevitably shortcomings or even mistakes; I welcome your criticism and corrections!
Solemn declaration: this article is reprinted, copied, and quoted from the sword go tianya’s “Entropy of Chaos in Machine Learning Interviews”.
- The author: Mudong Koshi
- Links to this article: www.mdjs.info/2018/01/27/…
- Copyright Notice: All articles on this blog are licensed under a CC BY-NC-SA 3.0 license unless otherwise stated. Please indicate the source when reprinting!