• This article is reposted from the WeChat official account: Alchemy of Machine Learning
  • Author: Chen Yixin (exchanges and discussion are welcome)
  • Contact: WeChat CYX645016617

[TOC]

1.1 Perceptual Understanding

IS stands for Inception Score.

Entropy can be used to describe randomness: if a random variable is highly predictable, it has low entropy; conversely, if it is disordered and unpredictable, it has high entropy. This is the same idea behind using cross-entropy to train classification networks.

In the figure below, we have two probability distributions: a Gaussian and a uniform distribution. It can be inferred that the entropy of the Gaussian is smaller than the entropy of the uniform distribution, because the Gaussian concentrates its mass near 0, making the random variable much more predictable.
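To make this concrete, here is a minimal sketch (my own, not from the original article) that discretizes a narrow Gaussian and a uniform distribution on the same grid and compares their Shannon entropies; the grid and the standard deviation of 0.5 are arbitrary choices:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum(p * log p), ignoring zero-probability bins."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

bins = np.linspace(-3, 3, 61)                      # discretize the real line
gaussian = np.exp(-bins ** 2 / (2 * 0.5 ** 2))     # mass concentrated near 0
gaussian /= gaussian.sum()
uniform = np.ones_like(bins) / len(bins)           # every value equally likely

print(f"entropy of the peaked (Gaussian-like) distribution: {entropy(gaussian):.3f}")
print(f"entropy of the uniform distribution:                {entropy(uniform):.3f}")
# The uniform distribution has the larger entropy: it is the most unpredictable.
```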

The IS metric measures two capabilities of a generative model:

  • the quality of the generated images;
  • the diversity of the generated images.

We define $x$ as a generated image and $y$ as the result of feeding that image into a classification network (taking ImageNet as an example, this is a 1000-class classification).

The higher the quality of an image, the more certain its classification result will be. So the more deterministic $p(y|x)$ is, the smaller its entropy, and the better the image quality.

Above, $x$ can be understood as a single picture; in practice we work with a collection of generated pictures, for example a set of 1000 randomly generated images.

All 1000 of those images are then fed into the classification network to determine their categories. Diversity is best when the same number of images is produced for each category. In that case the probability of generating each category is equal, which means the entropy of $p(y)$ is highest (because the generated category is maximally uncertain).

So we want the entropy of $p(y|x)$ to be as small as possible (higher quality), and the entropy of $p(y)$ to be as large as possible (greater diversity).

1.2 Mathematical Derivation

The higher the quality, the smaller the entropy of $p(y|x)$. The entropy of $p(y|x)$ can be written as:


$$-\sum^m_{i=1} p(y_i|x_i) \times \log\big(p(y_i|x_i)\big)$$

where $m$ represents the number of generated images.

The greater the diversity the better, meaning that the entropy of $p(y)$ should be larger; the formula can be written as:


$$-\sum^m_{i=1} p(y_i) \times \log\big(p(y_i)\big)$$
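As a practical illustration (a sketch of my own, following the common convention of approximating $p(y)$ by the average of the softmax vectors over the batch, rather than the article's exact indexing), the two entropies can be computed from a matrix of classifier outputs like this:

```python
import numpy as np

def quality_and_diversity_entropy(probs, eps=1e-12):
    """probs: array of shape (m, num_classes); row i is p(y|x_i) for image i."""
    # Entropy of p(y|x): computed per image, then averaged over the batch.
    # Smaller value -> more confident classifications -> higher image quality.
    cond_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

    # p(y) is approximated by the average softmax vector over all m images.
    # Larger value -> classes are hit evenly -> higher diversity.
    p_y = probs.mean(axis=0)
    marg_entropy = -np.sum(p_y * np.log(p_y + eps))
    return cond_entropy, marg_entropy

# Toy usage: 1000 "generated images", 10 classes, fairly confident predictions.
rng = np.random.default_rng(0)
logits = 5 * rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(quality_and_diversity_entropy(probs))
```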

Here we again use the KL divergence, a function that describes the distance between two distributions, which also appears in GAN theory.


$$
\begin{aligned}
KL\big(p(y|x),\ p(y)\big) &= \sum^m_{i=1} p(y_i|x_i) \times \log\left(\frac{p(y_i|x_i)}{p(y_i)}\right) \\
&= \sum^m_{i=1} p(y_i|x_i) \times \log\big(p(y_i|x_i)\big) - \sum^m_{i=1} p(y_i|x_i) \times \log\big(p(y_i)\big)
\end{aligned}
$$
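As a quick numerical sanity check (a small sketch of my own, not from the article), this decomposition can be verified directly for any pair of discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # e.g. a confident p(y|x)
q = np.array([1/3, 1/3, 1/3])   # e.g. a uniform p(y)

kl_direct = np.sum(p * np.log(p / q))
kl_split = np.sum(p * np.log(p)) - np.sum(p * np.log(q))
print(kl_direct, kl_split)      # both print the same value, about 0.2968
```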

The KL divergence between the two distributions thus splits into a subtraction. The first term is the negative of the entropy we used to describe image quality, which we abbreviate as $-E(p(y|x))$.

The second term, $\sum^m_{i=1} p(y_i|x_i) \times \log\big(p(y_i)\big)$, needs one more transformation:


$$=\sum^m_{i=1} \frac{p(y_i|x_i)}{p(y_i)} \times p(y_i) \times \log\big(p(y_i)\big)$$

We find that this formula is almost the same as the entropy describing generation diversity, except for the extra factor $\frac{p(y_i|x_i)}{p(y_i)}$, which we would like to equal 1. Let us now show this:


$$
\begin{aligned}
\frac{p(y_i|x_i)}{p(y_i)} &= \frac{p(y_i, x_i)}{p(y_i) \times p(x_i)} \\
&= \frac{p(y_i) \times p(x_i)}{p(y_i) \times p(x_i)} = 1
\end{aligned}
$$

But there is an assumption involved here: that the variables of the joint probability are independent of each other. The generated images and their categories are obviously not independent; this is one of the limitations of IS. IS is essentially just the KL divergence between these two distributions.
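To see why this matters, here is a toy check (my own sketch, not from the article): the ratio $\frac{p(y|x)}{p(y)}$ equals 1 only when $x$ and $y$ are independent, which is exactly what does not hold for a useful generator.

```python
import numpy as np

def ratio(joint, x=0, y=0):
    """p(y|x) / p(y) for a 2x2 joint distribution p(x, y)."""
    p_x = joint.sum(axis=1)             # marginal over images
    p_y = joint.sum(axis=0)             # marginal over classes
    p_y_given_x = joint[x, y] / p_x[x]
    return p_y_given_x / p_y[y]

independent = np.outer([0.5, 0.5], [0.8, 0.2])   # p(x, y) = p(x) * p(y)
dependent = np.array([[0.45, 0.05],              # class strongly depends on image
                      [0.05, 0.45]])

print(ratio(independent))   # 1.0 -> the assumption holds
print(ratio(dependent))     # 1.8 -> the assumption breaks, as for real generators
```

With this caveat in mind, in summary: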


$$KL\big(p(y|x),\ p(y)\big) = -E\big(p(y|x)\big) + E\big(p(y)\big)$$

Therefore, the larger the KL divergence, the higher the IS value: the diversity entropy is larger and the quality entropy is smaller, and the better the generative model is.
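Putting everything together, here is a minimal sketch (my own, not the article's code) of the Inception Score computed from a matrix of softmax predictions, for example the outputs of a pretrained Inception-v3 on the generated images. The standard definition exponentiates the average per-image KL, and in practice the score is usually averaged over several splits of the data, which is omitted here:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: array of shape (num_images, num_classes) of softmax predictions."""
    p_y = probs.mean(axis=0, keepdims=True)     # marginal p(y) over the batch
    # Per-image KL(p(y|x) || p(y)), then averaged over all images.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))             # IS = exp(mean KL)

# Toy usage: perfectly confident and perfectly even predictions give the maximum.
probs = np.eye(10)[np.arange(1000) % 10]        # 1000 one-hot "predictions"
print(inception_score(probs))                   # ≈ 10, the number of classes
```

Higher scores indicate both confident per-image predictions and even coverage of the classes, which are exactly the two entropies discussed above.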

1.3 Physical Significance of KL Divergence

KL divergence measures the distance between two distributions.

In this figure, we hope the conditional probability $p(y|x)$ is deterministic, which can be understood as a Gaussian distribution with variance close to zero; the other probability, $p(y)$, should be a uniform distribution, which can be seen as a Gaussian distribution with infinite variance. The standard deviations differ so much that we do not need to worry about where the means are. So we can intuitively see from the figure that the farther apart these two extreme Gaussian distributions are, the better the quality and diversity of the model.

The KL divergence thus takes advantage of both the deterministic part ($p(y|x)$) and the uncertain part ($p(y)$) of the generative network's behavior.