From Medium
By Frank Preiswerk
Compiled by Heart of the Machine
Contributors: Nurhachu Null, Jiang Siyuan
Information theory and information entropy are important concepts in AI and machine learning; we often use their key ideas to describe probability distributions or to quantify the similarity between probability distributions. In this article we cover the basics of information theory, starting from self-information and information entropy, moving on to cross entropy, and then deriving the KL divergence from maximum likelihood estimation to deepen our understanding of how similarity between distributions is quantified. Finally, we briefly discuss applications of information entropy in machine learning, including selecting decision tree features by mutual information, measuring the loss of classification problems with cross entropy, and Bayesian learning.
Information theory is a branch of applied mathematics concerned with quantifying how much information a signal contains. It was originally invented to study sending messages from a discrete alphabet over a noisy channel, such as communication by radio transmission. This article focuses on the application of information entropy in AI and machine learning. In machine learning we can also apply information theory to continuous variables, and we use its key ideas to describe probability distributions or to quantify the similarity between probability distributions.
Therefore, in machine learning we often need to quantify the expected information content of random events, as well as the similarity between different probability distributions. In both cases, Shannon entropy is used to measure the information content of a probability distribution. Shannon entropy, named after Claude Shannon, the father of information theory, is also called information entropy or, in the continuous case, differential entropy.
Self-information
The basic concept behind Shannon entropy is the so-called self-information of an event, sometimes called its uncertainty. The intuition is as follows: when an unlikely outcome of an event (a random variable) is observed, we take it to carry a lot of information; conversely, when an outcome is observed frequently, we take it to carry little information. It helps to relate self-information to how surprising an event is. For example, an extremely biased coin that comes up heads on every flip is completely predictable, so we are never surprised by the outcome, which means the experiment gives us zero information; in other words, its self-information is 0. If the coin were slightly less biased, then even though heads appears more than 50 percent of the time, each flip would still convey some information, so its self-information is greater than 0. A coin biased so that it always lands tails likewise has zero self-information. With an unbiased coin, the probability of heads and of tails on each flip is 50%, and the outcome is the least predictable, so we get the most surprising result. We can also say that the entropy of the uniform distribution is the highest, and the entropy of a deterministic event is the lowest.
Based on these informal requirements, we can look for a suitable function to describe self-information. For a discrete random variable X with possible values x_1, x_2, ..., x_n and probability mass function P(X), any positive, monotonically decreasing function I(p_i) of the probabilities p_i (which lie between 0 and 1) could serve as a measure of information. A further key requirement is additivity over independent events: two consecutive coin flips should carry twice as much information as a single flip. This makes sense for independent variables, since the surprise, or unpredictability, is twice as large. Formally, for independent events x_i and x_j we require I(p_i * p_j) = I(p_i) + I(p_j). The function that satisfies all of these requirements is the negative logarithm, so we can use it to define self-information: I(p) = -log_b(p).
Figure 1 shows the self-information function I(p).
Figure 1: Self-information I(p). Low probability corresponds to high self-information, and vice versa.
Let's go back to the simple coin-flip experiment. In information theory, 1 bit of information (a unit also known as a shannon) represents the two possible outcomes of a single coin flip. Similarly, for two consecutive flips, 2 bits are needed to describe the 4 possible outcomes. In general, log_2(n) bits are needed to describe n equally likely outcomes, and this is also the self-information of any one of them. Let's verify the self-information of a sequence of three consecutive flips: there are 2^3 = 8 possible outcomes, each with probability 0.5^3 = 0.125. The self-information of such an outcome is therefore I(0.125) = -log_2(0.125) = 3. We need three bits to describe all of the possible outcomes, so the self-information of any particular sequence of three consecutive coin flips is 3 bits.
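As a quick check of these numbers, here is a minimal Python sketch (the helper name `self_information` is our own) that evaluates I(p) = -log_2(p) for the coin-flip examples above:

```python
import math

def self_information(p, base=2):
    """Self-information I(p) = -log_b(p) of an outcome with probability p."""
    return -math.log(p, base)

# A fair coin: each outcome of a single flip carries 1 bit.
print(self_information(0.5))       # 1.0

# Three consecutive fair flips: 2**3 = 8 equally likely outcomes,
# each with probability 0.125, so each sequence carries 3 bits.
print(self_information(0.5 ** 3))  # 3.0
```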
We can also compute the self-information of continuous random variables. Figure 2 shows three different probability density functions and their corresponding information functions I(p). The Dirac delta in Figure 2(A) corresponds to the strongly biased coin that always lands on the same side, and hence to zero entropy. Wherever p(x) = 0 the self-information is infinite; however, since these zero-probability events never actually occur, this is only hypothetical. The Gaussian probability density function in Figure 2(B) is the analogue of a coin that usually, but not always, lands on the same side. Finally, Figure 2(C) shows a uniform probability density function, which corresponds to uniform self-information and is the analogue of our unbiased coin.
Figure 2: Three different probability density functions and their self-information I(p) on [-3, 3]. (A) Dirac δ function (fully deterministic); (B) Gaussian distribution with μ = 0, σ = 0.5; (C) uniform distribution.
Entropy
So far we have only talked about self-information. In the fair-coin experiment, self-information actually equals the Shannon entropy, because all outcomes are equally likely. In general, Shannon entropy is the expected value of the self-information over all possible outcomes of X: H(X) = -Σ_i P(x_i) log_b P(x_i).
Here b is the base of the logarithm. Above we used b = 2; other common choices are b = 10 and b = e. The choice does not matter much, because logarithms of different bases differ only by a constant factor. We still assume base 2 here, so we omit b in the formulas below.
If you pay close attention, you might wonder what happens when p(x_i) = 0, because in that case we would have to compute 0 · log(0). What we actually need is the limit lim_(p→0) p·log(p) = 0, which can be shown with L'Hôpital's rule or a Taylor expansion; see a standard textbook for the derivation.
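With the 0·log(0) terms treated as 0, as this limit justifies, a minimal Python sketch of discrete Shannon entropy (the helper name `shannon_entropy` is ours) might look like this:

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy H = -sum_i p_i * log_b(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(shannon_entropy([1.0, 0.0]))   # deterministic outcome: 0.0 bits
print(shannon_entropy([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits
```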
When Shannon entropy is generalized to the continuous domain, it is usually called differential entropy. For a continuous random variable X with probability density function p(x), it is defined as H(X) = -∫ p(x) log p(x) dx.
The entropies of our three distributions are 0 (Dirac δ distribution), 174 (Gaussian distribution), and 431 (uniform distribution). The pattern that emerges from these experiments is that a broader distribution has a higher information entropy. Looking closely at Figures 2(B) and 2(C) helps to see why: although the area under the I(p) curve of the Gaussian is much larger than that of the uniform distribution, its information entropy is much smaller, because I(p) is weighted by the probability density p, and p is close to 0 on both sides of the Gaussian. A broader probability density corresponds to higher information entropy. A good metaphor for remembering this is to imagine a gas filling a tank. We know from physics that the entropy of a closed system increases over time and never decreases. After gas is injected from one side of the tank, the distribution of the gas particles converges toward a uniform one. Low entropy means the gas particles are concentrated in a particular region, and this never happens spontaneously. Many gas particles concentrated in a small region is the analogue of our Gaussian probability density function, and the Dirac delta distribution is the extreme case in which all of the gas is compressed into an infinitesimally small region.
Cross entropy
Cross entropy is a mathematical tool for comparing two probability distributions p and q. It is similar to entropy, except that we compute the expected value of log(q) weighted by the probabilities p, rather than by q itself: H(p, q) = -Σ_i p(x_i) log q(x_i).
In information theory, this quantity is the number of bits we would need if we used the "wrong" encoding, built for q instead of p, to encode events that follow the distribution p. In machine learning it is a useful tool for measuring the similarity of probability distributions and is often used as a loss function, because cross entropy equals the KL divergence plus an entropy term, namely H(p, q) = D_KL(p || q) + H(p). When we minimize the cross entropy with respect to q, H(p) is a constant and can be omitted, so minimizing cross entropy is in this case equivalent to minimizing the KL divergence. Since the KL divergence can be derived directly from maximum likelihood estimation, the next section takes GANs as an example and uses MLE to derive the KL divergence expression.
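To make the relation H(p, q) = H(p) + D_KL(p || q) concrete, here is a small Python sketch on a toy discrete example (the distributions and the helper name are our own):

```python
import math

def cross_entropy(p, q, base=2):
    """Cross entropy H(p, q) = -sum_i p_i * log_b(q_i)."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.5, 0.25]   # model distribution

print(cross_entropy(p, p))  # equals H(p): 1.5 bits
print(cross_entropy(p, q))  # 1.75 bits >= H(p); the 0.25-bit gap is D_KL(p || q)
```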
KL divergence
Closely related to cross entropy, the KL divergence is another measure of similarity used in machine learning. The KL divergence from q to p is written D_KL(p || q). In Bayesian inference, D_KL(p || q) measures the information gained when one's beliefs are updated from the prior distribution q to the posterior distribution p, or, in other words, the information lost when q is used to approximate p. For example, the KL divergence is used when training the latent-space representation of a variational autoencoder. The KL divergence can be expressed in terms of entropy and cross entropy: D_KL(p || q) = H(p, q) - H(p).
Cross entropy measures the average number of bits needed to encode events drawn from p with a coding scheme optimized for q, while the KL divergence gives the number of extra bits incurred by using the scheme based on q instead of the optimal scheme based on p. From this we can see that in machine learning, where p is fixed, cross entropy and KL divergence differ only by a constant additive term, so they are equivalent as optimization objectives. From a theoretical point of view it still makes sense to consider the KL divergence, however; one of its properties is that it equals zero when p and q are equal.
The KL divergence has many useful properties, the most important of which is that it is non-negative. The KL divergence is 0 if and only if p and q are the same distribution (in the case of discrete variables) or equal "almost everywhere" (in the case of continuous variables). Because the KL divergence is non-negative and measures the difference between two distributions, it is often used as a kind of distance between distributions. However, it is not truly a distance, because it is not symmetric: for some p and q, D_KL(p || q) is not equal to D_KL(q || p). This asymmetry means that the choice between D_KL(p || q) and D_KL(q || p) has real consequences.
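A minimal Python sketch (with our own helper name and toy distributions) checking the two properties just mentioned, zero divergence only for identical distributions and asymmetry in general:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log2(p_i / q_i), assuming q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

print(kl_divergence(p, p))  # 0.0 -- identical distributions
print(kl_divergence(p, q))  # ~0.531
print(kl_divergence(q, p))  # ~0.737 -- not symmetric
```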
In Li Hongyi's lecture, the KL divergence is derived from maximum likelihood estimation. Given the real data distribution P_data(x) and the generated data distribution P_G(x; θ), a GAN hopes to find a set of parameters θ that makes the distance between P_G(x; θ) and P_data(x) as small as possible, i.e. a set of generator parameters that lets the generator produce very realistic images.
Now we can draw a set of real images from the training set to train the parameters θ of the distribution P_G(x; θ) so that it approximates the real distribution. We therefore draw m real samples {x^1, x^2, ..., x^m} from P_data(x), where the superscript in x^i indexes the i-th sample. For each real sample we can compute P_G(x^i; θ), the probability of sample x^i under the generating distribution determined by θ. We can then construct the likelihood function: L = ∏_{i=1}^{m} P_G(x^i; θ).
Here "∏" denotes a product and P_G(x^i; θ) is the probability of the i-th sample under the generating distribution, so the likelihood L is the probability of all m real samples occurring under the distribution P_G(x; θ). If the distribution P_G(x; θ) is similar to P_data(x), then the real data are likely under P_G(x; θ), and the probability that all m samples appear under P_G(x; θ) will be high.
We can now maximize the likelihood function L to obtain the generative distribution closest to the real distribution (i.e. the optimal parameters θ):
In the derivation above we want to maximize the likelihood function L. Taking the logarithm of the likelihood turns the product ∏ into a sum ∑ without changing the result of the optimization. Maximum likelihood estimation therefore reduces to finding the θ that maximizes the expectation of log P_G(x; θ), and the expectation E[log P_G(x; θ)] can be written as the integral of P_data(x) log P_G(x; θ) over x: ∫ P_data(x) log P_G(x; θ) dx. Because the optimization is over θ, we can add a term that does not contain θ without affecting the result, namely -∫ P_data(x) log P_data(x) dx. After adding this integral we can combine the two integrals and obtain a form resembling the KL divergence. The process is as follows:
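For completeness, the derivation just described can be written out in full; the following is a reconstruction in standard notation:

```latex
\begin{aligned}
\theta^{*}
  &= \arg\max_{\theta} \prod_{i=1}^{m} P_G(x^{i};\theta)
   = \arg\max_{\theta} \sum_{i=1}^{m} \log P_G(x^{i};\theta) \\
  &\approx \arg\max_{\theta} \, \mathbb{E}_{x \sim P_{\mathrm{data}}}\!\left[\log P_G(x;\theta)\right]
   = \arg\max_{\theta} \int P_{\mathrm{data}}(x) \log P_G(x;\theta)\, dx \\
  &= \arg\max_{\theta} \left[ \int P_{\mathrm{data}}(x) \log P_G(x;\theta)\, dx
        - \int P_{\mathrm{data}}(x) \log P_{\mathrm{data}}(x)\, dx \right] \\
  &= \arg\min_{\theta} \int P_{\mathrm{data}}(x) \log \frac{P_{\mathrm{data}}(x)}{P_G(x;\theta)}\, dx
   = \arg\min_{\theta} D_{\mathrm{KL}}\!\left(P_{\mathrm{data}} \,\|\, P_G(x;\theta)\right)
\end{aligned}
```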
This integral is exactly the KL divergence. Thus, to make the generative distribution P_G(x; θ) as close as possible to the real distribution P_data(x), we only need to find the parameters θ that minimize the KL divergence. In addition, the integral form can be rewritten as the familiar KL divergence expression D_KL(P_data || P_G) = E_{x~P_data}[log P_data(x) - log P_G(x; θ)].
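The same conclusion can be illustrated numerically. The following toy Python sketch (our own setup, not the GAN itself) uses a one-parameter, unit-variance Gaussian as the "generator" P_G(x; θ): maximizing the average log-likelihood of samples drawn from P_data selects the θ that also minimizes D_KL(P_data || P_G), which in this toy case equals (2.0 - θ)^2 / 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# P_data is a Gaussian with mean 2.0 and unit variance;
# P_G(x; theta) is a unit-variance Gaussian with unknown mean theta.
samples = rng.normal(loc=2.0, scale=1.0, size=10_000)

def avg_log_likelihood(theta, x):
    """Average log P_G(x; theta) for a unit-variance Gaussian model."""
    return np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

thetas = np.linspace(0.0, 4.0, 401)
scores = [avg_log_likelihood(t, samples) for t in thetas]
theta_mle = thetas[int(np.argmax(scores))]

# The maximizer lands close to the true mean 2.0, which is also the theta
# that minimizes D_KL(P_data || P_G) in this toy example.
print(theta_mle)
```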
In the case of discrete variables, the KL divergence measures the expected number of extra bits required to send a message containing symbols drawn from the distribution P when we use a code that was designed to minimize the length of messages drawn from the distribution Q.
Use in machine learning
You might wonder how entropy and machine learning are related here. Let’s look at some specific areas.
Bayesian learning
First, the Gaussian example described above is important because Gaussians are a very common modeling choice in machine learning applications. One goal of machine learning is to reduce entropy: we want to make predictions, and we want to be confident in them, and entropy is a good measure of that confidence. In Bayesian learning, the prior distribution is often assumed to have a broad probability density function, reflecting the uncertainty about the random variable before any observation. As data come in, the entropy decreases and the posterior distribution becomes peaked around the most likely parameter values.
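As an illustrative sketch only (SciPy-based, with made-up coin flips), the following shows the differential entropy of a Beta posterior shrinking as observations accumulate, mirroring the broad-prior-to-peaked-posterior picture described above:

```python
from scipy.stats import beta

# Toy setup: estimating a coin's heads probability from observed flips.
observations = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # 1 = heads, 0 = tails

a, b = 1.0, 1.0  # flat Beta(1, 1) prior
print("prior entropy:", beta(a, b).entropy())

for i, flip in enumerate(observations, start=1):
    a += flip
    b += 1 - flip
    # The differential entropy of the posterior decreases as evidence accumulates.
    print(f"after {i:2d} flips, posterior Beta({a:.0f},{b:.0f}) entropy:",
          beta(a, b).entropy())
```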
Decision tree learning
A decision tree learning algorithm generally involves feature selection, tree generation, and tree pruning. Feature selection chooses the features that best separate the training data, and the usual criteria are information gain or the information gain ratio.
In statistical learning, the difference between the entropy H(Y) and the conditional entropy H(Y | X) is called the mutual information, and the information gain used in decision tree learning is equivalent to the mutual information between the classes and a feature in the training data. Given a training dataset D and a feature A, the empirical entropy H(D) measures the uncertainty in classifying dataset D, while the empirical conditional entropy H(D | A) measures the uncertainty in classifying D when feature A is given. Their difference, the information gain g(D, A) = H(D) - H(D | A), measures how much feature A reduces the uncertainty in classifying dataset D. Clearly, the information gain depends on the feature: different features generally have different information gains, and features with larger information gain have stronger classification power.
Feature selection by the information gain criterion works as follows: for the training dataset (or subset) D, compute the information gain of each feature, compare them, and select the feature with the largest information gain.
Entropy is therefore used to grow the tree in decision tree learning. Starting from the root node, the dataset S is split into subsets using the "best" attribute, i.e. the attribute that minimizes the entropy of the resulting subsets (equivalently, maximizes the information gain). This process is repeated recursively until there are no attributes left to split on. This is the ID3 algorithm: its core is to apply the information gain criterion to select a feature at each node of the decision tree and to build the tree recursively. A sketch of the information gain computation follows.
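Below is a minimal Python sketch of the information gain computation itself (the helper names and the tiny "outlook"/"play" dataset are invented for illustration); a full ID3 implementation would apply `information_gain` recursively at every node:

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D | A) for one candidate feature A."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Tiny made-up example: does the feature "outlook" help predict "play"?
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes"]
print(information_gain(outlook, play))  # ~0.667 bits of the 1.0-bit label entropy
```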
Classification
Cross entropy is the standard loss function for logistic regression and neural networks, in both binary and multi-class classification problems. In general, p is the true distribution and q is the distribution described by the model. Consider binary logistic regression as an example. The labels of the two classes are 0 and 1, and the logistic model assigns the following probabilities to each input: q_(y=1) = y_hat and q_(y=0) = 1 - y_hat, which can be abbreviated as q ∈ {y_hat, 1 - y_hat}. Although the true labels are exactly 0 or 1, we write p ∈ {y, 1 - y}, so don't be confused by this notation. Under this notation, the cross entropy between the true distribution and the estimated distribution for a single sample is H(p, q) = -[y log y_hat + (1 - y) log(1 - y_hat)].
When it is used as a loss function, we take the mean of the cross entropy over the N samples: L = -(1/N) Σ_{i=1}^{N} [y_i log y_hat_i + (1 - y_i) log(1 - y_hat_i)].
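A minimal Python sketch of this loss (the helper name `binary_cross_entropy` and the clipping constant are our own; the natural log is used, as in most libraries):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy loss over N samples."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        y_hat = min(max(y_hat, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return total / len(y_true)

# Confident, correct predictions give a small loss; confident wrong ones a large loss.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8]))  # ~0.14
print(binary_cross_entropy([1, 0, 1], [0.1, 0.9, 0.2]))  # ~2.07
```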
Conclusion
The above covers the basics of information theory needed in machine learning. Although we rarely use the message-length interpretation directly, machine learning mainly borrows the key ideas of information theory to describe probability distributions and to quantify the similarity between them. Information theory is something we must consider when constructing loss functions, and its logarithmic form combines naturally with the exponential form commonly used by output units, which improves the efficiency of learning. Moreover, the success of modern deep networks and the popularity of maximum likelihood estimation owe much to log-based loss functions such as the cross entropy.
Original link:Medium.com/swlh/shanno…
This article was compiled by Heart of the Machine. Please contact this official account for authorization before reprinting.