1. Understand entropy

What is entropy? Who defined it? What is it for? Why does machine learning use it? Let's explore these questions one by one.

Entropy, one of the parameters describing the state of matter in thermodynamics, is represented by the symbol S, and its physical meaning is a measure of the degree of disorder of a system. R. Clausius proposed the concept of "Entropie" in 1854, and the Chinese physicist Hu Gangfu first rendered it into Chinese in 1923, basing the translation on its meaning as a quotient of heat. A. Einstein once summed up the status of entropy theory in science as "entropy theory is the first law for the whole of science".


To understand entropy, you have to talk a little bit about physics.

In the 19th century, physicists began to realize that energy was the driving force of the world and came up with the "law of conservation of energy", which states that the total amount of energy remains constant. But one thing puzzled them.

Physicists discovered that energy can never be converted with 100 percent efficiency. A steam engine, for example, takes heat energy and converts it into mechanical energy to move the machine. In this process some of the heat is always lost and cannot be fully converted into mechanical energy.

(In the figure above, the conversion of energy E always results in an energy loss ∆E.)

At first, physicists thought this was simply due to poor technology, but it turned out that no amount of technical progress could bring the energy loss down to zero. They gave the name entropy to the energy that is wasted during energy conversion and can no longer be reused.

Later, this observation was summed up as the "second law of thermodynamics": energy conversion always produces entropy, and in a closed system all energy will eventually become entropy.

If entropy is energy, why can't we use it? How does it come about? Why does all energy end up as entropy?

Physicists have many different explanations, but the one I find easiest to understand is this: when energy is converted, most of it goes into the intended target state, such as heat into mechanical energy or electricity into light. But, much like mutations in cells, some of the energy ends up in new, unintended states. That energy is entropy, and because it is in a different state it is hard to use, unless new energy comes in from outside specifically to deal with it.

(Above, many new states are created in the process of energy conversion.)

In short, energy transformation creates new states, and entropy is the energy that goes into those states.

Now think about it: what does it mean for there to be more states?

More states means more possibilities, which means more disorder; fewer states means fewer possibilities and a relatively more ordered system. So another way to state the conclusion above is that energy conversion increases the disorder of a system, and entropy is a measure of that disorder.

(In the figure above, low entropy means low disorder, and high entropy means high disorder.)

The more energy is converted, the more new states are created, so a high-energy system is less stable than a low-energy one because it has higher entropy. Moreover, every system in motion involves energy conversion, and the second law of thermodynamics says that every closed system will eventually move toward maximum disorder unless energy is injected from outside.

One thing entropy has taught me is that, without outside input, things always get messier. For example, if a room is never cleaned, it only gets messier, never tidier.

(In the picture above, if you don’t expend energy cleaning, your room is always getting messier and messier.)

Entropy is a measure of disorder: the higher the disorder of a system, the higher its entropy.

2. Understand the amount of information

We know that entropy originated in physics as a measure of disorder in a thermodynamic system. In information theory, entropy is a measure of uncertainty.

And again the question arises: how does entropy relate to information theory?

Information is something we talk about all the time, yet the concept itself remains abstract. The definition in Baidu Encyclopedia: information refers to the objects transmitted and processed by audio, messaging, and communication systems, and more generally to all content spread in human society. So how should the amount of information carried by an event be measured? Intuitively, it should satisfy three requirements:

1. The amount of information is related to the probability of the event: the lower the probability of an event, the more information its occurrence conveys.
2. The amount of information must be non-negative, and the amount of information of an inevitable (certain) event must be zero.
3. Amounts of information can be added: the combined amount of information of two independent events should be the sum of their individual amounts.

Mathematically, these requirements are met by defining the amount of information (self-information) of an event x with probability p(x) as I(x) = -log2 p(x): the smaller p(x) is, the larger I(x) is; a certain event (p(x) = 1) carries zero information; and for two independent events, I(xy) = -log2(p(x)·p(y)) = I(x) + I(y).
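
As a quick sanity check of these three requirements, here is a minimal Python sketch; the probabilities used are made up purely for illustration.

```python
import math

def self_information(p):
    """Amount of information, in bits, of an event with probability p."""
    return -math.log2(p)

# 1. The lower the probability, the more information the event carries
print(self_information(0.5))   # 1.0 bit
print(self_information(0.01))  # ~6.64 bits

# 2. A certain event carries no information
print(self_information(1.0))   # 0.0 bits

# 3. The information of two independent events adds up: I(p * q) == I(p) + I(q)
p, q = 0.5, 0.25
print(self_information(p * q))                    # 3.0 bits
print(self_information(p) + self_information(q))  # 3.0 bits
```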

3. Understand information entropy

But can information be quantified, and how? The answer, of course, is information entropy. As early as 1948, Shannon pointed out in his famous paper "A Mathematical Theory of Communication" that "information is what is used to eliminate random uncertainty", and proposed the concept of "information entropy" (borrowing the concept of entropy from thermodynamics) to solve the problem of measuring information.

OK, so there is such a thing as information entropy! How should we interpret it, and how is it calculated?

To borrow the example Wu Jun uses in The Beauty of Mathematics: suppose the final 32 teams of the World Cup have been determined. How much information is in the random variable "who will be the champion among the 32 teams at the 2018 FIFA World Cup in Russia"?

According to Shannon's formula, for any random variable X, its information entropy (in bits) is defined as H(X) = -Σ p(x)·log2 p(x), summed over all values x, or equivalently H(X) = Σ p(x)·log2(1/p(x)).

Either of these two forms can be used for entropy; they are equivalent and mean the same thing.

Then the entropy of the random variable above (who wins the championship) is H = -(p1·log2 p1 + p2·log2 p2 + … + p32·log2 p32).

Here p1, p2, …, p32 are the probabilities of the 32 teams winning. Wu's book offers three conclusions: first, if the 32 teams are equally likely to win, H = 5; second, H < 5 when the probabilities of winning are not all equal; third, H can never be greater than 5. The first conclusion is easy to check: with equal probabilities, each team has a 1/32 chance of winning, so H = -(1/32·log2(1/32) + 1/32·log2(1/32) + … + 1/32·log2(1/32)) = -log2(1/32) = log2 32 = 5 (bits).

The second and third conclusions can be proved with the Lagrange multiplier method; see "Lagrange Multiplier Method of Extreme Value under Constraints". What this actually says is that the more equal the probabilities of the possible outcomes in a system are, the greater the information entropy, and vice versa.
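
Here is a small Python sketch that checks these conclusions numerically; the non-uniform winning probabilities are invented purely for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Conclusion 1: 32 equally likely teams -> H = 5 bits
uniform = [1 / 32] * 32
print(entropy(uniform))   # 5.0

# Conclusions 2 and 3: unequal probabilities -> H < 5 bits
# (a few strong favourites, the remaining 28 teams sharing what is left)
skewed = [0.2, 0.15, 0.1, 0.1] + [0.45 / 28] * 28
print(sum(skewed))        # ~1.0 (sanity check: it is a probability distribution)
print(entropy(skewed))    # ~4.22, strictly less than 5
```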

As Shannon's formula shows, information entropy is in fact the mathematical expectation of the amount of information of a random variable.

In daily life we often say that someone is concise yet conveys a lot of information, or that someone talks at length but says little of substance, or that a TV drama is so slow that an entire episode goes by and nothing happens. How does "amount of information" in this everyday sense relate to information entropy?

Many people conflate these everyday notions with information entropy and draw conclusions such as "the more you have to say, the higher your entropy", or "the more concise the language, the higher the entropy; the more redundant the language, the lower the entropy", and so on.

It is not that these claims are outright wrong, but they are misleading. Personally, I think what "information" means in these everyday contexts is not so much the amount of information as the quality of the information and the efficiency with which it is conveyed: whether you have substance, insight, and ideas, and whether you can express them effectively within a given length of text or running time. That is really a question of a person's ability, and it has very little to do with information entropy.

4. Joint entropy, conditional entropy and cross entropy

Joint entropy: two random variables X and Y with a joint distribution have a joint entropy, denoted H(X, Y). Conditional entropy: given that the value of the random variable X is known, the entropy that remains in the random variable Y is defined as the conditional entropy of Y, denoted H(Y|X); it measures the uncertainty of Y under the condition that X is known.

The identity H(Y|X) = H(X,Y) - H(X) holds: the entropy contained in (X, Y) minus the entropy contained in X alone. Here is how it is derived:
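
Written out line by line (numbered so the explanation below can be followed), the standard derivation is:

$$
\begin{aligned}
H(Y\mid X) &= H(X,Y) - H(X) \\
 &= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x} p(x)\log p(x) \\
 &= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x}\Bigl(\sum_{y} p(x,y)\Bigr)\log p(x) \\
 &= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x,y} p(x,y)\log p(x) \\
 &= -\sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)} \\
 &= -\sum_{x,y} p(x,y)\log p(y\mid x)
\end{aligned}
$$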

Line 2 to line 3 uses the fact that the marginal distribution p(x) equals the sum of the joint distribution p(x, y) over y; line 3 to line 4 factors out the common term log p(x) and merges the sums over x and y into one; line 4 to line 5 combines the logarithms, -log p(x,y) + log p(x) = -log(p(x,y)/p(x)); and line 5 to line 6 uses the definition of conditional probability, p(x,y) = p(x)·p(y|x), so p(x,y)/p(x) = p(y|x).
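
As a quick numerical check of H(Y|X) = H(X,Y) - H(X), here is a Python sketch over a made-up 2x2 joint distribution.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

H_XY = entropy(p_xy.values())        # joint entropy H(X, Y)
p_x = {0: 0.4 + 0.1, 1: 0.2 + 0.3}   # marginal p(x) = sum over y of p(x, y)
H_X = entropy(p_x.values())          # entropy H(X)

# Conditional entropy straight from its definition:
# H(Y|X) = -sum over x, y of p(x, y) * log2( p(x, y) / p(x) )
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

print(H_XY - H_X)    # ~0.846
print(H_Y_given_X)   # ~0.846, the same value
```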

Relative entropy: also known as mutual entropy, cross entropy, discrimination information, Kullback entropy, or the Kullback-Leibler divergence. Let p(x) and q(x) be two probability distributions over the values of X; then the relative entropy of p with respect to q is D(p||q) = Σ p(x)·log(p(x)/q(x)), summed over all values x.
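
A minimal Python sketch of this computation, with two made-up distributions p and q (note that D(p||q) requires q(x) > 0 wherever p(x) > 0):

```python
import math

def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

print(relative_entropy(p, q))  # ~0.085: how far q is from p
print(relative_entropy(p, p))  # 0.0: a distribution has zero divergence from itself
```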

Appendix: joint entropy, conditional entropy, cross entropy, mutual information


Cross entropy example

Cross entropy is computed as H(x, y) = -Σ xi·log2 yi, where x is the correct (true) probability distribution and y is the predicted probability distribution; the result measures how far the prediction y is from the correct answer x. The smaller the result, the more accurate y is, i.e. the closer it is to x.

For example:

The probability distribution of x is {1/4, 1/4, 1/4, 1/4}. Suppose machine learning now gives us two predicted distributions:

The probability distribution of y1 is {1/4, 1/2, 1/8, 1/8}

The probability distribution of y2 is {1/4, 1/4, 1/8, 3/8}

Intuitively, y2 looks more accurate: its first two entries exactly match x, whereas in y1 only the first entry matches. Let's see whether the formula agrees with this intuition.

The comparison shows that H(x, y1) works out to 9/4, while H(x, y2) is slightly less than 9/4. As explained above, the smaller the cross entropy, the closer the two distributions are, so y2 is indeed closer to x. This is why cross entropy is often used as a loss function in machine learning.
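
For completeness, here is a short Python check of the two cross entropy values above.

```python
import math

def cross_entropy(x, y):
    """Cross entropy H(x, y) = -sum_i x_i * log2(y_i), with x the true distribution."""
    return -sum(xi * math.log2(yi) for xi, yi in zip(x, y) if xi > 0)

x  = [1/4, 1/4, 1/4, 1/4]   # correct distribution
y1 = [1/4, 1/2, 1/8, 1/8]   # first prediction
y2 = [1/4, 1/4, 1/8, 3/8]   # second prediction

print(cross_entropy(x, y1))  # 2.25 (= 9/4)
print(cross_entropy(x, y2))  # ~2.10, slightly less than 9/4, so y2 is closer to x
```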