
Today we derive the formula of the cross-entropy loss function in two ways, and along the way explain why cross entropy can be used as the loss function for classification problems.

Maximum likelihood estimation

In statistics, maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability model from observed data.

KL divergence

KL divergence was introduced as a measure of the difference between two probability distributions.


$$
D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
$$
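As a quick sanity check on this definition, here is a minimal NumPy sketch (the two distributions below are made-up example values):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # example distribution P
q = np.array([0.5, 0.3, 0.2])  # example distribution Q

print(kl_divergence(p, q))  # > 0 when P and Q differ
print(kl_divergence(p, p))  # 0.0 when the two distributions are identical
```

Note that in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, which is why it is called a divergence rather than a true distance.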

The KL divergence measures how far the distribution $Q(i)$ is from $P(i)$. For our classification problem, we measure, for each sample, the divergence between the true data distribution and the distribution predicted by the model:


$$
D_{KL}\left(P^*(y_i|x_i)\,||\,P(\hat{y}_i|x_i;\theta)\right)
$$
  • $P(\hat{y}_i|x_i;\theta)$ is the distribution predicted by the model, where $\theta$ denotes the parameters of the neural network, $x_i$ is the $i$-th sample (for example, an image), and $\hat{y}_i$ is the prediction output by the neural network
  • $P^*(y_i|x_i)$ is the true distribution of the data

$$
D_{KL}\left(P^*(y_i|x_i)\,||\,P(\hat{y}_i|x_i;\theta)\right) = \sum_i P^*(y_i|x_i)\log \frac{P^*(y_i|x_i)}{P(\hat{y}_i|x_i;\theta)}
$$

Let’s simplify it a little bit further


$$
D_{KL}(P^*||P) = \sum_i P^*(y_i|x_i)\left[ \log P^*(y_i|x_i) - \log P(\hat{y}_i|x_i;\theta) \right]
$$

$$
D_{KL}(P^*||P) = \sum_i P^*(y_i|x_i) \log P^*(y_i|x_i) - \sum_i P^*(y_i|x_i)\log P(\hat{y}_i|x_i;\theta)
$$

The first term, $\sum_i P^*(y_i|x_i) \log P^*(y_i|x_i)$, does not depend on $\theta$, so it can be ignored when optimizing over $\theta$. The minimization therefore simplifies to:


$$
\arg\min_{\theta} D_{KL}(P^*||P) = \arg\min_{\theta}\left( -\sum_i P^*(y_i|x_i)\log P(\hat{y}_i|x_i;\theta) \right)
$$

The remaining term $-\sum_i P^*(y_i|x_i)\log P(\hat{y}_i|x_i;\theta)$ is exactly the cross entropy, so minimizing the cross-entropy loss is equivalent to minimizing the KL divergence between the true and predicted distributions. This completes the derivation from the KL divergence.
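In classification the true distribution $P^*(y_i|x_i)$ is usually one-hot, so the inner sum keeps only the log-probability the model assigns to the correct class. A minimal sketch of this (the predicted probabilities and labels below are made-up):

```python
import numpy as np

# Hypothetical softmax outputs of a 3-class model for two samples.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])  # true class indices, i.e. a one-hot P*

# -sum_i P*(y_i|x_i) log P(y_hat_i|x_i; theta); with one-hot P* this is just
# the negative log-probability of the true class (here averaged over samples,
# which only rescales the objective and does not change the minimizer).
loss = -np.mean(np.log(y_pred[np.arange(len(y_true)), y_true]))
print(loss)  # ~0.29
```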

Maximum likelihood

Suppose we have a sealed box. We know it contains a certain number of balls, some red and some blue, but we do not know the ratio of red to blue. We can draw balls from the box, but only one at a time, and each ball must be put back after we observe its color. By drawing a certain number of times and looking at the ratio of red to blue among the draws, we can estimate the ratio of red to blue balls in the box. This is the idea behind likelihood.

Suppose the observations are $C_1, C_2, \cdots, C_n$ — heads or tails in a coin experiment, or red or blue in the ball experiment. We write down the joint probability of these observed results, and then look for the probability distribution parameter that maximizes the probability of this observation.


$$
P(C_1,C_2,\cdots, C_n | \theta) = \prod_{i=1}^n P(C_i|\theta)
$$
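To make the ball example concrete, here is a small simulation sketch (the true red ratio of 0.6, the 20 draws, and the grid of candidate $\theta$ values are all made-up): we scan candidate values of $\theta$, and the product of per-draw probabilities peaks near the observed red ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ratio = 0.6                      # unknown to the observer in the story
draws = rng.random(20) < true_ratio   # 20 draws with replacement: True = red

def likelihood(theta, obs):
    """P(C_1,...,C_n | theta) = prod_i P(C_i | theta) for red/blue draws."""
    return np.prod(np.where(obs, theta, 1 - theta))

thetas = np.linspace(0.01, 0.99, 99)  # candidate values of theta
best = thetas[np.argmax([likelihood(t, draws) for t in thetas])]

print(best, draws.mean())  # the maximizing theta is (close to) the observed red ratio
```

For a Bernoulli model the maximizer is exactly the sample proportion, which matches the intuition in the story above.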

We can use a neural network to model this probability distribution, where the parameters $W, b$ determine the predicted probability distribution of the model. Assume this is a binary classification problem, so the prediction can be viewed as Dog vs. Cat.


$$
P(y_1,y_2,\cdots,y_n | W,b) = \prod_{i=1}^n P(y_i|W,b)
$$

$$
\hat{y}_i = NN(W,b) \\
P(y_1,y_2,\cdots,y_n | W,b) = \prod_{i=1}^n P(y_i|\hat{y}_i)
$$
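A minimal sketch of that substitution, using a hypothetical one-layer model whose weights $W, b$ and inputs are made-up values: the parameters enter the likelihood only through the predictions $\hat{y}_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-layer "network": (W, b) determine y_hat, so P(y_i | W, b)
# can equivalently be written as P(y_i | y_hat_i).
W, b = np.array([0.5, -0.3]), 0.1
x = np.array([[1.0, 2.0],
              [0.5, -1.0]])         # two made-up samples
y_hat = sigmoid(x @ W + b)          # predicted P(y_i = 1), e.g. P(Dog)
print(y_hat)
```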

That is, we use the neural network's prediction $\hat{y}_i$ in place of the parameters $W, b$. Since the label follows a 0/1 Bernoulli distribution, the likelihood can be written in the following form, and by taking its logarithm we can derive the cross-entropy formula from the likelihood.


$$
\prod_{i=1}^n \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}
$$

$$
\arg\max_{\theta} \sum_{i=1}^n \log\left[ \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i} \right]
$$

$$
\arg\max_{\theta} \sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1 - \hat{y}_i) \right]
$$

$$
\arg\min_{\theta} \left( -\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1 - \hat{y}_i) \right] \right)
$$
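Putting the last few steps into code: maximizing the log-likelihood is the same as minimizing its negative, which is the familiar binary cross-entropy loss. A minimal NumPy sketch (the labels and predicted probabilities below are made-up):

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-12):
    """-(1/n) * sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

y_true = np.array([1, 0, 1, 1])          # e.g. 1 = Dog, 0 = Cat
y_hat = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical sigmoid outputs

print(binary_cross_entropy(y_true, y_hat))
# The loss shrinks as y_hat moves toward y_true, i.e. as the likelihood grows.
```

Averaging over samples instead of summing only rescales the objective and does not change the minimizing $\theta$.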