
Today we derive the formula of the cross-entropy loss function in two ways, and along the way explain why cross entropy can be used as the loss function for classification problems.

Maximum likelihood estimation

In statistics, maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability model from observed data.

KL divergence

KL divergence was introduced as a measure of the difference between two probability distributions.


$$
D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
$$
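As a quick sanity check on this definition, here is a minimal NumPy sketch (the two distributions below are made-up example values):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_i P(i) * log(P(i) / Q(i)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])  # example distribution P
q = np.array([0.5, 0.3, 0.2])  # example distribution Q

print(kl_divergence(p, q))  # > 0 when P and Q differ
print(kl_divergence(p, p))  # 0.0 when the two distributions are identical
```

Note that in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, which is why it is called a divergence rather than a true distance.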

The KL divergence measures how far the distribution $Q(i)$ is from $P(i)$. For our classification problem, we measure, for each sample, the divergence between the true data distribution and the distribution predicted by the model:


$$
D_{KL}\left(P^*(y_i|x_i)\,||\,P(\hat{y}_i|x_i;\theta)\right)
$$
  • $P(\hat{y}_i|x_i;\theta)$ is the distribution predicted by the model, where $\theta$ denotes the parameters of the neural network, $x_i$ is the $i$-th sample (for example, an image), and $\hat{y}_i$ is the prediction output by the neural network
  • $P^*(y_i|x_i)$ is the true distribution of the data

$$
D_{KL}\left(P^*(y_i|x_i)\,||\,P(\hat{y}_i|x_i;\theta)\right) = \sum_i P^*(y_i|x_i)\log \frac{P^*(y_i|x_i)}{P(\hat{y}_i|x_i;\theta)}
$$

Let’s simplify it a little bit further


$$
D_{KL}(P^*||P) = \sum_i P^*(y_i|x_i)\left[ \log P^*(y_i|x_i) - \log P(\hat{y}_i|x_i;\theta) \right]
$$

$$
D_{KL}(P^*||P) = \sum_i P^*(y_i|x_i) \log P^*(y_i|x_i) - \sum_i P^*(y_i|x_i)\log P(\hat{y}_i|x_i;\theta)
$$

The first term, $\sum_i P^*(y_i|x_i) \log P^*(y_i|x_i)$, does not depend on $\theta$, so it can be ignored when optimizing over $\theta$. The minimization therefore simplifies to:


$$
\arg\min_{\theta} D_{KL}(P^*||P) = \arg\min_{\theta}\left( -\sum_i P^*(y_i|x_i)\log P(\hat{y}_i|x_i;\theta) \right)
$$

The remaining term $-\sum_i P^*(y_i|x_i)\log P(\hat{y}_i|x_i;\theta)$ is exactly the cross entropy, so minimizing the cross-entropy loss is equivalent to minimizing the KL divergence between the true and predicted distributions. This completes the derivation from the KL divergence.
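In classification the true distribution $P^*(y_i|x_i)$ is usually one-hot, so the inner sum keeps only the log-probability the model assigns to the correct class. A minimal sketch of this (the predicted probabilities and labels below are made-up):

```python
import numpy as np

# Hypothetical softmax outputs of a 3-class model for two samples.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
y_true = np.array([0, 1])  # true class indices, i.e. a one-hot P*

# -sum_i P*(y_i|x_i) log P(y_hat_i|x_i; theta); with one-hot P* this is just
# the negative log-probability of the true class (here averaged over samples,
# which only rescales the objective and does not change the minimizer).
loss = -np.mean(np.log(y_pred[np.arange(len(y_true)), y_true]))
print(loss)  # ~0.29
```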

Maximum likelihood

Suppose we have a sealed box. We know it contains a certain number of balls, some red and some blue, but we do not know the ratio of red to blue. We can draw balls from the box, but only one at a time, and each ball must be put back after we observe its color. By drawing a certain number of times and looking at the ratio of red to blue among the draws, we can estimate the ratio of red to blue balls in the box. This is the idea behind likelihood.

Suppose the observations are $C_1, C_2, \cdots, C_n$ — heads or tails in a coin experiment, or red or blue in the ball experiment. We write down the joint probability of these observed results, and then look for the probability distribution parameter that maximizes the probability of this observation.


$$
P(C_1,C_2,\cdots, C_n | \theta) = \prod_{i=1}^n P(C_i|\theta)
$$
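To make the ball example concrete, here is a small simulation sketch (the true red ratio of 0.6, the 20 draws, and the grid of candidate $\theta$ values are all made-up): we scan candidate values of $\theta$, and the product of per-draw probabilities peaks near the observed red ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
true_ratio = 0.6                      # unknown to the observer in the story
draws = rng.random(20) < true_ratio   # 20 draws with replacement: True = red

def likelihood(theta, obs):
    """P(C_1,...,C_n | theta) = prod_i P(C_i | theta) for red/blue draws."""
    return np.prod(np.where(obs, theta, 1 - theta))

thetas = np.linspace(0.01, 0.99, 99)  # candidate values of theta
best = thetas[np.argmax([likelihood(t, draws) for t in thetas])]

print(best, draws.mean())  # the maximizing theta is (close to) the observed red ratio
```

For a Bernoulli model the maximizer is exactly the sample proportion, which matches the intuition in the story above.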

We can use a neural network to model this probability distribution, where the parameters $W, b$ determine the predicted probability distribution of the model. Assume this is a binary classification problem, so the prediction can be viewed as Dog vs. Cat.


$$
P(y_1,y_2,\cdots,y_n | W,b) = \prod_{i=1}^n P(y_i|W,b)
$$

$$
\hat{y}_i = NN(W,b) \\
P(y_1,y_2,\cdots,y_n | W,b) = \prod_{i=1}^n P(y_i|\hat{y}_i)
$$
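A minimal sketch of that substitution, using a hypothetical one-layer model whose weights $W, b$ and inputs are made-up values: the parameters enter the likelihood only through the predictions $\hat{y}_i$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-layer "network": (W, b) determine y_hat, so P(y_i | W, b)
# can equivalently be written as P(y_i | y_hat_i).
W, b = np.array([0.5, -0.3]), 0.1
x = np.array([[1.0, 2.0],
              [0.5, -1.0]])         # two made-up samples
y_hat = sigmoid(x @ W + b)          # predicted P(y_i = 1), e.g. P(Dog)
print(y_hat)
```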

That is, we use the neural network's prediction $\hat{y}_i$ in place of the parameters $W, b$. Since the label follows a 0/1 Bernoulli distribution, the likelihood can be written in the following form, and by taking its logarithm we can derive the cross-entropy formula from the likelihood.


$$
\prod_{i=1}^n \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i}
$$

$$
\arg\max_{\theta} \sum_{i=1}^n \log\left[ \hat{y}_i^{y_i}(1 - \hat{y}_i)^{1 - y_i} \right]
$$

$$
\arg\max_{\theta} \sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1 - \hat{y}_i) \right]
$$

$$
\arg\min_{\theta} \left( -\sum_{i=1}^n \left[ y_i \log \hat{y}_i + (1-y_i)\log(1 - \hat{y}_i) \right] \right)
$$
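Putting the last few steps into code: maximizing the log-likelihood is the same as minimizing its negative, which is the familiar binary cross-entropy loss. A minimal NumPy sketch (the labels and predicted probabilities below are made-up):

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-12):
    """-(1/n) * sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

y_true = np.array([1, 0, 1, 1])          # e.g. 1 = Dog, 0 = Cat
y_hat = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical sigmoid outputs

print(binary_cross_entropy(y_true, y_hat))
# The loss shrinks as y_hat moves toward y_true, i.e. as the likelihood grows.
```

Averaging over samples instead of summing only rescales the objective and does not change the minimizing $\theta$.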