Cross entropy loss is one of the most widely used loss functions in deep learning. It is built on the concept of cross entropy. When I started using this loss function, I had a hard time understanding the intuition behind it. After reading through various materials I finally got a satisfactory picture, and I want to share it in this article.
To fully understand it, we need to go through the concepts in the following order: self-information, entropy, cross entropy, and cross entropy loss.
Self-information
Self-information answers the question: “How surprised were you by the result?”
A result with low probability brings more information than a result with high probability. If $y_i$ is the probability of the $i$-th result, we can express the self-information $S$ as:

$S_i = -\log(y_i)$
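For a quick numerical feel, here is a minimal NumPy sketch (the probability values are made up for illustration) showing that rarer outcomes carry more self-information:

```python
import numpy as np

# Hypothetical outcome probabilities (not from the article), from likely to rare
probabilities = np.array([0.9, 0.5, 0.1])

# Self-information S_i = -log(y_i): the rarer the outcome, the larger the value
self_information = -np.log(probabilities)

print(self_information)  # ~[0.105, 0.693, 2.303] nats
```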
Entropy
Now that I know the self-information an event produces for a given outcome, I want to know how much self-information the event brings on average. A weighted average of the self-information $S$ is the intuitive choice; the question is which weights to use. Since I know the probability of each outcome, it makes sense to weight by those probabilities, because that is how often each outcome actually occurs. This probability-weighted average of self-information is the entropy $E$. For $n$ possible results it can be written as:

$E = \sum_{i=1}^{n} y_i S_i = -\sum_{i=1}^{n} y_i \log(y_i)$
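A small sketch of that weighted average, using made-up coin-flip distributions, shows how predictability lowers the entropy:

```python
import numpy as np

def entropy(probs):
    """Entropy E = -sum_i y_i * log(y_i): the probability-weighted average self-information."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs))

print(entropy([0.5, 0.5]))  # fair coin, maximally uncertain: ~0.693 nats
print(entropy([0.9, 0.1]))  # biased coin, more predictable:  ~0.325 nats
```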
Cross entropy
Now, what if the actual probability of each outcome is $p_i$, but someone estimates it to be $q_i$? Each event still occurs with probability $p_i$, yet the self-information in the formula is computed from $q_i$ (because that person believes the probability of the outcome is $q_i$). In this case the weighted average self-information becomes the cross entropy $C$, which can be written as:

$C = -\sum_{i=1}^{n} p_i \log(q_i)$

Cross entropy is always greater than or equal to entropy; the two are equal only when $p_i = q_i$ for every $i$. You can play with the illustration at https://www.desmos.com/calculator/zytm2sf56e to help build intuition.
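The same point can be checked numerically. Below is a small sketch (the distributions are made up for illustration): when the estimate $q$ equals the true distribution $p$, the cross entropy reduces to the entropy of $p$; any other estimate makes it larger.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy C = -sum_i p_i * log(q_i): events occur with probability p_i,
    but self-information is measured with the estimated probabilities q_i."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

p = [0.7, 0.2, 0.1]                        # actual distribution (illustrative)
print(cross_entropy(p, p))                 # ~0.802, equals the entropy of p
print(cross_entropy(p, [0.4, 0.4, 0.2]))   # ~0.986, larger because the estimate is off
```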
Cross entropy loss
(In the graph, the purple line represents the area under the blue curve, the orange line is the estimated probability distribution, and the red line is the actual probability distribution.)
In the graph linked above, you will notice that the cross entropy increases as the estimated probability distribution deviates from the actual/expected one, and decreases as they get closer. Minimizing cross entropy therefore pulls the predicted distribution toward the actual distribution, which is exactly what we want during training. The cross entropy loss follows directly:

$\text{Loss} = -\sum_{i=1}^{n} p_i \log(q_i)$

where $p_i$ is the actual (usually one-hot) probability of class $i$ and $q_i$ is the predicted probability.
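To connect this to training code, here is a sketch with made-up logits and labels, comparing a hand-written version of the loss against PyTorch's nn.CrossEntropyLoss (which applies softmax and the negative log-likelihood internally); both should give the same number:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for a batch of 2 samples over 3 classes, with their true class indices
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
targets = torch.tensor([0, 1])

# Manual version: softmax produces q_i, the one-hot target plays the role of p_i,
# so the per-sample loss collapses to -log(q_true_class)
q = F.softmax(logits, dim=1)
manual_loss = -torch.log(q[torch.arange(len(targets)), targets]).mean()

# Built-in version (softmax + negative log-likelihood in one call)
builtin_loss = nn.CrossEntropyLoss()(logits, targets)

print(manual_loss.item(), builtin_loss.item())  # the two values match
```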
In the case of a binary classification problem with only two classes, the loss is called binary cross entropy loss, and the above formula becomes:

$\text{Loss} = -\big[p \log(q) + (1 - p)\log(1 - q)\big]$

where $p \in \{0, 1\}$ is the true label and $q$ is the predicted probability of the positive class.
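And a matching sketch for the binary case, again with made-up numbers, comparing the formula above with PyTorch's nn.BCELoss (which expects probabilities; use BCEWithLogitsLoss if you have raw logits):

```python
import torch
import torch.nn as nn

# Made-up predicted probabilities q and binary labels p for three samples
q = torch.tensor([0.9, 0.2, 0.6])
p = torch.tensor([1.0, 0.0, 1.0])

# Manual binary cross entropy: -[p*log(q) + (1-p)*log(1-q)], averaged over the batch
manual_bce = -(p * torch.log(q) + (1 - p) * torch.log(1 - q)).mean()

# Built-in version
builtin_bce = nn.BCELoss()(q, p)

print(manual_bce.item(), builtin_bce.item())  # the two values match
```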