I had not thought through these details carefully before, but I came across some relevant material today, so I am recording it here.
LR (logistic regression) is a generalized linear model. If we use squared loss with the sigmoid, the resulting objective is not guaranteed to be convex, so gradient-based optimization may get stuck in a local minimum instead of reaching the global optimum. The second point is that once you take the logarithm of the likelihood, the derivatives become much easier to compute.
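A quick way to see the convexity issue is to look at the curvature of the two losses for a single training example (here I assume a toy example with x = 1, y = 1, so the loss is a function of a single weight w, and I approximate the second derivative numerically; this is just an illustrative sketch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy setup: one example with x = 1, y = 1, so loss depends only on w.
def squared_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def log_loss(w):
    # negative log-likelihood for y = 1
    return -math.log(sigmoid(w))

def second_derivative(f, w, h=1e-3):
    # central finite-difference approximation of f''(w)
    return (f(w + h) - 2 * f(w) + f(w - h)) / (h * h)

# Squared loss: curvature is negative at w = -4 but positive at w = 0,
# so the loss surface is not convex.
print(second_derivative(squared_loss, -4.0) < 0)  # True
print(second_derivative(squared_loss, 0.0) > 0)   # True

# Log loss: curvature (sigma * (1 - sigma)) is non-negative everywhere.
print(all(second_derivative(log_loss, w) >= 0 for w in (-4.0, 0.0, 4.0)))  # True
```

Because the squared-loss curvature changes sign, gradient descent can stall in a flat or locally convex region, whereas the log loss is convex in w and has a single global minimum.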
In addition, working directly with the likelihood function has two disadvantages: (1) it is inconvenient for the subsequent derivation, and (2) evaluating the likelihood directly can cause numerical underflow.
The underflow happens because the likelihood is a product of many per-sample probabilities, each less than 1; with a large number of samples, the product becomes smaller than the smallest representable floating-point value and underflows to zero.
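The underflow is easy to reproduce. In the sketch below I assume 200 samples that each contribute a likelihood of 0.01; the raw product underflows IEEE 754 double precision (whose smallest positive value is about 5e-324), while the summed log-likelihood stays perfectly representable:

```python
import math

# 200 samples, each with per-sample likelihood 0.01 (an assumed toy value)
probs = [0.01] * 200

# Direct product: 0.01 ** 200 = 1e-400, far below the smallest
# positive double (~5e-324), so it underflows to exactly 0.0.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing logs instead: 200 * ln(0.01) is about -921, a perfectly
# ordinary floating-point number.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)
```

This is exactly why the log-likelihood is maximized in practice: the sum of logs carries the same information as the product but never leaves the representable range.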