This article briefly introduces the definition and principle of Logistic Regression. In the Linear Regression model, the input is $x$, the network parameters are $w$ and $b$, and the output is $y$, a continuous value. However, the final output of a classification problem should be discrete, so how do we turn this into a classification model?
Consider adding a $\sigma$ function, also called the sigmoid or logistic function, so that $y = \sigma(wx + b)$. This compresses the output into $[0,1]$, and we can interpret this value as a probability.
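A minimal sketch of this mapping in Python, assuming a single scalar feature; the values of `w`, `b`, and `x` below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters for the 1-D model y = sigma(w*x + b)
w, b = 2.0, -1.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

y = sigmoid(w * x + b)  # each entry lies in (0, 1), readable as P(class = 1 | x)
print(y)
```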
For Regression problems, the goal is to make $pred$ (the predicted value) approximate the output $y$, that is, to minimize $\text{dist}(pred, y)$. For Classification problems, the goal is to maximize accuracy, or equivalently to minimize $\text{dist}(p_\theta(y|x), p_r(y|x))$. The main difference between the two is the training objective, which raises a question: why not maximize accuracy directly?
Because the usual accuracy formula is $\text{acc} = \frac{\text{number of correct predictions}}{\text{total number}}$, and maximizing this quantity directly leads to two problems.
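In code, that formula looks like this (a minimal sketch; `preds` and `labels` are made-up example arrays):

```python
import numpy as np

preds  = np.array([0.9, 0.4, 0.7, 0.2])  # predicted probabilities
labels = np.array([1,   1,   1,   0  ])  # ground-truth classes

pred_classes = (preds > 0.5).astype(int)  # threshold at 0.5
acc = (pred_classes == labels).mean()     # correct predictions / total
print(acc)  # 0.75
```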
First, for a binary classification problem, assume the threshold is 0.5: if $pred > 0.5$ we assign the first class, and if $pred < 0.5$ we assign the second. At the beginning there will certainly be misclassifications. Suppose the true class is 1 and the predicted value is 0.4, so the network classifies it as 0. After a network update the predicted value becomes 0.45, which is closer to the true value 1, but still not greater than 0.5, so nothing essential changes and the accuracy stays the same.
Second, suppose a predicted value starts at 0.499 with true class 1, and after a network update it becomes 0.501. The prediction is now correct, but a tiny change in the weights flipped the accuracy, so the "gradient" of accuracy can be extremely large or even discontinuous.
The above two issues can be summed up as follows (see the numeric sketch after the list):
- gradient = 0 when accuracy is unchanged even though the weights changed
- the gradient is not continuous, since the number of correct predictions is not continuous
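A numeric sketch of both points, using finite differences on a toy one-parameter model $y = \sigma(wx)$; all data values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x      = np.array([0.2, -1.0, 0.8])  # toy inputs
labels = np.array([1,    0,   1  ])  # toy ground-truth classes

def accuracy(w):
    pred = sigmoid(w * x) > 0.5
    return (pred == labels).mean()

def cross_entropy(w):
    p = sigmoid(w * x)
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()

eps, w = 1e-4, 1.0
# Finite-difference "gradient" of accuracy: 0.0, because accuracy is
# piecewise constant (flat almost everywhere, jumps at the threshold)
print((accuracy(w + eps) - accuracy(w - eps)) / (2 * eps))
# Finite-difference gradient of cross entropy: smooth and nonzero,
# so it gives a usable training signal
print((cross_entropy(w + eps) - cross_entropy(w - eps)) / (2 * eps))
```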
One last question: why is it called logistic regression? "Logistic" is easy to understand, because the $\sigma$ (logistic) function is used, but why "regression" and not "classification"? There is a lot of debate online about this. One explanation is that classification was originally trained with MSE, the loss commonly used for regression, so the method was called regression. Nowadays classification is usually solved with Cross Entropy instead, but the name is so well established that it hasn't changed.
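For completeness, here is a hedged sketch comparing the two losses mentioned above on the binary case; `p` and `y` are made-up toy values:

```python
import numpy as np

p = np.array([0.9, 0.3, 0.6])  # predicted P(class = 1)
y = np.array([1,   0,   1  ])  # true labels

mse = ((p - y) ** 2).mean()                              # regression-style loss
ce  = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()  # binary cross entropy
print(mse, ce)
```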