So today we looked briefly at a very basic algorithm in machine learning called logistic regression, which is called regression but is really classification.
Remember the linear regression model Y = wX + b: can we fit the classification problem into it? For example, when Y is 0 or more, predict category 1, and when Y is less than 0, predict category 0. In essence, this uses a regression line directly to divide the data, but there are several problems with this method:
1. Even if every point is classified correctly, points that lie far from the rest pull the regression line toward them; the fit moves to accommodate these distant points even though they are in fact correctly classified (a small numeric sketch follows this list).
2. If there are multiple categories, say regression targets of 1, 2, and 3, there is clearly an implicit assumption that category 1 is “closer” to category 2 and “farther” from category 3, which is not an assumption we want to make for a classification problem.
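To make the first problem concrete, here is a small numeric sketch (the data and function names are mine, and I threshold the fitted value at 0.5 for 0/1 labels, a slight variant of the thresholding described above): a single far-away but correctly labeled point drags the least-squares line and shifts the decision boundary enough to misclassify nearby points.

```python
import numpy as np

# 1-D toy data: class 0 on the left, class 1 on the right
X = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def ls_decision_boundary(X, y):
    """Least-squares fit y ~ w*x + b, return the x where w*x + b = 0.5."""
    A = np.stack([X, np.ones_like(X)], axis=1)
    w, b = np.linalg.lstsq(A, y, rcond=None)[0]
    return (0.5 - b) / w

print(ls_decision_boundary(X, y))    # boundary around 4.5, between the two groups

# Add one far-away but correctly classified point of class 1
X2, y2 = np.append(X, 100.0), np.append(y, 1.0)
print(ls_decision_boundary(X2, y2))  # boundary shifts to ~6.6, misclassifying the point at x = 6
```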
So, we need a better tool for classification. A simple idea: don't use MSE when optimizing the loss function, because MSE makes the regression line “cater” to far-away points. Since what we are doing is a classification problem, a natural idea is: a wrong classification counts as loss, while a correct classification costs nothing (the 0-1 loss). The problem with this function is that it cannot be differentiated. In fact, the perceptron and SVM build on this idea with their own surrogate losses. But we are going to do something else here.
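To see why the 0-1 loss gives gradient methods nothing to work with, here is a tiny illustrative snippet (names are mine): the loss only counts mistakes, so it is piecewise constant in the parameters and its gradient is zero almost everywhere.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 for each misclassified example, 0 otherwise (averaged)."""
    return np.mean(y_true != y_pred)

y_true = np.array([0, 1, 1, 0])
print(zero_one_loss(y_true, np.array([0, 1, 0, 0])))  # 0.25
# The loss only changes when a prediction flips class, so as a function of
# the parameters it is a step function: flat almost everywhere and not
# differentiable at the jumps, which gradient descent cannot use directly.
```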
First, we introduce the function sigmoid(z) = 1/(1+e^{-z}) without derivation. In fact, this function can be derived from a generative classification model based on Gaussian class-conditional distributions. Looking at it, we find that the sigmoid has several advantages:
1. The output lies between 0 and 1 and can be read directly as the probability of the class;
2. It is insensitive to outliers: far from the decision boundary the curve saturates, so extreme points have little influence on the predicted probability.
So we can use the sigmoid as our model's output function, with z = wX + b.
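A minimal sketch of the model itself (variable names are mine, not from the original): z = wX + b is passed through the sigmoid, and the output is read as P(y = 1 | x).

```python
import numpy as np

def sigmoid(z):
    """sigmoid(z) = 1 / (1 + e^{-z}), output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Logistic regression forward pass: P(y = 1 | x) = sigmoid(wX + b)."""
    return sigmoid(X @ w + b)

X = np.array([[0.5, 1.2], [2.0, -0.3]])  # toy features
w = np.array([1.0, -2.0]); b = 0.1       # toy parameters
print(predict_proba(X, w, b))            # probabilities in (0, 1)
# Large |z| saturates: sigmoid(10) ~ 0.99995 and sigmoid(12) ~ 0.999994,
# so far-away (outlier) points barely change the predicted probability.
```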
Next comes the choice of loss function. Since this is a classification problem, to optimize well we use the cross-entropy formula, which from another perspective is the negative log-likelihood of a Bernoulli distribution: loss = -y log(y_pred) - (1-y) log(1-y_pred). Of course, you can also use the mean squared error loss, but the consequence is that the optimization is slow: combined with the sigmoid, the gradients shrink and gradient descent converges much more slowly.
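A small sketch of that loss, with illustrative names of my own; the clipping is just to keep log(0) from blowing up numerically.

```python
import numpy as np

def binary_cross_entropy(y, y_pred, eps=1e-12):
    """loss = -y*log(y_pred) - (1-y)*log(1-y_pred), averaged over examples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return np.mean(-y * np.log(y_pred) - (1.0 - y) * np.log(1.0 - y_pred))

y      = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, y_pred))  # small when predictions match the labels
```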
Finally, take the partial derivatives of the loss to get the gradient, use gradient descent to optimize and obtain the model's parameters, and then use the sigmoid function to predict.
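Putting it together, a minimal gradient-descent sketch under the same toy setup (all names here are mine): for the cross-entropy loss with a sigmoid output, the gradient with respect to w works out to X^T(y_pred - y)/n, and each update takes a small step against that gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_steps=2000):
    """Train w, b by gradient descent on the cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        y_pred = sigmoid(X @ w + b)
        grad_w = X.T @ (y_pred - y) / n  # dLoss/dw for cross-entropy + sigmoid
        grad_b = np.mean(y_pred - y)     # dLoss/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy, linearly separable 1-D data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic_regression(X, y)
print(sigmoid(X @ w + b))  # below 0.5 for the first two points, above 0.5 for the last two
```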
The sigmoid is not only the “key player” in logistic regression, but also plays an important role in deep learning: it is one of the activation functions to consider for the last layer when a deep neural network does classification.