This article is displayed with better formatting on my website; you are welcome to visit: lulaoshi.info/machine-lea…

Logistic Regression is widely used in Internet businesses for search, recommendation, and advertisement click-through-rate estimation. It is arguably the most frequently used machine learning model, and it is also the foundation of deep neural networks. In machine learning interviews, interviewers often ask for the basic formula of Logistic Regression and the derivation of its loss function.

From regression to classification

In a regression problem, the target takes values over the whole real line; in a classification problem, the target takes one of finitely many discrete values.

The previous articles systematically discussed the linear regression model:

$$z = \boldsymbol{w}^\top \boldsymbol{x}$$

This is a regression model: it predicts a target value $z$ in the range $(-\infty, +\infty)$. When solving the model, we can define the loss function as the squared error, and minimize the loss function to obtain the model parameters.

Now we want to do binary classification, where the target value has two options: 0 and 1. An ideal binary classification function can be expressed as:

$$y = \begin{cases} 0, & z < 0 \\ 0.5, & z = 0 \\ 1, & z > 0 \end{cases}$$

When $z < 0$, the example is classified as negative; when $z > 0$, it is classified as positive. This classification function is actually a step function: it is discontinuous at $z = 0$, and such a function is not easy to differentiate at $z = 0$. We need to replace this binary classification function with some other monotonically differentiable function.

We can now build on this linear regression by wrapping a function around it. The most common choice is:

$$y = \frac{1}{1 + e^{-z}}$$

The shape of this function is shown below. It is called the log-odds function, the Logistic function, or the Sigmoid function; hereafter we will refer to it as the Logistic function.

As can be seen from the figure, the Logistic function has some properties:

  • The domain of the function is $(-\infty, +\infty)$, and the range is $(0, 1)$.
  • As $z$ tends to $-\infty$, $y$ tends to $0$; as $z$ tends to $+\infty$, $y$ tends to $1$; when $z = 0$, $y$ equals $0.5$.
  • The whole function is S-shaped.
  • The function is monotonic and differentiable.

Strictly speaking, Sigmoid refers to a large family of S-shaped functions. The Logistic function we are discussing here is one member of the Sigmoid family, and the most representative one. Sigmoid functions will play an important role in neural networks.

These properties of the Logistic function mean that it can map $z \in (-\infty, +\infty)$ into $(0, 1)$, and that the value $y = 0.5$ at its center $z = 0$ can serve as a decision point. Because the Logistic function has a clear dividing line, inputs with $z < 0$ are classified as negative examples (0), and inputs with $z > 0$ are classified as positive examples (1).
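To make these properties concrete, here is a minimal NumPy sketch of the Logistic function (the function name `logistic` and the sample inputs are our own illustrative choices):

```python
import numpy as np

def logistic(z):
    """Logistic (Sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
y = logistic(z)
print(y)          # values approach 0 for z << 0, equal 0.5 at z = 0, approach 1 for z >> 0
print(y > 0.5)    # the z > 0 inputs fall on the positive side of the dividing line
```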

Substituting the linear regression into the Logistic function, we get:

$$y = \frac{1}{1 + e^{-\boldsymbol{w}^\top \boldsymbol{x}}}$$

By adding a Logistic function on top of linear regression, we can make binary classification predictions. Suppose the training set has $m$ samples; the $i$-th sample is fitted according to the following formula:

$$h(\boldsymbol{x}^{(i)}) = \frac{1}{1 + e^{-\boldsymbol{w}^\top \boldsymbol{x}^{(i)}}}$$

This is Logistic Regression.

Note that although the name Logistic Regression includes the term Regression, it is actually a well-known classification model.
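As a small illustrative sketch (assuming, as the formulas above do, that the bias is absorbed into $\boldsymbol{w}$ via a constant-1 feature), prediction looks like:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    """Logistic Regression prediction: h(x) = logistic(w^T x) for each row of X."""
    return logistic(X @ w)

# toy example: two features plus a constant-1 column for the bias
X = np.array([[1.0, 2.0, 1.0],
              [1.0, -1.0, 0.5]])
w = np.array([0.1, 0.5, -0.3])
print(predict_proba(w, X))  # probability of the positive class for each row, each in (0, 1)
```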

Binary probability interpretation of the Logistic function

The Logistic function is well suited to representing binary classification probabilities. Let $y$ denote the probability that an example is classified as positive; then $1 - y$ is the probability that it is classified as negative. Logistic Regression then has the following property:

$$\ln\frac{y}{1-y} = \boldsymbol{w}^\top \boldsymbol{x}$$

Here, $\frac{y}{1-y}$ is known as the Odds: the relative likelihood that the current example is classified as positive. $\ln\frac{y}{1-y}$ is the logarithm of the Odds, called the Log Odds, or Logit.

Let's briefly review probability. We know that probabilities are values on the $[0, 1]$ interval. Suppose the probability that something succeeds is $p$; then the probability of failure is $1 - p$, and the Odds of success are $\frac{p}{1-p}$. For example, with a success probability of 0.8, the Odds are $0.8 / 0.2 = 4$: in other words, success is much more likely than failure.

Back to Logistic Regression: the linear regression part, $\boldsymbol{w}^\top \boldsymbol{x}$, is trying to approximate the log odds $\ln\frac{y}{1-y}$. In effect, Logistic Regression models the likelihood of each class, yielding approximate probability predictions, which is why it is used in many tasks where decisions are assisted by probabilities. For example, many companies, including Google, have used Logistic Regression to predict whether an Internet ad will be clicked: during training, ads that were clicked are treated as positive examples and the rest as negative examples; at serving time, the higher the predicted probability, the more prominently the ad is placed to attract clicks.

The Logistic function maps $(-\infty, +\infty)$ onto $(0, 1)$: any value of $\boldsymbol{w}^\top \boldsymbol{x}$, no matter how large or small, corresponds to a probability in the $(0, 1)$ interval, so we obtain a probability distribution over the two classes.

Maximum likelihood estimation for Logistic Regression

Since the Logistic function can be related to probability, we can regard the model output $h(\boldsymbol{x})$ as the estimated probability that an example is classified as positive:

$$P(y = 1 \mid \boldsymbol{x}; \boldsymbol{w}) = h(\boldsymbol{x})$$

and the probability that it is classified as negative is:

$$P(y = 0 \mid \boldsymbol{x}; \boldsymbol{w}) = 1 - h(\boldsymbol{x})$$

The above two probabilities can be written as one more compact formula:

$$P(y \mid \boldsymbol{x}; \boldsymbol{w}) = \left( h(\boldsymbol{x}) \right)^{y} \left( 1 - h(\boldsymbol{x}) \right)^{1-y}$$

Since $y$ has only two possible values, 0 (negative) and 1 (positive), substituting $y = 1$ into the formula leaves $h(\boldsymbol{x})$, and substituting $y = 0$ leaves $1 - h(\boldsymbol{x})$. In the formula above, the semicolon before $\boldsymbol{w}$ indicates that $\boldsymbol{w}$ is a parameter, not a random variable.

With this probabilistic representation, it is easy to perform maximum likelihood estimation. The likelihood function has almost the same form as the joint probability of the data: the joint probability is the product of the probabilities of all the observed samples, and the likelihood is that same product regarded as a function of the parameters $\boldsymbol{w}$:

$$L(\boldsymbol{w}) = \prod_{i=1}^{m} P(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{w}) = \prod_{i=1}^{m} \left( h(\boldsymbol{x}^{(i)}) \right)^{y^{(i)}} \left( 1 - h(\boldsymbol{x}^{(i)}) \right)^{1-y^{(i)}}$$
Just as with linear regression, taking the logarithm of the formula above makes the likelihood easier to maximize:

$$\ell(\boldsymbol{w}) = \log L(\boldsymbol{w}) = \sum_{i=1}^{m} \left[ y^{(i)} \log h(\boldsymbol{x}^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h(\boldsymbol{x}^{(i)}) \right) \right]$$
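A minimal NumPy sketch of this log-likelihood (the function names are ours, and a small epsilon guards against $\log(0)$):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y, eps=1e-12):
    """Sum over samples of y*log(h) + (1-y)*log(1-h)."""
    h = logistic(X @ w)
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0)
    return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])  # constant-1 bias column plus one feature
y = np.array([1.0, 0.0, 1.0])
w = np.zeros(2)
print(log_likelihood(w, X, y))  # with w = 0, h = 0.5 everywhere, so this is 3 * log(0.5)
```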
How do we solve for the maximum of the formula above? Just as with linear regression, we can use a gradient method. Since the current goal is to maximize the likelihood function, we use gradient ascent and iterate until we find the maximum. Specifically, each parameter is updated as follows:

$$w_j := w_j + \alpha \frac{\partial \ell(\boldsymbol{w})}{\partial w_j}$$
The key to parameter estimation is obtaining the derivative. Before differentiating, let's write Logistic Regression out once more:

$$h(\boldsymbol{x}) = g(\boldsymbol{w}^\top \boldsymbol{x}), \qquad g(z) = \frac{1}{1 + e^{-z}}$$

The Logistic function $g(z)$ has a convenient derivative, $g'(z) = g(z)\left(1 - g(z)\right)$, because:

$$g'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = \frac{1}{1 + e^{-z}} \cdot \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z)\left( 1 - g(z) \right)$$
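As a quick numerical sanity check of this derivative property (the test point and finite-difference step below are arbitrary):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7      # an arbitrary test point
h = 1e-6     # finite-difference step
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)
analytic = logistic(z) * (1.0 - logistic(z))
print(numeric, analytic)  # the two values agree to within floating-point error
```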
Next we take the derivative with respect to the parameters. Assume for now that the training set contains only a single sample $(\boldsymbol{x}, y)$. The derivation below uses the derivative property of the Logistic function shown above:

$$\begin{aligned} \frac{\partial \ell(\boldsymbol{w})}{\partial w_j} &= \left( \frac{y}{g(\boldsymbol{w}^\top \boldsymbol{x})} - \frac{1-y}{1 - g(\boldsymbol{w}^\top \boldsymbol{x})} \right) \frac{\partial}{\partial w_j} g(\boldsymbol{w}^\top \boldsymbol{x}) \\ &= \left( \frac{y}{g(\boldsymbol{w}^\top \boldsymbol{x})} - \frac{1-y}{1 - g(\boldsymbol{w}^\top \boldsymbol{x})} \right) g(\boldsymbol{w}^\top \boldsymbol{x}) \left( 1 - g(\boldsymbol{w}^\top \boldsymbol{x}) \right) \frac{\partial}{\partial w_j} \boldsymbol{w}^\top \boldsymbol{x} \\ &= \left( y \left( 1 - g(\boldsymbol{w}^\top \boldsymbol{x}) \right) - (1-y)\, g(\boldsymbol{w}^\top \boldsymbol{x}) \right) x_j \\ &= \left( y - h(\boldsymbol{x}) \right) x_j \end{aligned}$$
So, in the concrete parameter update formula, the $i$-th training sample is used in the calculation as follows:

$$w_j := w_j + \alpha \left( y^{(i)} - h(\boldsymbol{x}^{(i)}) \right) x_j^{(i)}$$

This has exactly the same form as the update formula we derived for linear regression, so here too we can use gradient ascent to reach the optimal solution. Or, after a simple transformation (maximizing the likelihood is equivalent to minimizing its negative), gradient descent:

$$w_j := w_j - \alpha \left( h(\boldsymbol{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
The formulas above assume a single sample in the training set. When the training set has $m$ samples, taking the derivative of $\ell(\boldsymbol{w})$ with respect to $w_j$ actually gives:

$$\frac{\partial \ell(\boldsymbol{w})}{\partial w_j} = \sum_{i=1}^{m} \left( y^{(i)} - h(\boldsymbol{x}^{(i)}) \right) x_j^{(i)}$$
Updating the parameters with the full data set at every step is impractical. In most cases, stochastic gradient descent is used instead: a single sample is chosen at random to update the parameters, or a small Mini-batch of samples is chosen at random to update them.
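Putting the pieces together, here is a minimal mini-batch gradient descent sketch of training; the learning rate, batch size, and epoch count are arbitrary illustrative choices, not values from the article:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch gradient descent on the negative log-likelihood.

    X is (m, n) with a constant-1 bias column already appended; y is (m,) of 0/1 labels.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            h = logistic(X[batch] @ w)
            grad = X[batch].T @ (h - y[batch]) / len(batch)  # gradient of negative log-likelihood
            w -= lr * grad                                   # descent step
    return w

# toy usage: one informative feature plus a bias column
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
X = np.column_stack([x1, np.ones(200)])
y = (x1 + 0.3 * rng.normal(size=200) > 0).astype(float)
w = train_logistic_regression(X, y)
print(w)  # the weight on x1 should come out clearly positive
```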

Classification threshold

Logistic Regression ultimately returns a probability. We can use this predicted probability directly, or set a threshold to turn the probability into a binary classification.

For example, if the predicted probability of a user clicking an ad is 0.00023, which is much higher than that of the other ads, this ad will be placed in the most prominent position, where it is most likely to be clicked.

We can also convert the returned probability into a binary value; for example, if an email is predicted to have a high probability of being spam, we decide that it is spam. If the logistic regression model returns a probability of 0.9995 for an email, the model considers that email very likely to be spam; conversely, another email with a predicted score of 0.0003 from the same model is very likely not spam. But what if an email's predicted score is 0.6? To map logistic regression values to binary categories, we must specify a classification threshold (also known as a decision threshold): a return value above the threshold indicates spam, and a value below it indicates non-spam.

The classification threshold could simply be set to 0.5, but in practice it depends on the specific business scenario and on the proportion of positive and negative samples. The choice of threshold directly affects the prediction results. One method is to try a range of thresholds, compute Precision and Recall under each one, and then select the best threshold.
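As a sketch of that procedure (the labels and scores below are made-up toy values, and the helper name `precision_recall` is ours):

```python
import numpy as np

def precision_recall(y_true, probs, threshold):
    """Precision and Recall when probabilities are cut at the given threshold."""
    pred = (probs >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    fn = np.sum((pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
probs  = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])
for t in [0.3, 0.5, 0.7]:
    p, r = precision_recall(y_true, probs, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```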

Class imbalance in training data

The discussion of binary classification above has implicitly assumed that the numbers of training samples in the two classes are comparable. If the numbers of positive and negative samples do not differ much, this usually has no great impact on the results, but if they differ wildly, the learning process runs into trouble. For example, if a data set has 998 negative examples but only 2 positive ones, a model that predicts negative every time achieves 99.8% accuracy, yet such a model has no real value. Class imbalance is often encountered in practical classification problems, for example:

  • In the spam identification scenario, the number of spam messages is relatively small compared with normal ones.
  • In the credit-card fraud detection scenario, the vast majority of transactions are normal and only a small number of transactions are problematic.
  • In the case of machine fault alarm, the machine runs normally in most cases, but the failure time is far less than the normal operation time.

In logistic regression, we compare the predicted value $y$ against a threshold: usually, when $y > 0.5$ the example is predicted to be positive, and otherwise negative. $y$ actually expresses the probability that the example is positive, and $\frac{y}{1-y}$ reflects the ratio of the positive likelihood to the negative likelihood; when $\frac{y}{1-y} = 1$, the positive and negative possibilities are equal. Based on this assumption, if

$$\frac{y}{1-y} > 1$$

the prediction is a positive example.

However, when the numbers of positive and negative examples in the training set differ greatly, let $m^+$ be the number of positive examples and $m^-$ the number of negative examples; then the odds observed in the training set are $\frac{m^+}{m^-}$. So an example should be considered positive when its predicted odds exceed the odds observed in the training set, that is, when

$$\frac{y}{1-y} > \frac{m^+}{m^-}$$

Therefore, it is clearly inappropriate to keep using 0.5 as the threshold in this case.
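Rearranging, this odds condition is equivalent to the plain probability threshold $y > \frac{m^+}{m^+ + m^-}$. A tiny sketch, using the sample counts from the 998-versus-2 example above:

```python
def imbalance_threshold(n_pos, n_neg):
    """Probability threshold equivalent to the odds condition y/(1-y) > n_pos/n_neg."""
    return n_pos / (n_pos + n_neg)

# with 2 positives and 998 negatives, the threshold drops far below 0.5
t = imbalance_threshold(2, 998)
print(t)         # 0.002
print(0.01 > t)  # a predicted probability of 0.01 already counts as positive
```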

To deal with unbalanced data, there are generally two approaches:

  • Undersample or Oversample the training set, removing or adding data so that the numbers of positive and negative examples become close, and then train on the adjusted set.
  • Train directly on the original training set, but stop using 0.5 as the classification threshold; instead, adjust the threshold according to the business scenario and the numbers of positive and negative samples.

How to obtain balanced positive and negative samples needs to be discussed case by case. Undersampling discards data, making the training set much smaller than the original data, which may lose important information and hurt the model. Oversampling synthesizes data, but you cannot simply copy samples from the original data, otherwise the model will overfit. SMOTE (Synthetic Minority Over-sampling Technique) is the representative oversampling algorithm; its core idea is to synthesize data by interpolation. Assume all features in the data set are continuous, so the samples form a feature space: first pick a minority-class sample point from the data set, then find a neighboring sample of the same class; the two points define a line segment in the feature space, and a random point on this segment is used as newly generated data to add to the training set.
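A bare-bones sketch of that interpolation step (a real SMOTE implementation chooses among k nearest neighbors; here we just take the single nearest neighbor for brevity):

```python
import numpy as np

def smote_sample(minority, rng):
    """Generate one synthetic minority sample by interpolating toward a nearest neighbor."""
    i = rng.integers(len(minority))
    base = minority[i]
    # nearest same-class neighbor (excluding the point itself)
    dists = np.linalg.norm(minority - base, axis=1)
    dists[i] = np.inf
    neighbor = minority[np.argmin(dists)]
    # random point on the segment between base and neighbor
    t = rng.random()
    return base + t * (neighbor - base)

rng = np.random.default_rng(42)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
print(smote_sample(minority, rng))  # a new point between two existing minority samples
```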

Of course, when the sample data are unbalanced, it may be more appropriate to choose a different algorithm altogether, such as a decision tree model: the way decision trees work gives them better performance on unbalanced samples.

References

  1. Andrew Ng: CS229 Lecture Notes
  2. Zhou Zhihua: Machine Learning
  3. developers.google.com/machine-lea…