Before continuing with GBDT (Gradient Boosting Decision Tree), we need to understand the Logistic Regression algorithm first: GBDT is more complex, but it is much easier to understand on the basis of Logistic Regression.

Logistic regression is one of the most basic algorithms in machine learning and one of the most widely used in industry, thanks to its simplicity, efficiency and practicality.

Although linear regression is also very simple, it is less practical, because logistic regression is essentially a probability model. In many practical applications, predicting a probability between 0 and 1 is far more useful than predicting an arbitrary real number. For example, in the advertising business we often need the probability that a user will click on an advertisement.

Logistic regression is a probabilistic model, but through a certain transformation we can map its prediction range from (0, 1) to the whole real line, so both logistic regression and linear regression belong to the family of Generalized Linear Models. To understand this transformation, we need to introduce a concept: odds and log(odds).

Odds and log(odds)

According to Wikipedia, the term is mainly used in the fields of gambling and statistics, and it dates back to the 16th century, predating the development of probability theory.

Odds are easy to understand. Take soccer as an example: if China plays Brazil and, out of 100 matches, China wins 1 and loses 99, then the odds of China winning are 1/99 and the odds of China losing (Brazil winning) are 99. The difference between odds and probability is also easy to see from this example: China has a probability of 0.01 of beating Brazil and a probability of 0.99 of losing.
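
In general (using the standard definition), if an event has probability p, the two quantities are related by

$$\text{odds} = \frac{p}{1-p}, \qquad p = \frac{\text{odds}}{1+\text{odds}}$$

so for China the odds of winning are 0.01/0.99 = 1/99.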

This example also shows that the odds of the Chinese team winning and of the Brazilian team winning fall in different ranges: the odds of China winning fall in (0, 1), while the odds of Brazil winning fall in (1, ∞). In other words, the two outcomes are equally lopsided, but 1/99 and 99 live on different scales, which makes an intuitive comparison difficult. log(odds) solves this problem:

              China wins    Brazil wins
odds          1/99          99
log(odds)     -4.60         4.60

After taking the log of the odds, the log(odds) of China winning and of Brazil winning both have absolute value 4.6, so it is obvious at a glance that the two outcomes are equally lopsided. The sign then tells us which way: for example, -4.6 means the Chinese team is unlikely to win, and a log(odds) of 0 means wins and losses are equally likely.

log(odds) is a useful metric. You can write a program that repeatedly generates a random number of wins between 0 and 100, treats the remainder as losses, computes the corresponding log(odds), and draws the results as a histogram; you can see that the distribution is approximately normal:
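
A minimal sketch of such a program (my own illustration, assuming NumPy and Matplotlib and drawing the win count uniformly at random):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# For each trial, draw a random number of wins out of 100 games,
# treat the rest as losses, and compute log(odds) = log(wins / losses).
wins = rng.integers(1, 100, size=100_000)   # 1..99, avoids dividing by zero
losses = 100 - wins
log_odds = np.log(wins / losses)

# The histogram of log(odds) is bell-shaped and centered at 0.
plt.hist(log_odds, bins=50)
plt.xlabel("log(odds)")
plt.ylabel("count")
plt.show()
```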

In practice, we can replace the win/loss counts above with the clicks or purchases on a website, compute the corresponding log(odds) distribution from historical data, and then find a set of related features to fit that distribution. This is what we call a CTR (Click-Through Rate) or CVR (Conversion Rate) model: we feed a user's features into the model to compute the corresponding log(odds), which can then be converted into the probability that the user will click on or buy a certain product.
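
For instance, a toy sketch with made-up click counts (not real data) of how the historical log(odds) in such a model could be computed:

```python
import numpy as np

# Made-up historical data: clicks and non-clicks for a few ad slots.
clicks     = np.array([120,   45,  300,   8])
non_clicks = np.array([9880, 4955, 9700, 992])

ctr = clicks / (clicks + non_clicks)      # click-through rate, a probability in (0, 1)
log_odds = np.log(clicks / non_clicks)    # the corresponding log(odds)

print(ctr)       # [0.012 0.009 0.03  0.008]
print(log_odds)  # the values the model's features would be fitted against
```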

At this point some of you will ask: what does this have to do with logistic regression? In fact, there is another way to calculate log(odds):

$$\log(\text{odds}) = \log\!\left(\frac{p}{1-p}\right)$$

This is easy to understand using the same example: the probability of the Chinese team winning is p = 0.01, so the log(odds) of the Chinese team winning is

$$\log(\text{odds}) = \log\!\left(\frac{0.01}{1-0.01}\right) = \log\!\left(\frac{1}{99}\right) \approx -4.6$$

Now raise e to the power of both sides of this equation and solve for p:

$$p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} = \frac{1}{1 + e^{-\log(\text{odds})}}$$

This is the well-known logistic regression. The expression on the right-hand side of the equation is usually called the sigmoid function, while log(odds) is also called the logit function. The conversion relationship between them is shown in the figure below, where x can be regarded as a feature vector.

As can be seen from the figure, if logistic regression is converted into log(odds), there are two obvious changes:

  1. log(odds) is a straight line
  2. log(odds) broadens the output range of logistic regression from (0, 1) to (-∞, +∞) (see the numeric check below)
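
As a quick numeric check of this relationship (a small sketch using the standard definitions of the sigmoid and logit functions):

```python
import numpy as np

def sigmoid(z):
    """Map log(odds) in (-inf, +inf) back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability in (0, 1) to log(odds) in (-inf, +inf)."""
    return np.log(p / (1.0 - p))

p = np.array([0.01, 0.5, 0.99])
z = logit(p)          # ≈ [-4.6, 0, 4.6]
print(z)
print(sigmoid(z))     # recovers [0.01, 0.5, 0.99]
```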

All of a sudden it looks like linear regression. The difference is that the samples of logistic regression take only the two values 0 and 1, which correspond exactly to -∞ and +∞ when converted to log(odds), so fitting with MSE would always give an infinite Loss. It is therefore not feasible to fit logistic regression with the method of linear regression; instead, logistic regression uses Maximum Likelihood as the Loss of the model.

Maximum Likelihood

Maximum Likelihood is also an intuitive concept: I have a set of positive samples and negative samples, and I ask which logistic regression curve fits these samples so that the product of their probabilities is maximized.

For example, suppose the left side of the figure below shows experimental data about weight and obesity, where the green dots are normal and the red dots are obese. Now model these samples with logistic regression, and assume the optimal model is the one shown on the right side of the figure:

According to this model, suppose the probabilities of obesity for the green samples are, from left to right, 0.01, 0.02, 0.03 and 0.9. The green samples are normal samples, so we need the probability that they are not obese, i.e. these values subtracted from 1: 0.99, 0.98, 0.97 and 0.1. Similarly, the probabilities that the red samples are obese are 0.1, 0.97, 0.98 and 0.99. Because the curve is already optimal, the product of the probabilities of these 8 points, about 0.0089, is the maximum obtainable over all possible models. So Maximum Likelihood means exactly what the name says: the likelihood is maximized.
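
Concretely, the likelihood here is just the product of the eight per-sample probabilities quoted above (a few lines of code to verify the 0.0089 figure):

```python
# Per-sample probabilities read off the optimal curve in the example above.
green_not_obese = [0.99, 0.98, 0.97, 0.1]   # green (normal) samples: P(not obese)
red_obese       = [0.1, 0.97, 0.98, 0.99]   # red (obese) samples: P(obese)

likelihood = 1.0
for p in green_not_obese + red_obese:
    likelihood *= p

print(likelihood)   # ≈ 0.0089
```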

In linear regression, MSE measures the quality of the linear model: the smaller the MSE, the better the fit. In logistic regression, Maximum Likelihood is used instead: the larger this quantity, the better the model.

For a single sample, when it is a positive sample the corresponding probability is p, and when it is a negative sample the corresponding probability is 1 − p. To simplify the calculation, we want a single expression that covers both cases:

$$P(y) = p^{\,y}\,(1-p)^{1-y}$$

Here y is the sample label and takes only the two values 0 and 1. When y = 1 (a positive sample) the expression reduces to p, and when y = 0 (a negative sample) it reduces to 1 − p, so the probability of every sample is written in one unified form, and the total Likelihood can then be expressed as:

$$L(w) = \prod_{i=1}^{n} p_i^{\,y_i}\,(1-p_i)^{1-y_i}, \qquad p_i = \frac{1}{1+e^{-w^{\top}x_i}}$$

In the formula above, n is the number of samples and the subscript i denotes the i-th sample; x is the feature vector, y is the observed target value, and w is the weight vector of the features, which is also the parameter of the model. L is the Likelihood of all the samples and plays the role of the Loss function in logistic regression. Our goal is to adjust w so as to maximize L.

Usually we take the log to convert the product into a sum, add a negative sign, and turn the maximization into a minimization, as follows:

$$\text{Loss}(w) = -\log L(w) = -\sum_{i=1}^{n}\Big[\,y_i\log p_i + (1-y_i)\log(1-p_i)\,\Big]$$

The next step is to calculate the gradient of the Loss, update the parameters according to the gradient, and iterate until convergence. To keep the article readable we will not continue the derivation here, but readers who have not derived it before are encouraged to work it out on paper to deepen their understanding.
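
As an illustration, here is a minimal from-scratch sketch of that loop (my own code, not the article's). It uses the fact that the gradient of the averaged Loss above with respect to w is Xᵀ(p − y)/n, and it standardizes the single feature so that plain gradient descent behaves well:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.5, n_iters=5000):
    """Fit the weight vector w by gradient descent on the negative log-likelihood."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        p = sigmoid(X @ w)                  # predicted probabilities p_i
        grad = X.T @ (p - y) / n_samples    # gradient of the averaged Loss
        w -= lr * grad                      # gradient descent step
    return w

# Toy weight/obesity data in the spirit of the example above (made-up numbers).
weights_kg = np.array([50.0, 55.0, 60.0, 80.0, 85.0, 90.0, 95.0, 100.0])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Standardize the feature and add a bias column.
x = (weights_kg - weights_kg.mean()) / weights_kg.std()
X = np.column_stack([np.ones_like(x), x])

w = fit_logistic_regression(X, y)
x_new = (70.0 - weights_kg.mean()) / weights_kg.std()
print("P(obese | 70 kg):", sigmoid(np.array([1.0, x_new]) @ w))
```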

Logistic regression and Bayesian classification

The core of Bayesian classification still comes from the classic Bayesian formula:

$$P(c\mid x) = \frac{P(x\mid c)\,P(c)}{P(x\mid c)\,P(c) + P(x\mid \bar{c})\,P(\bar{c})}$$

In a classification problem, what we are really looking for is the probability that sample x belongs to class c once x has been observed, that is, P(c|x). On the right-hand side, c̄ denotes the classes other than c; P(c) and P(c̄) are prior probabilities, and in general we can set them to be equal. For example, in a binary problem we can set both priors to 0.5.

Then P(x|c) is the probability of observing sample x within class c; similarly, P(x|c̄) is the probability of observing sample x within class c̄. Hence, P(c|x) is a posterior probability.

Now that we understand Bayesian classification, let us divide both the numerator and the denominator on the right-hand side by P(x|c)P(c), which gives:

$$P(c\mid x) = \frac{1}{1 + \dfrac{P(x\mid \bar{c})\,P(\bar{c})}{P(x\mid c)\,P(c)}}$$

Doesn't this look like the sigmoid function? Let us define:

$$e^{-z} = \frac{P(x\mid \bar{c})\,P(\bar{c})}{P(x\mid c)\,P(c)}$$

Set the prior probabilities equal (so that they cancel) and take the log of both sides; then:

$$-z = \log\!\left(\frac{P(x\mid \bar{c})}{P(x\mid c)}\right)$$

Move the minus sign to the right:

$$z = \log\!\left(\frac{P(x\mid c)}{P(x\mid \bar{c})}\right)$$

Finally, bring z back to the original formula:

$$P(c\mid x) = \frac{1}{1 + e^{-z}}$$

The conclusion is that logistic regression is essentially Bayesian classification: both are posterior probability models.
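
A quick numeric check of this equivalence (a sketch with made-up class-conditional probabilities and equal priors):

```python
import numpy as np

# Made-up class-conditional probabilities for a single observation x.
p_x_given_c     = 0.30   # P(x | c)
p_x_given_not_c = 0.05   # P(x | c-bar)
prior = 0.5              # equal priors: P(c) = P(c-bar) = 0.5

# Posterior computed directly from Bayes' formula.
posterior_bayes = (p_x_given_c * prior) / (p_x_given_c * prior + p_x_given_not_c * prior)

# Posterior computed as sigmoid(z) with z = log(P(x | c) / P(x | c-bar)).
z = np.log(p_x_given_c / p_x_given_not_c)
posterior_sigmoid = 1.0 / (1.0 + np.exp(-z))

print(posterior_bayes, posterior_sigmoid)   # both ≈ 0.857
```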

Conclusion

In this article, we explained the principle of the logistic regression algorithm mainly through the two concepts of log(odds) and Bayesian classification, and saw that logistic regression uses Maximum Likelihood as its loss function. I hope this article gives you a deeper understanding of logistic regression, as it did for me.

Reference:

  • Logistic Regression, Clearly Explained
  • Classification
  • Logistic Regression

Related articles:

  • AdaBoost for decision tree algorithm
  • Random forests of decision trees
  • Classification and Regression Trees (CART) for Decision tree algorithm [1]
  • Classification and Regression Trees (CART) for decision tree algorithm [2]