
Logistic regression, also known as logistic regression analysis, is a generalized linear model often used in data mining, economic forecasting, and other fields. For linear regression, the output $w^Tx_i$ ranges over all real numbers, but for a classification problem we want the output to be the probability of a certain category.

We introduced maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP) before; MLE comes from the frequentist school and MAP from the Bayesian school, and we know that MAP adds a prior on top of MLE. So first, let's think about how to map a real-valued output into a probability space of values between 0 and 1, which brings us to the sigmoid function:


$$\sigma(z) = \frac{1}{1+e^{-z}}$$

Let's look at the graph of the function: it is an S-shaped curve that passes through $\sigma(0) = 0.5$, approaching 0 as $z \to -\infty$ and 1 as $z \to +\infty$.
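If you want to reproduce the curve yourself, here is a minimal sketch, assuming `numpy` and `matplotlib` are installed:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.grid(True)
plt.show()
```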

This function takes any real number as input and produces an output between 0 and 1, which we can interpret as a probability. Consider a binary classification problem in terms of conditional probabilities: the two categories are $y=1$ and $y=0$, so $y$ follows a Bernoulli distribution, a 0/1 problem.


$$P_1(y=1|x) = \sigma(w^Tx) = \frac{1}{1 + e^{-w^Tx}}$$

$$P_0(y=0|x) = 1 - p_1 = 1 - \sigma(w^Tx) = \frac{e^{-w^Tx}}{1 + e^{-w^Tx}}$$

These two cases can be combined into a single Bernoulli distribution:


$$p(y|x) = p_1^y \, p_0^{1-y}$$
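As a quick sanity check, here is a minimal sketch of these two conditional probabilities (assuming `numpy`; the function names and toy numbers are my own, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_pmf(w, x, y):
    """p(y|x) = p1^y * p0^(1-y), where p1 = sigmoid(w^T x)."""
    p1 = sigmoid(w @ x)
    return p1**y * (1.0 - p1)**(1 - y)

w = np.array([0.5, -0.25])
x = np.array([1.0, 2.0])
print(bernoulli_pmf(w, x, 1), bernoulli_pmf(w, x, 0))  # the two values sum to 1
```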

The data can be described by the conditional probability $P(Y|X)$, where $X$ is the collection of samples and $Y$ the labels, because given the data $X$ we model the conditional probability of $Y$. Maximizing the log-likelihood of the labels gives:


$$\hat{w} = \argmax_w \log P(Y|X) = \argmax_w \log \prod_{i=1}^N P(y_i|x_i)$$

The joint probability $P(Y|X)$ factorizes because the samples are independent, so it can be written as the product $\prod_{i=1}^N P(y_i|x_i)$. Taking the log turns the product into a sum, and substituting the Bernoulli form $p(y_i|x_i) = p_1^{y_i} p_0^{1-y_i}$ gives:


$$\hat{w} = \argmax_w \sum_{i=1}^N \left( y_i \log p_1 + (1-y_i)\log p_0 \right)$$

$$f(x_i; w) = \frac{1}{1 + e^{-w^Tx_i}}$$

$$\hat{w} = \argmax_w \sum_{i=1}^N \left( y_i \log f(x_i; w) + (1-y_i)\log \left(1 - f(x_i; w)\right) \right)$$
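Translated directly into code, this objective is just a sum over the data. Here is a sketch assuming `numpy`; the small `eps` guard against `log(0)` is my own addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """sum_i [ y_i * log(p1_i) + (1 - y_i) * log(p0_i) ] over the whole sample."""
    p1 = sigmoid(X @ w)   # p1_i = sigmoid(w^T x_i), shape (N,)
    eps = 1e-12           # avoid log(0)
    return np.sum(y * np.log(p1 + eps) + (1 - y) * np.log(1 - p1 + eps))
```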

Adding a minus sign turns this maximization into minimizing the cross-entropy loss, which is exactly the objective we optimize when using logistic regression for classification.
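To close the loop, here is a minimal gradient-descent sketch that minimizes this cross-entropy loss (assuming `numpy`; there is no bias term, matching the plain $w^Tx$ form above, and the function names, synthetic data, and hyperparameters are my own, for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Minimize the averaged cross-entropy loss by plain gradient descent.
    The gradient of the loss w.r.t. w is X^T (sigmoid(Xw) - y) / N."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p1 = sigmoid(X @ w)        # predicted P(y=1|x_i) for every sample
        grad = X.T @ (p1 - y) / N  # gradient of the averaged loss
        w -= lr * grad
    return w

# toy usage on synthetic, linearly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
w_hat = fit_logistic_regression(X, y)
acc = ((sigmoid(X @ w_hat) > 0.5) == y).mean()
print("training accuracy:", acc)
```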