5.1 Logistic regression theory

5.1.1 Introduction to logistic regression

Logistic regression is a classification algorithm: it handles both binary and multiclass classification. Despite the word "regression" in its name, it is not a regression algorithm. So why the misleading name? Although logistic regression is a classification model, its principle still carries the shadow of the regression model. Having explained linear regression earlier, we now turn to logistic regression.

We know that the linear regression model finds the linear coefficients $\theta$ between the output vector $Y$ and the input sample matrix $X$, such that $Y = X\theta$. There, $Y$ is continuous, so it is a regression model. What if we want $Y$ to be discrete? One way to think about it is to apply a further transformation $g(Y)$ to $Y$. If we let $g(Y)$ take the value of category A on one interval of real numbers, category B on another interval, and so on, we have a classification model. If the result has only two categories, it is a binary classification model. This is where logistic regression starts. Let us begin with binary logistic regression.

5.1.2 Binary logistic regression model

To classify with logistic regression, the first thing we need to find is the classification boundary. So what is a classification boundary?

Figure 1

The two figures above show the linearly and nonlinearly separable cases when the samples have two features $x_1, x_2$. The classification boundary is the green line and green curve shown in the figure. (This paper only covers the linearly separable case with $n$ features.)

Classifying with logistic regression means finding such a classification boundary, so that it classifies the samples correctly as far as possible, that is, separates the two classes as far as possible. We can therefore make a bold guess and construct a function of the following form to separate the sample set (in Figure 1 the number of features is 2 and the boundary is a straight line; with $n$ features the boundary is a "hyperplane"):

$z(x^{(i)})=\theta_0+\theta_1 x_1^{(i)}+\theta_2 x_2^{(i)}+\dots+\theta_n x_n^{(i)}$

where $i=1,2,\dots,m$ indexes the samples and $n$ is the number of features. When $z(x^{(i)})>0$, the sample point lies above the dividing line and is assigned to class "1"; when $z(x^{(i)})<0$, the sample point lies below the dividing line and is assigned to class "0".

It is easy to think of linear regression, which constructs the same kind of function:

$h_{\theta}(x^{(i)})=\theta_0+\theta_1 x_1^{(i)}+\theta_2 x_2^{(i)}+\dots+\theta_n x_n^{(i)}$

But unlike logistic regression, the linear regression model takes the feature vector $x^{(i)}=[x_1^{(i)}, x_2^{(i)}, \dots, x_n^{(i)}]^T$ as input and outputs a continuous prediction. Logistic regression, as a classification algorithm, outputs 0/1. You may have seen a function with this property before: the Heaviside step function, also called the unit step function. The problem with the Heaviside step function is that it jumps from 0 to 1 at the jump point, which is sometimes difficult to deal with. Fortunately, another function has similar properties and is mathematically more tractable: the sigmoid function, introduced next. The sigmoid function is given by:

$g(z)=\frac{1}{1+e^{-z}}$

As $z$ goes to minus infinity, $g(z)$ goes to 0, and as $z$ goes to plus infinity, $g(z)$ goes to 1, which fits nicely into our classification probability model. In addition, it has a nice derivative property:

$g'(z)=g(z)(1-g(z))$

This is easy to obtain by differentiating $g(z)$, and we will use this expression later. Figure 2 shows the sigmoid function at two different coordinate scales. When $x$ is 0, the sigmoid value is 0.5. As $x$ increases, the corresponding sigmoid value approaches 1; as $x$ decreases, it approaches 0. If the abscissa scale is large enough (lower plot in Figure 2), the sigmoid function looks like a step function.
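As a quick numerical sanity check, the sketch below (plain Python, no external libraries) compares the derivative identity $g'(z)=g(z)(1-g(z))$ against a central finite difference; the test point `z = 0.7` is an arbitrary choice:

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Compare the analytic derivative g'(z) = g(z) * (1 - g(z))
# against a central finite difference.
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(abs(numeric - analytic) < 1e-8)  # True
print(sigmoid(0))                      # 0.5
```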

Therefore, to implement a logistic regression classifier, we multiply each feature by a regression coefficient, add up all the results, and substitute this sum into the sigmoid function, obtaining a value in the range 0~1. Anything greater than 0.5 is classed as 1, and anything less than 0.5 is classed as 0. Logistic regression can therefore also be regarded as a probability estimate. As the graph shows, the sigmoid function maps the numbers in $(-\infty,\infty)$ nicely into $(0,1)$. So we assign $g(z)\geq 0.5$ to class "1" and $g(z)<0.5$ to class "0". That is:

$y=\begin{cases}1, & g(z)\geq 0.5\\0, & g(z)<0.5\end{cases}$
where $y$ denotes the classification result. The sigmoid function actually expresses the probability of classifying the sample into class "1".

Figure 2. Sigmoid function at two coordinate scales

[Note] The abscissa of the upper figure runs from −5 to 5, where the curve changes smoothly. The abscissa scale of the lower figure is large enough to see that the sigmoid function looks very much like a step function at $x=0$.

The classification boundary was given above:

$z(x^{(i)})=\theta_0+\theta_1 x_1^{(i)}+\theta_2 x_2^{(i)}+\dots+\theta_n x_n^{(i)}=\theta^T x^{(i)}$

where:

$x^{(i)}=[x_0^{(i)}, x_1^{(i)}, \dots, x_n^{(i)}]^T, \quad \theta=[\theta_0, \theta_1, \dots, \theta_n]^T$

and $x_0^{(i)}=1$ is the bias term, $n$ is the number of features, and $i=1,2,\dots,m$ indexes the $m$ samples.

For comparison with linear regression, the network structure is shown in Figure 3. Logistic regression is very similar to the adaptive linear network; the difference is that the activation function of logistic regression is the sigmoid, while that of the adaptive linear network is the identity $y=x$. The network structures of the two are shown in Figure 4. Logistic regression is a linear classification model. It differs from linear regression in that it compresses the wide range of numbers output by linear regression, from negative infinity to positive infinity, into the interval between 0 and 1, so that the output can be interpreted as a "probability". Squeezing large values into this range also has the nice side effect of limiting the influence of extreme values. All it takes to achieve this is something mundane: adding a logistic function to the output. In addition, for binary classification it can simply be read as: if the probability that sample $x$ belongs to the positive class is greater than 0.5, it is judged positive; otherwise, negative.

Figure 3

Figure 4

Therefore, logistic regression is just linear regression normalized by the logistic function. Combining it with the sigmoid function above, we can construct the logistic regression model function:

$h_{\theta}(x^{(i)})=g(z)=g(\theta^T x^{(i)})=\frac{1}{1+e^{-\theta^T x^{(i)}}}$
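A minimal sketch of this model function in plain Python, where the coefficient vector `theta` and the boundary it encodes are hypothetical; `theta[0]` plays the role of the bias $\theta_0$, paired with $x_0=1$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x); theta[0] is the bias, paired with x_0 = 1."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)

def predict(theta, x):
    # Class "1" when h_theta(x) >= 0.5, which is exactly when theta^T x >= 0.
    return 1 if predict_proba(theta, x) >= 0.5 else 0

theta = [-3.0, 1.0, 1.0]           # hypothetical boundary x1 + x2 = 3
print(predict(theta, [2.0, 2.0]))  # point above the boundary -> 1
print(predict(theta, [1.0, 1.0]))  # point below the boundary -> 0
```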

Now that we understand the binary logistic regression model, let us look at its loss function; our goal is to minimize the loss function to obtain the corresponding model coefficients $\theta$.

5.1.3 Cost function of logistic regression

In linear regression, we use the mean squared error as the cost function:

$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^2$

Suppose we still used the mean squared error as the cost function of logistic regression; what would happen? Substituting $g(z)=\frac{1}{1+e^{-z}}$ into it, we find that the result is a non-convex function: it has many local minima, and when solving for the parameters it is easy to fall into a local minimum and fail to find the global minimum.

Figure 5

Why does this happen? Because the sigmoid makes the model nonlinear, the loss function definition borrowed from linear regression is no longer useful. For linear regression we derived the loss function by maximum likelihood; can the same be done for logistic regression? The answer is yes. As mentioned above, logistic regression can be regarded as a probability estimate, and the sigmoid function actually expresses the probability of classifying a sample into class "1", so we can use maximum likelihood estimation to derive the loss function. That is, taking the sigmoid output as the posterior estimate $p(y=1|x,\theta)$ of class 1, we get:

$p(y=1|x,\theta)=h_{\theta}(x^{(i)}), \quad p(y=0|x,\theta)=1-h_{\theta}(x^{(i)})$

Here $p(y=1|x,\theta)$ is the probability that the sample is classified as $y=1$, and $p(y=0|x,\theta)$ is the probability that it is classified as $y=0$. The two equations above can be combined as:

$p(y|x,\theta)=h_{\theta}(x^{(i)})^{y}\,(1-h_{\theta}(x^{(i)}))^{1-y}$

The likelihood function can be obtained as follows:

$L(\theta)=\prod_{i=1}^{m}(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}}$

The logarithmic likelihood function is:

$\log L(\theta)=\sum_{i=1}^{m}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$

Thus we have the cost function: we can find the parameters $\theta$ by maximizing $\log L(\theta)$. To facilitate calculation, the cost function is rewritten as:

$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_{\theta}(x^{(i)}))+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))\right]$

Now we only need to minimize $J(\theta)$ to obtain the parameters $\theta$.
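The cost $J(\theta)$ above can be sketched directly in plain Python; the toy data and the bias-first layout of `X` (each row starts with $x_0=1$) are assumptions for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)),
    where each row of X already includes the bias feature x_0 = 1."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]  # toy data, bias column first
y = [1, 0, 1]
print(cost([0.0, 0.0], X, y))  # log(2) ~ 0.6931, since h = 0.5 everywhere
```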

5.1.4 Optimization algorithm

There are many methods for minimizing the loss function of binary logistic regression; the most common are gradient descent, coordinate descent, quasi-Newton methods, and so on. Here we derive the update formula for $\theta$ in each iteration of gradient descent.

The process of gradient descent is as follows:

repeat {

$\theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial\theta_j}$

}

where $\alpha$ is the learning rate, i.e. the "step length" of each update; $\frac{\partial J(\theta)}{\partial\theta_j}$ is the gradient, with $j=1,2,\dots,n$.

Next, we solve the gradient:

$\frac{\partial J(\theta)}{\partial\theta_j}=-\frac{1}{m}\sum_{i=1}^{m}\left[\frac{y^{(i)}}{h_{\theta}(x^{(i)})}-\frac{1-y^{(i)}}{1-h_{\theta}(x^{(i)})}\right]\frac{\partial h_{\theta}(x^{(i)})}{\partial\theta_j}$

Among them:

$\frac{\partial h_{\theta}(x^{(i)})}{\partial\theta_j}=g'(\theta^T x^{(i)})\,x_j^{(i)}$

And because:

$g'(z)=g(z)(1-g(z))$

we have:

$\frac{\partial h_{\theta}(x^{(i)})}{\partial\theta_j}=h_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))\,x_j^{(i)}$

Therefore:

$\frac{\partial J(\theta)}{\partial\theta_j}=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}(1-h_{\theta}(x^{(i)}))-(1-y^{(i)})h_{\theta}(x^{(i)})\right]x_j^{(i)}$

That is:

$\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\,x_j^{(i)}$

From the above, we obtain the gradient descent process:

repeat {

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\,x_j^{(i)}$

}

where $i=1,2,\dots,m$ indexes the samples and $j=1,2,\dots,n$ indexes the features.
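Putting the update rule together, here is a minimal batch gradient descent sketch in plain Python; the toy data, `alpha`, and the iteration count are arbitrary choices for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for logistic regression.
    Rows of X include the bias feature x_0 = 1; returns the learned theta."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        # gradient_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = sigmoid(sum(t * x for t, x in zip(theta, xi))) - yi
            for j in range(n):
                grad[j] += err * xi[j] / m
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy linearly separable data: the class depends on whether x1 > 1.
X = [[1.0, 0.0], [1.0, 0.5], [1.0, 1.5], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
preds = [1 if sigmoid(sum(t * x for t, x in zip(theta, xi))) >= 0.5 else 0
         for xi in X]
print(preds)  # [0, 0, 1, 1]
```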

5.1.5 Extending binary logistic regression: multiple logistic regression

In the previous sections, the model and loss function were limited to binary logistic regression. In fact, they can easily be extended to multiple logistic regression. For example, we can always treat one class as positive and all the rest as negative; this is the most common method, one-vs-rest (OvR). Another family of methods is many-vs-many (MvM), which selects samples from one set of categories and samples from another set and runs a binary logistic regression between them. Its most common special case is one-vs-one (OvO), where each time we choose two classes and fit a binary logistic regression. First, let us review binary logistic regression:

$p(y=1|x,\theta)=h_{\theta}(x^{(i)})=\frac{1}{1+e^{-\theta^T x^{(i)}}}$

$p(y=0|x,\theta)=1-h_{\theta}(x^{(i)})=\frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}$

where $y$ can only take the values 0 and 1. Then:

$\ln\frac{p(y=1|x,\theta)}{p(y=0|x,\theta)}=\theta^T x^{(i)}$
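This log-odds identity can be checked numerically; the value of `z` below is an arbitrary stand-in for $\theta^T x^{(i)}$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# For any z = theta^T x, ln( p(y=1) / p(y=0) ) recovers z itself.
z = 1.3           # hypothetical value of theta^T x
p1 = sigmoid(z)   # p(y=1|x, theta)
p0 = 1.0 - p1     # p(y=0|x, theta)
print(abs(math.log(p1 / p0) - z) < 1e-12)  # True
```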

To generalize to multiple logistic regression, the model needs to be extended a bit. Assume a $K$-class model, i.e. the sample output $y$ takes values $1,2,\dots,K$. Following the experience of binary logistic regression, we have:

$\ln\frac{p(y=1|x,\theta)}{p(y=K|x,\theta)}=\theta_1^T x^{(i)}$

$\ln\frac{p(y=2|x,\theta)}{p(y=K|x,\theta)}=\theta_2^T x^{(i)}$

$\dots$

$\ln\frac{p(y=K-1|x,\theta)}{p(y=K|x,\theta)}=\theta_{K-1}^T x^{(i)}$

This gives $K-1$ equations. Adding the constraint that the probabilities sum to 1:

$\sum_{k=1}^{K}p(y=k|x,\theta)=1$

So we have $K$ equations in the $K$ class probabilities. Solving this system yields the probability distribution of $K$-class logistic regression:

$p(y=k|x,\theta)=\frac{e^{\theta_k^T x^{(i)}}}{1+\sum_{t=1}^{K-1}e^{\theta_t^T x^{(i)}}}, \quad k=1,2,\dots,K-1$

$p(y=K|x,\theta)=\frac{1}{1+\sum_{t=1}^{K-1}e^{\theta_t^T x^{(i)}}}$

The loss function derivation and optimization methods of multiple logistic regression are similar to those of binary logistic regression and are not repeated here.
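The distribution derived above can be sketched as follows, with class $K$ as the reference class; the two parameter vectors in the example (a hypothetical $K=3$ model) are made up for illustration:

```python
import math

def multiclass_proba(thetas, x):
    """Probabilities for K-class logistic regression, where `thetas` holds the
    K-1 parameter vectors theta_1 .. theta_{K-1} (class K is the reference):
        p(y=k|x) = exp(theta_k^T x) / (1 + sum_t exp(theta_t^T x)),  k < K
        p(y=K|x) = 1 / (1 + sum_t exp(theta_t^T x))
    """
    scores = [math.exp(sum(t * xi for t, xi in zip(theta, x)))
              for theta in thetas]
    denom = 1.0 + sum(scores)
    return [s / denom for s in scores] + [1.0 / denom]

# Hypothetical 3-class model (K = 3, so two parameter vectors).
thetas = [[0.5, -1.0], [0.2, 0.8]]
probs = multiclass_proba(thetas, [1.0, 2.0])
print(abs(sum(probs) - 1.0) < 1e-12)  # the K probabilities sum to 1
```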
