So far, we have covered support vector machines (SVM), decision trees, KNN, naive Bayes, linear regression, and Logistic regression. For other algorithms, please allow Taoye to owe you an IOU for now; we will make them up later when there is time and opportunity.

The updates so far have also received some praise from readers. It is not much, but thank you very much for your support, and I hope everyone who reads the series finds it rewarding.

The entire content of this series is hand-written by Taoye, with reference to a number of books and open-source resources. The total length of the series is around 150,000 words (including source code), and gaps will be gradually filled in later. More technical articles can be found on Taoye's official account: Cynical Coder. The documents may be circulated freely, but please do not modify their contents.

If there is anything in the article you do not understand, feel free to ask directly, and Taoye will reply as soon as he sees it. You are also welcome to privately nudge Taoye for updates on the official account Cynical Coder, where Taoye's personal contact information is also available. There are some things Taoye can only whisper to you there (# ‘O’)

To improve the reading experience, the hand-tearing machine learning series written by Taoye has been compiled into PDF and HTML, available for free by replying [666] on the official account [Cynical Coder].

In the previous article of the hand-tearing machine learning series, we explained linear regression in detail and finally fitted a straight line with the gradient descent algorithm, making the line fit the data sample set as well as possible and thereby minimizing the model's loss value.

In this article, we mainly focus on Logistic regression, which is also covered as the logistic regression model in Li Hang's book Statistical Learning Methods. Hearing the word "regression," some readers may wonder: the previous article used linear regression to solve a fitting problem, so does Logistic regression solve the same kind of problem too, just with a different algorithm?

In fact, the Logistic regression model is a generalized linear model aimed mainly at classification problems. The classification model simply shares some similarities with the fitting model of the previous article; in other words, if you understood the linear regression of the previous article, the Logistic regression in this one will be very easy to learn. That is exactly why Taoye worked through linear regression first.

So far there have been eight updates in the hand-tearing machine learning series, and readers can "recharge" themselves as needed (updates are ongoing):

  • Machine Learning in Action – Analyzing support vector machines, single-handedly tearing apart linear SVM: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – Analyzing support vector machines (SVM), the SMO optimization: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – Understand what you know and understand what you don't: nonlinear support vector machines (SVM): www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – Taoye tells you what kind of "ghost" the decision tree is: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – Kids, come and play with your decision tree: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – A female classmate asks Taoye how to play KNN to clear the level: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – Bayes in plain words: should the "melon-eating" masses pick a good melon or a bad one: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – An introduction to those things about linear regression: www.zybuluo.com/tianxingjia…
  • Machine Learning in Action – Taoye tells you what Logistic regression is all about: www.zybuluo.com/tianxingjia…

This article mainly consists of the following two parts:

  • Make friends with Logistic regression and get to know each other (principle analysis, formula derivation)
  • Solving the binary classification problem with Logistic regression

1. Make friends with Logistic regression and get to know each other (Principle analysis)

We have already seen quite a few classification algorithms in machine learning. Logistic regression is another member of the family, generally used for binary classification problems, such as whether a patient has stomach cancer or whether it will rain tomorrow. Of course, Logistic regression can also handle multi-class problems; after all, for every clever plan there is a ladder over the wall. This article mainly takes binary classification as an example to analyze the little secrets inside Logistic regression.

Suppose we have some sample data points, each with two attributes. We can fit them with a straight line; this fitting process is called regression, and the fitting effect is shown below:

This was also discussed in detail in the previous article; for the specifics, see Machine Learning in Action – An introduction to those things about linear regression: www.zybuluo.com/tianxingjia…

Logistic regression is a classification algorithm whose core idea builds on linear regression and extends it, mainly by using the Sigmoid function to squash values into [0, 1], which matches the range of a probability exactly. Therefore, Logistic regression is essentially a discriminative model based on conditional probability (it determines which category a sample belongs to given the sample's attribute features).

Now let’s get acquainted with Logistic regression and see if we can unearth some of its secrets.

Do you still remember that when we explained SVM, we used margin maximization to find the best decision surface and thereby classify the data set?

In the view of Logistic regression, each category label of a sample corresponds to a probability, and the sample is placed into whichever category has the higher probability. Suppose a single sample has $N+1$ attribute features, which can be regarded as a vector, and that each sample's label takes only one of two values, namely:


$$
\begin{aligned}
& x_i = (x_i^{(0)},\,x_i^{(1)},\,x_i^{(2)},\,\dots,\,x_i^{(N)}) \\
& y_i \in \{0, 1\}
\end{aligned}
$$

In other words, what we want now is a hyperplane with one category of samples on one side and the other category on the other side, where each sample's label can only be 0 or 1. For this, we assume that, given the sample's attribute features and the model parameters $w=(w_0,w_1,\dots,w_N)$, the probability that the sample's label is 1 is $h_w(x)$. Since only two categories exist, the probability that the label is 0 is $1-h_w(x)$, namely:


$$
\begin{aligned}
& P(y_i=1 \mid x_i; w) = h_w(x_i) \\
& P(y_i=0 \mid x_i; w) = 1-h_w(x_i)
\end{aligned}
$$

The formula above says that, given the sample attribute features $x$ and the model parameters $w$, the probability of the sample label being 1 is $h_w(x)$ and the probability of it being 0 is $1-h_w(x)$. Naturally, we want the gap between the two to be as large as possible, so that the classification result is more convincing.

For example, if for some sample the probability of label 0 is 0.9 and the probability of label 1 is 0.1, we would of course prefer to put the sample into category 0, and such a result is easy to accept. But suppose the probability of label 0 is 0.51 and the probability of label 1 is 0.49. Which category would you place the sample in? Hard to decide, isn't it? Rather confusing? A classification probability like that is little better than a blind guess and cannot convince anyone.

For this reason, the larger the gap between the probabilities of the two labels, the better, so that our classification of the samples is more convincing.

The two probabilities above cover two separate cases, since a sample's label is either 0 or 1. Using this fact, and to make the classification probability easier to write down, the two probability values can be combined as follows:


$$
P(y_i \mid x_i; w) = h_w(x_i)^{y_i}\,(1-h_w(x_i))^{1-y_i}
$$

This trick of combining two cases into one expression also appeared in the SVM article; readers who have forgotten it can jump back and review: Machine Learning in Action – Analyzing support vector machines, single-handedly tearing apart linear SVM: www.zybuluo.com/tianxingjia…

When $y_i=1$, the exponent of the second factor becomes 0, so that factor equals 1 and only $h_w(x_i)$ remains; when $y_i=0$, the first factor equals 1 and only $1-h_w(x_i)$ remains, as written out below. In other words, our goal now is to maximize the value of this expression over the training sample set, so that the samples are classified with higher accuracy.
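To make that concrete, substituting the two possible label values back into the combined expression recovers exactly the two case-wise probabilities above:

$$
\begin{aligned}
y_i = 1: &\quad P(y_i \mid x_i; w) = h_w(x_i)^{1}\,(1-h_w(x_i))^{0} = h_w(x_i) \\
y_i = 0: &\quad P(y_i \mid x_i; w) = h_w(x_i)^{0}\,(1-h_w(x_i))^{1} = 1-h_w(x_i)
\end{aligned}
$$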

The $P(y_i \mid x_i; w)$ above is the probability for a single sample, but as we know, the sample set consists of a large number of individual samples, so this is where the maximum likelihood method comes in. Assuming the samples are mutually independent, the likelihood function (for $n$ samples) is:


$$
\prod_{i=1}^{n} h_w(x_i)^{y_i}\,(1-h_w(x_i))^{1-y_i}
$$

At this point, our goal is to maximize the likelihood above over the data sample set. To find a maximum we naturally turn to derivatives, but differentiating the product form directly is not simple and would greatly increase the complexity of the solution. So we take the logarithm, which converts the product into a sum and makes the differentiation much easier.

Suppose we call the log-transformed expression $L(w)$; this is our final loss function, or the objective function to be optimized, and it has the following form:


$$
L(w)=\sum_{i=1}^{n}\big[\,y_i\log(h_w(x_i))+(1-y_i)\log(1-h_w(x_i))\,\big]
$$

Here we have obtained the final objective function to be solved. What we need now is to maximize the value of this loss function, so that the classification of the sample set becomes as accurate as possible.

Note: readers may wonder about the loss function above. Surely it is the loss value that should be minimized, so why maximize this expression? In fact, although we call it a loss function, what it really expresses is the classification accuracy over the whole sample set, which we want to be as high as possible; equivalently, maximizing $L(w)$ is the same as minimizing $-L(w)$, which is exactly the familiar cross-entropy loss. If you still have doubts here, go back over the derivation and think about what the formula really means.
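A tiny numerical check (not in the original derivation; the made-up numbers echo the 0.9/0.1 and 0.51/0.49 example above) shows why a larger $L(w)$ means better classification: confident, correct probabilities give a higher log-likelihood than probabilities hovering around 0.5.

```python
import numpy as np

y = np.array([1, 0, 1, 0])                      # true labels of four toy samples
confident = np.array([0.9, 0.1, 0.9, 0.1])      # h_w(x): confident and correct
hesitant = np.array([0.51, 0.49, 0.51, 0.49])   # h_w(x): barely better than guessing

def log_likelihood(y, h):
    # L(w) = sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print(log_likelihood(y, confident))   # about -0.42  (larger, i.e. better)
print(log_likelihood(y, hesitant))    # about -2.69
```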

There is still one unknown in the loss function: $h_w(x)$. From the earlier analysis we know that $h_w(x)$ represents a probability, so its value should lie between 0 and 1. However, what we actually compute is $w_0x_i^{(0)}+w_1x_i^{(1)}+w_2x_i^{(2)}+\dots+w_Nx_i^{(N)}$, and the range of this value is unbounded. We therefore need to transform the computed value into the range between 0 and 1, so that it behaves like a probability.

So, how do we do that?

Clever researchers found a function that maps any value, no matter how large or how small, into the interval between 0 and 1. This function is known as the Sigmoid, and its expression and graph are shown below:


$$
g(z)=\frac{1}{1+e^{-z}}
$$

From the expression and the graph of the Sigmoid function, we can see that it meets our needs exactly. You will also run into the Sigmoid function frequently later on; for example, in convolutional neural networks it is often used as an activation function to introduce nonlinearity.
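As a quick sketch (not part of the article's own code), the squashing behaviour is easy to check numerically: whatever real number goes in, something strictly between 0 and 1 comes out, with $g(0)=0.5$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-100.0, -10.0, -1.0, 0.0, 1.0, 10.0, 100.0])
print(sigmoid(z))
# roughly [3.7e-44, 4.5e-05, 0.269, 0.5, 0.731, 0.99995, 1.0] -- always inside (0, 1)
```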

In this way, we use the Sigmoid function to process the samples and obtain the following result:


$$
h_w(x)=g(w^Tx)=\frac{1}{1+e^{-w^Tx}}
$$

Now that we know the concrete expression for $h_w(x)$, we can further process and transform the loss function. The transformation proceeds as follows:


$$
\begin{aligned}
L(w) & = \sum_{i=1}^{n} \big[\,y_i\log(h_w(x_i))+(1-y_i)\log(1-h_w(x_i))\,\big] \\
& = \sum_{i=1}^{n}\Big[\,y_i\log\frac{h_w(x_i)}{1-h_w(x_i)}+\log(1-h_w(x_i))\,\Big] \\
& = \sum_{i=1}^{n}\big[\,y_i(w^Tx_i)-w^Tx_i-\log(1+e^{-w^Tx_i})\,\big]
\end{aligned}
$$
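The two substitutions used in that last step follow directly from the Sigmoid form of $h_w(x)$, a detail the article glosses over:

$$
\begin{aligned}
\log\frac{h_w(x_i)}{1-h_w(x_i)} &= \log\frac{1/(1+e^{-w^Tx_i})}{e^{-w^Tx_i}/(1+e^{-w^Tx_i})} = \log e^{w^Tx_i} = w^Tx_i \\
\log(1-h_w(x_i)) &= \log\frac{e^{-w^Tx_i}}{1+e^{-w^Tx_i}} = -w^Tx_i-\log(1+e^{-w^Tx_i})
\end{aligned}
$$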

With the loss function in the form above, we can see that over the entire training data sample set, $x_i$ and $y_i$ are known, and the only unknown is $w$, a vector matching the attribute features of a single sample. Taking the derivative with respect to $w$ gives the following result:


$$
\frac{\partial L(w)}{\partial w}=\sum_{i=1}^{n}(y_i-h_w(x_i))\,x_i
$$
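For readers who want to verify this tidy result (the intermediate step is skipped in the article), differentiating the last form of $L(w)$ term by term gives:

$$
\frac{\partial L(w)}{\partial w}
= \sum_{i=1}^{n}\Big[\,y_i x_i - x_i + \frac{e^{-w^Tx_i}}{1+e^{-w^Tx_i}}\,x_i\,\Big]
= \sum_{i=1}^{n}\big[\,y_i x_i - h_w(x_i)\,x_i\,\big]
= \sum_{i=1}^{n}(y_i-h_w(x_i))\,x_i
$$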

Thus, once we have the gradient of the loss function with respect to $w$, we can keep updating the $w$ parameters iteratively with the gradient ascent algorithm so as to maximize the value of the loss function, that is, to make the classification accuracy on the training sample set as high as possible.

In addition, we know that the parameter $w$ is actually a vector, matching the attribute features of $x$. So when we update $w$, we update every element inside it at the same time. From the derivative above, the concrete update for each element is as follows:


$$
\begin{aligned}
& w_0^{new}=w_0^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^0 \\
& w_1^{new}=w_1^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^1 \\
& w_2^{new}=w_2^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^2 \\
& \vdots \\
& w_N^{new}=w_N^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^N
\end{aligned}
$$

Note that here we assume each sample has $N+1$ attribute features, i.e. $x_i = (x_i^{(0)},x_i^{(1)},x_i^{(2)},\dots,x_i^{(N)})$, as mentioned earlier. The superscript $old$ denotes the value of $w$ before the update iteration, $new$ denotes the value after it, and $\alpha$ is the learning rate, which controls how fast we learn and was explained in detail in the previous article on linear regression.
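Before moving on, here is a small sanity check (not in the original article; the data, names and tolerance are made up for illustration) that the analytic gradient $\sum_i (y_i-h_w(x_i))x_i$ really matches a finite-difference estimate of $\partial L/\partial w$:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(20, 3)                       # 20 samples, 3 attribute features
y = (np.random.rand(20) > 0.5).astype(float)     # random 0/1 labels
w = np.random.randn(3)

def log_likelihood(w):
    h = 1 / (1 + np.exp(-x @ w))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def analytic_gradient(w):
    h = 1 / (1 + np.exp(-x @ w))
    return x.T @ (y - h)                          # sum_i (y_i - h_w(x_i)) * x_i

eps = 1e-6
numeric = np.array([(log_likelihood(w + eps * e) - log_likelihood(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, analytic_gradient(w), atol=1e-4))   # expect True
```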

2. Solving the binary classification problem with Logistic regression

Having obtained the Logistic regression model, we can keep updating $w^{old}$ into $w^{new}$ iteratively and finally maximize the value of the loss function.

Next, we implement Logistic regression classification in Python code, focusing on the binary case. The data set is again randomly generated with NumPy; the code defining an establish_data method to generate the data set is as follows:

"" Author: Taoye wechat 下 载 : Coder Explain: Using NumPy to prepare data set Return: X_data: attribute characteristics of the sample y_label: the corresponding label of the attribute characteristics of the sample """ def establish_data(): X_data = np.concatenate((np.add(Np.random. Randn (50, 2), [1.5, 1.5])), Np.subtract (NP.random.randn (50, 2), [1.5, 1.5])), axis = 0) # Random generate data randomly, Y_label = Np.concatenate ((np.zeros([50]), NP.ones ([50])), Axis = 0) # concatenate merge data set return x_data.y_labelCopy the code

The distribution of the visualized data is as follows:

The figure above shows the rough distribution of the data and a straight line separating the two classes. Here, if we denote the input of the Sigmoid function by $z$, then $z = w_0x_0 + w_1x_1 + w_2x_2$, which can split the data. To capture the intercept of the line, we fix $x_0$ at the constant value 1, take $x_1$ to be the first attribute feature of the data set and $x_2$ the second. Setting $z=0$ gives the boundary equation $w_0 + w_1x_1 + w_2x_2 = 0$.

In this equation, what we know is the sample data, so the horizontal coordinate is $x_1$ and the vertical coordinate is $x_2$, the two attributes of a sample. The unknown parameters are $w_0, w_1, w_2$, which are the regression coefficients (optimal parameters) we need to find, i.e. the model parameters to be trained by the gradient ascent algorithm.
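For the plotting step later it also helps to solve the boundary equation for $x_2$ explicitly (assuming $w_2 \neq 0$); this is the line that the show_result method below will draw:

$$
x_2 = \frac{-w_0 - w_1x_1}{w_2}
$$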

Before starting to train the model parameters, let's write out the iterative update formula for $w$ once more:


$$
\frac{\partial L(w)}{\partial w}=\sum_{i=1}^{n}(y_i-h_w(x_i))\,x_i
$$

That is:


$$
\begin{aligned}
& w_0^{new}=w_0^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^0 \\
& w_1^{new}=w_1^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^1 \\
& w_2^{new}=w_2^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^2 \\
& \vdots \\
& w_N^{new}=w_N^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^N
\end{aligned}
$$

We know that a summation can be expressed with matrices or vectors, which is a form of vectorization; for example, $\sum_{i} w_i x_i$ can be written as the inner product $w^Tx$. The update above can likewise be vectorized, giving the following result:


$$
w^{new}=w^{old}+\alpha\, x^T(y-h_w(x))
$$

Let's explain this formula with an example, as checked in the snippet below. Suppose we have 100 data samples and each sample contains two attributes. Then $x$ denotes the whole sample set, with $x.shape=(100,2)$, so $x^T.shape=(2,100)$. Meanwhile $(y-h_w(x)).shape=(100,1)$, so after the multiplication the shape is $(2,1)$, exactly the same dimension as the $w$ vector we need.
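A throwaway NumPy check of those shapes (the numbers 100 and 2 simply mirror the example above; the data is random and meaningless):

```python
import numpy as np

x = np.random.randn(100, 2)                     # 100 samples, 2 attributes
y = np.random.randint(0, 2, (100, 1))           # labels as a (100, 1) column vector
w = np.ones((2, 1))
h = 1 / (1 + np.exp(-x @ w))                    # (100, 1) predicted probabilities
update = x.T @ (y - h)                          # (2, 100) @ (100, 1) -> (2, 1), same shape as w
print(x.T.shape, (y - h).shape, update.shape)   # (2, 100) (100, 1) (2, 1)
```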

Based on the vectorization results above, we define a gradient_ascent method to implement this functionality in code.

"" Author: Taoye wechat public id: Coder Explain: Sigmoid function Parameters: in_data: Sigmoid processing input data Return: "" def sigmoID (in_data): return 1 / (1 + np.exp(-in_data)) "" Author: Taoye Cynical Coder Explain: Logistic regression core method, mainly using the gradient rise algorithm Parameters: x_data sample set attributes y_label sample set label Return: weights: weights: "" def gradient_ascent(x_data, y_label): X_data, y_label = np.mat(x_data), np.mat(y_label). Data_number, attr_number = x_data.shape # learning_rate, max_iters, weights = 0.001, 500 Np.ones ([attr_number, 1]) # loss_list = list() for each_iter in range(max_iters): Max_iters sigmoid_result = sigmoid(x_data, Weights)) # sigmoID handle x*w difference = y_label - sigmoid_result # calculate the loss value weights = weights + learning_rate * Np.matmul (x_data.t, difference) # update weight w vector loss = Np.matmul (y_label.t, difference) np.log(sigmoid_result)) + np.matmul((1 - y_label).T, np.log(1 - sigmoid_result)) loss_list.append(loss.tolist()[0][0]) return weights.getA(), loss_listCopy the code

After the model parameters have been trained, we can visualize the result to get an intuitive feel for the Logistic regression classification. For this, define a show_result method to visualize the result:

"" Author: Taoye wechat public number: Cynic Coder Explain: visualized classification results, also known as Logistic regression visualized Parameters: X_data: attributes of the sample set y_label: labels of the sample set weights: parameters required by the model, i.e. weights "" def show_result(x_data, y_label, weights): Weights [1][0], weights[2][0], weights[2][0] Min_x_2 = np.min(x_data, axis = 0)[:-1] Line_x_1 = np.linspace(min_x_1-0.2, max_x_1 + 0.2, Scatter (x_data[:, 0], x_data[:, 1], scatter(x_data[:, 0], x_data[:, 1], Plt.plot (line_x_1, LINe_x_2) # Plot the decision line of classificationCopy the code

The visualized classification results are as follows:

The figure above contains two parts: one is the Logistic regression classification result, and the other is how the loss value changes after each iteration.

From the visualization, the classification works quite well: basically all data points are classified correctly. Readers can decide for themselves whether to fix a random seed, in order to observe the classification performance on different data sets.

From the second figure we can see that during Logistic regression training the value of the loss function keeps increasing, and the slope of that increase gradually decreases, especially during the first few iterations. Once the loss value has grown to a certain level it essentially saturates, which means the training process is basically over.

The figure above visualizes the value of the loss function. We can also observe the trend from the concrete values printed after each iteration:

The initial loss value is more than 300; after each iteration the loss value gradually increases, with the speed of increase falling all the while, and the final loss value settles at about 3, reaching saturation. This is the effect of the gradient ascent algorithm; in other words, the smaller the current loss value, the more obvious the improvement brought by each gradient ascent step.

Complete code:

```python
import numpy as np

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: prepare the data set with NumPy
Return:
    x_data: attribute features of the samples
    y_label: labels corresponding to the samples' attribute features
"""
def establish_data():
    # np.random.seed(1)
    x_data = np.concatenate((np.add(np.random.randn(50, 2), [1.5, 1.5]),
                             np.subtract(np.random.randn(50, 2), [1.5, 1.5])),
                            axis=0)                                   # randomly generate two clusters of sample points
    x_data = np.concatenate((x_data, np.ones([100, 1])), axis=1)      # append the constant feature x_0 = 1 as the last column
    y_label = np.concatenate((np.zeros([50]), np.ones([50])), axis=0) # first 50 samples get label 0, last 50 get label 1
    return x_data, y_label

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: the Sigmoid function
Parameters:
    in_data: input data to be processed by Sigmoid
Return:
    the Sigmoid of the input
"""
def sigmoid(in_data):
    return 1 / (1 + np.exp(-in_data))

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: the core of Logistic regression, mainly the gradient ascent algorithm
Parameters:
    x_data: attribute features of the sample set
    y_label: labels of the sample set
Return:
    weights: the trained weights
    loss_list: the loss value recorded after each iteration
"""
def gradient_ascent(x_data, y_label):
    x_data, y_label = np.mat(x_data), np.mat(y_label).T                        # convert to matrices; labels become a column vector
    data_number, attr_number = x_data.shape                                     # number of samples and of attributes
    learning_rate, max_iters, weights = 0.001, 500, np.ones([attr_number, 1])   # learning rate, iterations, initial weights
    loss_list = list()
    for each_iter in range(max_iters):                                          # iterate max_iters times
        sigmoid_result = sigmoid(np.matmul(x_data, weights))                    # apply sigmoid to x * w
        difference = y_label - sigmoid_result                                   # difference between labels and predictions
        weights = weights + learning_rate * np.matmul(x_data.T, difference)     # gradient-ascent update of the weight vector w
        loss = np.matmul(y_label.T, np.log(sigmoid_result)) + np.matmul((1 - y_label).T, np.log(1 - sigmoid_result))   # loss value L(w)
        loss_list.append(loss.tolist()[0][0])
    return weights.getA(), loss_list

"""
Author: Taoye
WeChat official account: Cynical Coder
Explain: visualize the classification result of Logistic regression
Parameters:
    x_data: attribute features of the sample set
    y_label: labels of the sample set
    weights: the parameters (weights) required by the model
"""
def show_result(x_data, y_label, weights):
    from matplotlib import pyplot as plt
    w_1, w_2, w_0 = weights[0][0], weights[1][0], weights[2][0]   # coefficients of x_1 and x_2, plus the intercept (the constant column is last)
    min_x_1, min_x_2 = np.min(x_data, axis=0)[:-1]                # minima of the two attribute features
    max_x_1, max_x_2 = np.max(x_data, axis=0)[:-1]                # maxima of the two attribute features
    line_x_1 = np.linspace(min_x_1 - 0.2, max_x_1 + 0.2, 100)     # x_1 coordinates of the decision line
    line_x_2 = (-w_0 - w_1 * line_x_1) / w_2                      # x_2 coordinates, from w_0 + w_1*x_1 + w_2*x_2 = 0
    plt.scatter(x_data[:, 0], x_data[:, 1], c=y_label)            # scatter the samples, colored by label
    plt.plot(line_x_1, line_x_2)                                  # plot the classification decision line
    plt.show()

if __name__ == "__main__":
    x_data, y_label = establish_data()
    weights, loss_list = gradient_ascent(x_data, y_label)
    show_result(x_data, y_label, weights)
    # from matplotlib import pyplot as plt
    # plt.plot(np.arange(len(loss_list)), loss_list)
```

That is all for Logistic regression in this article. Let's summarize the implementation process:

First, through analysis and derivation (using the maximum likelihood method), we obtained the loss function $L(w)$ of Logistic regression:


$$
L(w)=\sum_{i=1}^{n}\big[\,y_i\log(h_w(x_i))+(1-y_i)\log(1-h_w(x_i))\,\big]
$$

Second, in order to map the result of the inner product of $w$ and $x$ into the range between 0 and 1, so that it reflects the character of a probability, we introduced the Sigmoid function to process the inner product:


$$
h_w(x)=g(w^Tx)=\frac{1}{1+e^{-w^Tx}}
$$

After introducing the Sigmoid function, the loss function is simplified to obtain:


$$
\begin{aligned}
L(w) & = \sum_{i=1}^{n} \big[\,y_i\log(h_w(x_i))+(1-y_i)\log(1-h_w(x_i))\,\big] \\
& = \sum_{i=1}^{n}\Big[\,y_i\log\frac{h_w(x_i)}{1-h_w(x_i)}+\log(1-h_w(x_i))\,\Big] \\
& = \sum_{i=1}^{n}\big[\,y_i(w^Tx_i)-w^Tx_i-\log(1+e^{-w^Tx_i})\,\big]
\end{aligned}
$$

Finally, because we need to keep updating the parameters $w$ with the gradient ascent algorithm, we take the derivative of $L(w)$ with respect to $w$, with the following result:


$$
\frac{\partial L(w)}{\partial w}=\sum_{i=1}^{n}(y_i-h_w(x_i))\,x_i
$$

In this way, we can optimize the $w$ parameters step by step using the gradient, mainly via the gradient ascent algorithm:


$$
w^{new}=w^{old}+\alpha\, x^T(y-h_w(x))
$$

That is, each element inside the $w$ vector is updated:


$$
\begin{aligned}
& w_0^{new}=w_0^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^0 \\
& w_1^{new}=w_1^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^1 \\
& w_2^{new}=w_2^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^2 \\
& \vdots \\
& w_N^{new}=w_N^{old}+\alpha\sum_{i=1}^n(y_i-h_w(x_i))x_i^N
\end{aligned}
$$

When the number of iterations is large enough, the $w$ vector obtained from the final update is our parameter result. With these parameters we can construct a model that separates the data set, thereby classifying the data.

Logistic regression mainly relies on the gradient ascent algorithm. In the practical case above, the randomly generated data set is not particularly large, so training is quite fast. But if we had a very large number of training samples, training efficiency would drop, and the gradient ascent algorithm itself would need some optimization; the common choice here is the stochastic gradient ascent algorithm. Because of limits of space and time, we will have a chance to work through it later.
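As a small teaser of that idea, here is a minimal sketch (not the author's code; the decaying step size and iteration counts are arbitrary choices for illustration) of what a stochastic gradient ascent update can look like: instead of summing over all samples, the weights are nudged one randomly chosen sample at a time.

```python
import numpy as np

def stochastic_gradient_ascent(x_data, y_label, max_iters=150):
    """Sketch of stochastic gradient ascent: update w with one sample at a time."""
    data_number, attr_number = x_data.shape
    weights = np.ones(attr_number)
    for each_iter in range(max_iters):
        for i in np.random.permutation(data_number):           # visit the samples in a random order
            learning_rate = 4 / (1 + each_iter + i) + 0.01     # shrink the step size as training proceeds
            h = 1 / (1 + np.exp(-np.dot(x_data[i], weights)))  # prediction for this single sample
            weights = weights + learning_rate * (y_label[i] - h) * x_data[i]   # single-sample update
    return weights
```

For example, `stochastic_gradient_ascent(*establish_data())` would return a comparable weight vector to the one produced by gradient_ascent above, while touching only one sample per update.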

This is the ninth article in the hand-tearing machine learning series, and the series is nearly at an end. The preliminary plan is to finish it this week, because there are many tasks waiting behind it that have not even started. Thinking of this, an emotionless tear slips from the corner of Taoye's eye.

Of course, life is limited and knowledge is also limited. Learning itself is an endless process, so the machine learning algorithms and the knowledge involved here are only the tip of the iceberg in this field. The most important thing for us is to keep a positive heart.

I am Taoye. I love studying and sharing, and I am keen on all kinds of technology. Outside of study I like anime, playing chess, listening to music, and chatting, and I hope to use these articles to record my growth and bits of everyday life, and to meet more like-minded friends along the way. You are welcome to visit my WeChat official account: Cynical Coder.

I’ll see you next time. Bye

References:

[1] Machine Learning in Action, Peter Harrington, Posts and Telecommunications Press
[2] Statistical Learning Methods, Li Hang, 2nd edition, Tsinghua University Press

Recommended reading

  • Machine Learning in Action – An introduction to those things about linear regression
  • Machine Learning in Action – Bayes in plain words: should the "melon-eating" masses pick a good melon or a bad one
  • Machine Learning in Action – A female classmate asks Taoye how to play KNN to clear the level
  • Machine Learning in Action – Understand what you know and understand what you don't: nonlinear support vector machines
  • Machine Learning in Action – Taoye takes a look at support vector machines
  • print("Hello, NumPy!")
  • Not good at anything else, but first place at eating
  • Taoye infiltrated the headquarters of a shady platform, and the truth behind it is chilling
  • "Tai Hua database" – what actually happens underneath when a SQL statement executes?