
Typical classification algorithms are supervised learning: each training sample carries both its features and its label. In binary classification the labels are discrete values, such as {-1, +1}, representing the negative and positive classes respectively. By learning from the training samples, a classification algorithm obtains the mapping from sample features to sample labels, also known as the hypothesis function, which can then be used to predict the category of new samples.

Logistic Regression

A classification problem usually falls into one of two cases: linearly separable and linearly inseparable. The Logistic Regression model is one of the generalized linear models. In the linearly separable case, a hyperplane can be found that separates the two classes. Taking data with only one dimension as an example, the separating hyperplane has the form:

$$wx + b = 0$$

where w is the weight and b is the bias; in the multidimensional case, w and x become vectors (b remains a scalar). The algorithm learns this hyperplane from the training samples and uses it to divide the data into two different categories. A threshold function can then be introduced to map samples to the different categories; the most commonly used one is the Sigmoid function, which has the form:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The function is visualized as follows:

[Figure: the Sigmoid function, an S-shaped curve rising from 0 to 1]

As can be seen from the plot, the range of this function is (0, 1), and it changes most rapidly near 0. Its derivative is:

$$\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$

Now let’s implement this function in code:

import numpy as np

def sigmoid(x):
    # Sigmoid function: maps any real value into (0, 1)
    return 1.0 / (1 + np.exp(-x))
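As a quick sanity check (a small sketch that assumes only the sigmoid function just defined), we can evaluate it at a few points and verify the derivative formula numerically:

# Sanity check of sigmoid and its derivative (illustrative only)
x = np.array([-5.0, 0.0, 5.0])
s = sigmoid(x)
print(s)                                      # roughly [0.0067, 0.5, 0.9933], all inside (0, 1)

analytic = s * (1 - s)                        # sigma'(x) = sigma(x) * (1 - sigma(x))
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(analytic, numeric))         # True: matches a finite-difference estimate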

For input vector X, the probability that it belongs to the positive class is:

$$P(y = 1 \mid x) = \sigma(wx + b)$$

where σ stands for the Sigmoid function. The probability that the input vector X belongs to the negative class is then:

$$P(y = 0 \mid x) = 1 - \sigma(wx + b)$$
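To make this concrete, here is a minimal sketch of how both class probabilities could be computed for a single sample (the function name class_probabilities and its arguments are illustrative, not from the article):

def class_probabilities(x, w, b):
    # x: feature vector, w: weight vector, b: bias (all names illustrative)
    p_pos = sigmoid(np.dot(w, x) + b)   # P(y = 1 | x)
    p_neg = 1 - p_pos                   # P(y = 0 | x)
    return p_pos, p_neg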

Loss function

For Logistic Regression, combining the two class probabilities of the input vector X above gives a single probability expression:

$$P(y \mid x) = \sigma(wx + b)^{\,y}\,\bigl(1 - \sigma(wx + b)\bigr)^{1 - y}, \quad y \in \{0, 1\}$$

In this form, w and b are difficult to solve for directly, so they are estimated by maximum likelihood. Taking the negative log of the likelihood function turns the problem into a convex one (this relies on convex optimization theory), which means there is an optimal solution for w and b that can be found with relatively simple methods. The resulting negative log-likelihood, averaged over the m training samples, is:

$$l_{w,b} = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[\,y_i \log h_i + (1 - y_i)\log(1 - h_i)\,\Bigr]$$

where h_i is the Sigmoid output for the i-th input vector, so we only need to find the w and b that minimize l_{w,b}.
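For reference, a vectorized sketch of this loss (the article's own loop-based implementation appears later as the error function; the name nll_loss and the assumption that labels are in {0, 1} follow the formula above):

def nll_loss(h, label):
    # h: m x 1 predicted probabilities, label: m x 1 labels in {0, 1}
    m = np.shape(h)[0]
    ll = np.multiply(label, np.log(h)) + np.multiply(1 - label, np.log(1 - h))
    return float(-ll.sum() / m)   # average negative log-likelihood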

Gradient descent method

To minimize the loss function we can use gradient descent, an iterative method. Its advantage is that it only requires the first derivative of the loss function, so its computational cost per iteration is lower than that of Newton's method, which makes it widely used on large-scale data sets. The idea is that, starting from an initial point, each iteration chooses a descent direction (the negative gradient) and updates the parameters along it.
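In other words, every iteration applies an update of the form theta <- theta - alpha * gradient; a one-line sketch with placeholder names (theta, grad_fn, alpha are not from the article):

def gd_step(theta, grad_fn, alpha):
    # move the parameters against the gradient, scaled by the learning rate
    return theta - alpha * grad_fn(theta)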

The gradient expressions with respect to the two variables are:

$$\frac{\partial l_{w,b}}{\partial w} = -\frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - h_i\bigr)\,x_i$$

$$\frac{\partial l_{w,b}}{\partial b} = -\frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - h_i\bigr)$$
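A minimal vectorized sketch of the gradient (assuming, as in the next paragraph and in the training code below, that b is absorbed into w by adding a constant feature of 1 to every sample; the name loss_gradient is illustrative):

def loss_gradient(feature, label, w):
    # feature: m x n matrix including a constant column of 1s, label: m x 1 in {0, 1}, w: n x 1
    m = np.shape(feature)[0]
    h = sigmoid(feature * w)              # predicted probabilities
    return -feature.T * (label - h) / m   # gradient of l_{w,b} with respect to w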

where b can be regarded as the first component of w (by adding a constant feature of 1 to each input vector), and its update formula, with learning rate α, is:

$$w \leftarrow w - \alpha\,\frac{\partial l_{w,b}}{\partial w}$$

Now you can implement the training process in code:

def lr_gd(feature, label, maxCycle, alpha):
    '''Train Logistic Regression with gradient descent.
    input:  feature  - sample feature matrix (m x n)
            label    - sample labels (m x 1)
            maxCycle - maximum number of iterations
            alpha    - learning rate
    output: w        - learned weights (n x 1)
    '''
    n = np.shape(feature)[1]
    w = np.mat(np.ones((n, 1)))          # initialize all weights to 1
    i = 0
    while i <= maxCycle:
        i += 1
        h = sigmoid(feature * w)         # predicted probabilities for all samples
        err = label - h                  # prediction error (y - h)
        if i % 100 == 0:
            print("\t---------iter=" + str(i) +
                  ", train error rate= " + str(error(h, label)))
        w = w + alpha * feature.T * err  # gradient-descent update (the 1/m factor is absorbed into alpha)
    return w

Here error is the error function: each iteration needs to compute the error of the current model. Its implementation is as follows:

def error(h, label):
    '''Compute the average cross-entropy error of the current model.
    input:  h     - predicted values
            label - actual values
    output: sum_err / m - average error
    '''
    m = np.shape(h)[0]
    sum_err = 0.0
    for i in range(m):
        if h[i, 0] > 0 and (1 - h[i, 0]) > 0:
            sum_err -= (label[i, 0] * np.log(h[i, 0]) +
                        (1 - label[i, 0]) * np.log(1 - h[i, 0]))
        else:
            sum_err -= 0
    return sum_err / m
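Putting the pieces together, here is a hedged end-to-end sketch on synthetic data: build a toy data set, train with lr_gd, and classify new samples by thresholding the predicted probability at 0.5, as the introduction describes. The toy data, the hyperparameters, and the predict helper are all illustrative and not part of the original article.

# End-to-end sketch on synthetic data (illustrative only)
np.random.seed(0)
m = 200
x_raw = np.random.randn(m, 2)                                     # two input features
y = (x_raw[:, 0] + x_raw[:, 1] > 0).astype(float).reshape(m, 1)   # labels in {0, 1}

feature = np.mat(np.hstack([np.ones((m, 1)), x_raw]))             # constant 1 as the first column, so b is the first component of w
label = np.mat(y)

w = lr_gd(feature, label, 1000, 0.01)                             # train; prints the error every 100 iterations

def predict(feature, w, threshold=0.5):
    # classify as positive when P(y = 1 | x) >= threshold
    return (sigmoid(feature * w) >= threshold).astype(int)

print(predict(feature[:5], w))                                    # predicted classes for the first five samples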

Here the whole process is basically over ~
