This article was first published on my personal WeChat public account: TechFlow. Original writing is not easy; please consider following.


In the previous article, we derived the formula of linear regression. The essence of linear regression is a linear function; the principle of the model is not difficult, and the core lies in the process of solving the model parameters. Through deriving and studying linear regression, we gained a basic understanding of how a machine learning model learns, which is the essence of machine learning and far more important than the principle of any single model.

For those of you who are new or need a refresher, please click the links below to review the previous articles on linear regression and gradient descent.

A thorough look at the gradient descent method

Deriving the linear regression model in detail


Regression and classification


In machine learning, models are divided into two types according to what they predict. If we want the model to predict one or more continuous values, the problem is called a regression problem; common examples are estimating future stock prices or future temperatures. The other type is the classification problem, where the model's prediction is a discrete value, that is, there are only a fixed number of possible results. Common examples include spam identification and detecting inappropriate (NSFW) images on web pages.

As the name implies, the linear regression we introduced before solves a regression problem. Today's article is about how to transform that regression model into a classification model, namely the logistic regression derived from linear regression.


Logistic regression


Logistic regression is a curious model: although it is regression-like in nature and its name contains the word "regression", it is actually a classification model, which can be rather puzzling.

If you are a beginner, feeling a bit dizzy at this point is perfectly normal. It doesn't matter; let's sort it out step by step.

Let's go back to linear regression. As we all know, with linear regression we can compute y for a given X based on W and b, where y is a continuous value, so it is a regression model. But what if we want this model to do classification? It is easy to imagine setting a threshold manually: say, samples with y greater than 0 belong to category 1, and samples with y less than 0 belong to category 0. On the surface this certainly works, but in practice there are many problems with it.

The biggest problem is that if we simply set a threshold for judgment, the final y becomes a piecewise function, and a discontinuous one at that, so we have no way to compute its gradient. To solve this problem, we have to find a smooth function that can be used for classification while still allowing gradients to be computed.

Soon, scientists found such a function, the sigmoid function, whose expression is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Its function graph is as follows:

As you can see, the sigmoid function takes the value 0.5 at x = 0, its limit is 1 at positive infinity and 0 at negative infinity, and the function is continuous and differentiable everywhere. The sigmoid function's values range from 0 to 1, which makes it very suitable for expressing the probability that something happens. We treat $\sigma(x)$ as the probability that the event happens, so the probability that it does not happen is $1 - \sigma(x)$. If we regard occurrence and non-occurrence as two categories, the sigmoid function turns into a classification function: if $\sigma(x) \ge 0.5$ we predict category 1, otherwise category 0.

It is then easy to take this one step further: linear regression gives us $y = w^T x + b$. In other words, we wrap a sigmoid function around the linear regression model, $\hat{y} = \sigma(w^T x + b)$. By computing different values of y we obtain different probabilities, and finally different classification results.
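As a minimal sketch of this pipeline (the weights w, bias b, and sample x below are made-up numbers, purely for illustration):

import numpy as np

def sigmoid(x):
    # Map any real value into the (0, 1) interval
    return 1.0 / (1 + np.exp(-x))

# Hypothetical parameters and a single sample with two features
w = np.array([1.5, -0.8])
b = 0.2
x = np.array([0.9, 0.6])

prob = sigmoid(np.dot(w, x) + b)   # probability of belonging to category 1
label = 1 if prob >= 0.5 else 0    # threshold the probability at 0.5
print(prob, label)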


Loss function


High-energy derivation ahead; I believe that after reading it you will hit the triple combo (like, share, follow).

So let's get started by defining the notation. To distinguish things, we write the matrix of the training samples' true labels as $Y$ and a single sample's true label as $y$; a single predicted result is written as $\hat{y}$, and all the predicted results together as $\hat{Y}$.

For a single sample, $y$ has two possible values: it can be 1 or 0, where 1 and 0 represent two different categories. We hope that when $y = 1$, $\hat{y}$ is as large as possible, and when $y = 0$, $\hat{y}$ is as small as possible; this makes sense because $\hat{y}$ always lies between 0 and 1. Let's use one formula to unify the two cases:

$$P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{\,1 - y}$$
Let's plug in $y = 0$: the former term becomes 1, and the expression is left with only the latter term, $1 - \hat{y}$. Similarly, for $y = 1$ the latter term becomes 1, leaving only the former term, $\hat{y}$. So this expression represents the probability of predicting correctly, and we want that probability to be as high as possible. Obviously $P > 0$, so we can take its logarithm, because the log function is monotonic: $P$ reaches its maximum exactly when $\log P$ reaches its maximum.

$$\log P(y \mid x) = y \log \hat{y} + (1 - y) \log (1 - \hat{y})$$
We want this quantity to be as large as possible, which is the same as wanting its negative to be as small as possible. So we take the negative and average it over the $m$ training samples, obtaining the loss function:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big]$$
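To get a feel for the numbers: for a single sample with $y = 1$, a prediction of $\hat{y} = 0.9$ costs $-\log 0.9 \approx 0.105$, while $\hat{y} = 0.1$ costs $-\log 0.1 \approx 2.303$. Confident correct predictions are barely penalized; confident wrong ones are penalized heavily.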
If you know the concept of cross entropy, you will notice that this loss function is in fact a cross entropy. Cross entropy measures the "distance" between two probability distributions: the smaller the cross entropy, the closer the two distributions, which is why it is often used as the loss function of classification models. We will not go into the concept of cross entropy here; it will be introduced in detail in a later article. It is no coincidence that the loss function we just derived turns out to be the cross entropy; underneath there is a set of mathematical logic supported by information theory. We will not expand on it here, but those who are interested can look into it.
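For reference only, the textbook definition of the cross entropy between a true distribution $p$ and a predicted distribution $q$ is

$$H(p, q) = -\sum_{x} p(x) \log q(x),$$

and in the binary case, with $p = (y,\, 1 - y)$ and $q = (\hat{y},\, 1 - \hat{y})$, it reduces to exactly the per-sample loss above.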


Hardcore derivation


Now that we have the loss function, we just have to find the gradient to do gradient descent.

This function looks very complex, and taking its partial derivative directly is quite hard (danger ahead); students who have not touched calculus for a long time may well grind away at it without getting anywhere.

To simplify things, let's do some preparatory work first. Let's start with the sigmoid function: it has a rather awkward form on its own, so let's first take care of its derivative:

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\big(1 - \sigma(x)\big)$$
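As a quick numerical check of this identity (a throwaway sketch, not part of the model code), we can compare the analytic form with a finite-difference approximation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

x = 0.7
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central finite difference
analytic = sigmoid(x) * (1 - sigmoid(x))                     # the identity we just derived
print(numeric, analytic)  # the two values agree to about six decimal places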
Because $\hat{y} = \sigma(\theta^T x)$, we plug it into the loss function and get

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log \sigma(\theta^T x_i) + (1 - y_i) \log \big(1 - \sigma(\theta^T x_i)\big) \Big],$$

where $\theta$ is the parameter vector, with the bias folded in as the weight of a constant feature equal to 1.
Then we compute the partial derivative of $J$ with respect to $\theta$, plugging in the conclusion about the sigmoid derivative from above:

$$\frac{\partial J}{\partial \theta} = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \big(1 - \sigma(\theta^T x_i)\big) - (1 - y_i)\, \sigma(\theta^T x_i) \Big] x_i = \frac{1}{m} \sum_{i=1}^{m} \big(\hat{y}_i - y_i\big)\, x_i$$
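In matrix form (this vectorized rewrite is my own restatement of the sum above, with $X$ denoting the design matrix whose rows are the samples), the gradient can be written as

$$\nabla_{\theta} J = \frac{1}{m} X^T \big(\sigma(X\theta) - Y\big),$$

which is exactly the expression the code in the next section implements, up to the constant factor $\frac{1}{m}$ that gets absorbed into the learning rate.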
Hands-on with the code


Now that the gradient formula has been derived, can writing the code to implement it be far behind?

But you cannot make bricks without straw, so before we build the model, let's first generate some data.

We choose a very simple scenario from everyday life: exams. Suppose each student takes two tests, and the two scores are added together to get the final score. We have data on whether a batch of students passed, and we hope to design a logistic regression model that tells us directly whether a student passes.

To prevent the sigmoid function from producing skewed results, we scale each course's score into the (0, 1) range (the simulated scores below fall between 0.5 and 1). A combined score of 1.4 or more, i.e. 140 out of 200 before scaling, is considered an overall pass.

import numpy as np

def load_data():
    # Two course scores per student, scaled into (0.5, 1)
    X1 = np.random.rand(50, 1) / 2 + 0.5
    X2 = np.random.rand(50, 1) / 2 + 0.5
    y = X1 + X2
    # A combined (scaled) score above 1.4 counts as a pass
    Y = y > 1.4
    # Convert the booleans into 0/1 labels
    Y = Y + 0
    # Prepend a constant-1 column for the bias term
    x = np.c_[np.ones(len(Y)), X1, X2]
    return x, Y

The training data obtained this way has two features, namely the student's scores on the two courses, plus a constant column of 1s used to absorb the bias (intercept) term.
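A quick way to inspect what load_data produces (just a sanity-check sketch of the shapes the code above yields):

x, Y = load_data()
print(x.shape, Y.shape)   # (50, 3) and (50, 1)
print(x[:3])              # first column is the constant 1, then the two course scores
print(Y[:3])              # 0/1 labels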

Then, following the formulas above, it is not hard (really not hard) to implement the sigmoid function and gradient descent. The function below is named gradAscent because it climbs the log-likelihood, which is equivalent to descending the loss we derived.

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def gradAscent(data, classLabels):
    X = np.mat(data)
    Y = np.mat(classLabels)
    m, n = np.shape(X)
    # Learning rate and number of iterations
    alpha = 0.01
    iterations = 10000
    # Weight vector, initialized to ones
    wei = np.ones((n, 1))
    for i in range(iterations):
        y_hat = sigmoid(X * wei)
        error = Y - y_hat
        # Move along the gradient of the log-likelihood (equivalent to descending the loss)
        wei = wei + alpha * X.transpose() * error
    return wei

This function performs batch gradient descent. Students familiar with Numpy can see that it is a direct implementation of the formula.
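As a small usage sketch (the accuracy check here is an extra convenience, not something the article depends on), we can fit the weights and see how many training samples end up classified correctly:

dataMat, labelMat = load_data()
wei = gradAscent(dataMat, labelMat)

# Predicted probabilities and 0/1 labels on the training set
probs = sigmoid(np.mat(dataMat) * wei)
preds = (probs >= 0.5) + 0
accuracy = np.mean(preds == np.mat(labelMat))
print(wei.T, accuracy)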

Finally, we plotted the data set and the logistic regression line.

import matplotlib.pyplot as plt
def plotPic():
    dataMat, labelMat = load_data()
    # Get gradient
    weis = gradAscent(dataMat, labelMat).getA()
    n = np.shape(dataMat)[0]
    xc1 = []
    yc1 = []
    xc2 = []
    yc2 = []
    # Data is divided into positive and negative categories
    for i in range(n):
        if int(labelMat[i]) == 1:
            xc1.append(dataMat[i][1])
            yc1.append(dataMat[i][2])
        else:
            xc2.append(dataMat[i][1])
            yc2.append(dataMat[i][2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xc1, yc1, s=30, c='red', marker='s')
    ax.scatter(xc2, yc2, s=30, c='green')
    # Draw the dividing line
    x = np.arange(0.5, 1.0, 0.1)
    y = (-weis[0]- weis[1]*x)/weis[2]
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()

The final result is as follows:


Stochastic gradient descent version


We can see that after 10,000 iterations, the model we obtain classifies every sample correctly.

We just implemented the full-batch version of gradient descent, and we can also optimize it with stochastic gradient descent. The optimization is very simple: instead of computing the gradient over all the data, we pick a single sample from the data set at each step and compute the gradient on it.
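Written out for a single randomly chosen sample $i$ (this is just the one-sample special case of the batch update above), the update becomes

$$\theta \leftarrow \theta + \alpha \, (y_i - \hat{y}_i) \, x_i,$$

which is exactly what the inner loop of the code below does.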

Basically, you can reuse the gradient descent code; you just need to add the logic for picking a random sample.

def stocGradAscent(data, classLab, iter_cnt):
    X = np.mat(data)
    Y = np.mat(classLab)
    m, n = np.shape(X)
    weis = np.ones((n, 1))
    for j in range(iter_cnt):
        for i in range(m):
            # Decay the learning rate with an inverse-proportional schedule
            alpha = 4 / (1.0 + j + i) + 0.01
            # Pick one sample at random and update on it alone
            randIdx = int(np.random.uniform(0, m))
            y_hat = sigmoid(np.sum(X[randIdx] * weis))
            error = Y[randIdx] - y_hat
            weis = weis + alpha * X[randIdx].transpose() * error
    return weis

We set the number of iterations to 2,000; the resulting separating line is shown below:
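For completeness, a sketch of how that run might look (plotPic above is wired to gradAscent, so here the stochastic version is called directly; 2,000 is the iteration count mentioned above):

dataMat, labelMat = load_data()
weis = stocGradAscent(dataMat, labelMat, 2000)
print(weis.T)  # one weight for the bias column plus one per course score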

Of course, the code above is not perfect. It is a simple demo, and there is plenty of room for improvement and optimization. It is only an example, meant to give you an intuition: writing a model yourself is not hard, and deriving the formulas is genuinely fun. That is why I started the advanced-math topic. A lot of computer science knowledge is thought through the same way, and the moments of insight during learning are truly enjoyable; I hope everyone finds their own fun in it.

That is all for today's article. If you feel you have gained something, please scan the QR code and follow the account; your support is very important to me.