
In the previous section, we described the mathematical representation of linear regression and concluded that training a linear regression model is an optimization problem: finding the parameters that minimize the loss function.

Normal Equation

For the loss function, we can set its derivative to zero to find its extreme point.

Univariate linear regression

Let's say our data are one-dimensional and our model is a straight line $f(x) = wx + b$. We have $m$ training samples, and the loss function is the average of the sum of squared errors:

$$L(w, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - y_i\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(wx_i + b - y_i\right)^2$$

When the partial derivatives with respect to $w$ and $b$ are zero, the loss function reaches its minimum.

$$\frac{\partial L}{\partial w} = \frac{2}{m}\sum_{i=1}^{m}\left(wx_i + b - y_i\right)x_i = 0$$

$$\frac{\partial L}{\partial b} = \frac{2}{m}\sum_{i=1}^{m}\left(wx_i + b - y_i\right) = 0$$

The two formulas above are the partial derivatives of the loss function with respect to $w$ and $b$. Setting both derivatives to zero gives the minimum of the loss function, that is, the optimal $w$ and $b$ can be obtained from these two equations.

The optimal solution is:

$$w = \frac{\sum_{i=1}^{m} y_i\left(x_i - \bar{x}\right)}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}$$

$$b = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - w x_i\right)$$

where $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ is the mean of the $x_i$.

The above is the least-squares solution process for linear regression. Many machine learning models go through the same process: determine a loss function, then find the parameters that minimize it. The solution involves only some simple calculus, so reviewing partial derivatives will help you understand the mathematics of machine learning.
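
To make the closed-form solution concrete, here is a minimal NumPy sketch; the toy data and variable names are illustrative only:

```python
import numpy as np

# toy one-dimensional data, roughly y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.0, 9.2, 10.9])
m = len(x)

# closed-form least-squares solution from the formulas above
x_mean = x.mean()
w = np.sum(y * (x - x_mean)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
b = np.mean(y - w * x)

print(w, b)  # w is close to 2, b is close to 1
```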

Multiple linear regression

More generally, features are multidimensional:

$$f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

In the formula above, we actually use $w_{n+1}$ to represent the bias $b$: we append it to the parameter vector and expand the $n$-dimensional feature into $n+1$ dimensions by adding a component with constant value 1 to every $x$. The shape of each vector and matrix is shown in the formula below.

$$f(X) = XW, \quad W = \begin{pmatrix} w_1 \\ \vdots \\ w_n \\ w_{n+1} \end{pmatrix}, \quad X = \begin{pmatrix} x_1^{(1)} & \cdots & x_n^{(1)} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_1^{(m)} & \cdots & x_n^{(m)} & 1 \end{pmatrix}, \quad y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

Here, the vector $W$ represents the weight of each feature in the model; each row of the matrix $X$ is a sample, and the $n+1$ components of a row are the values of that sample in different dimensions; the vector $y$ holds the true values. The summation can be written as an inner product: $f(x) = W^\top x$. For the whole dataset it can be expressed as matrix multiplication: $f(X) = XW$.


The machine learning field generally prefers to express models in matrix-multiplication form, not only because it is concise, but also because modern computers have heavily optimized vector computation: both CPUs and GPUs process this kind of calculation in parallel, which can yield speedups of hundreds of times or more.
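
As a rough illustration of the matrix form (the data and names below are made up), the vectorized prediction `X @ W` matches a pure-Python double loop while being far faster:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))              # 1000 samples, 3 features
X = np.hstack([X, np.ones((1000, 1))])      # append the constant-1 column for the bias
W = rng.normal(size=(4, 1))                 # weight vector; the last entry plays the role of b

# vectorized: one matrix multiplication predicts all samples at once
y_pred = X @ W                              # shape (1000, 1)

# equivalent but much slower pure-Python double loop
y_loop = np.array([[sum(X[i, j] * W[j, 0] for j in range(4))] for i in range(1000)])

assert np.allclose(y_pred, y_loop)
```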

Note that in the formulas, non-bold symbols represent scalars, while bold symbols represent vectors or matrices.

Compared with univariate linear regression, the optimal solution of multiple linear regression is more complex: it is not a straight line but a hyperplane in a multi-dimensional space, with the training data scattered on both sides of the hyperplane.

Multiple linear regression generally seeks the optimal hyperplane


The loss function of multiple linear regression is still computed from the square of "predicted value − true value". In vector form, the loss of the whole model is:

$$L(W) = \left\| XW - y \right\|_2^2$$

The double vertical bars denote the square of the L2 norm. A norm is usually applied to vectors and is a mathematical notation often used in machine learning. The following formula shows the squared L2 norm of a vector $x$ and its derivative:

$$\left\| x \right\|_2^2 = x_1^2 + x_2^2 + \cdots + x_n^2 = x^\top x, \qquad \frac{\partial \left\| x \right\|_2^2}{\partial x} = 2x$$

Taking the derivative of the linear regression loss function with respect to the vector $W$ and setting it to zero:

$$\frac{\partial L(W)}{\partial W} = 2X^\top\left(XW - y\right) = 0$$

$$X^\top X W = X^\top y$$

$$W^* = \left(X^\top X\right)^{-1} X^\top y$$

The last formula gives the optimal $W^*$. This is a matrix equation: finding the optimal solution this way essentially means solving this matrix equation, which is called the Normal Equation.
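
Here is a minimal sketch of solving the Normal Equation with NumPy on synthetic data; the choice of np.linalg.solve instead of an explicit matrix inverse is just an illustrative implementation detail:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.hstack([X, np.ones((100, 1))])            # constant-1 column for the bias
true_W = np.array([[1.5], [-2.0], [0.5], [3.0]])
y = X @ true_W + rng.normal(scale=0.01, size=(100, 1))

# normal equation: (X^T X) W = X^T y
W = np.linalg.solve(X.T @ X, X.T @ y)
print(W.ravel())                                 # close to [1.5, -2.0, 0.5, 3.0]
```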

There is a problem with this method, which you may remember from linear algebra: the equation can be solved only when $X^\top X$ is full-rank or positive definite. What exactly does "full rank" or "positive definite" mean? In layman's terms, the samples must be rich and representative enough for the matrix equation to have a unique solution; otherwise the matrix equation has multiple sets of solutions. If the features are very high-dimensional but we only have dozens of samples to train on, we can hardly obtain a satisfactory optimal solution.

There is another problem with this method: computing the matrix inverse in the formula is expensive, with complexity on the order of $O(n^3)$. When the feature dimension reaches the millions or the number of samples is very large, the computation takes a very long time, and a single machine's memory may not even hold all these parameters, so solving the matrix equation directly is not practical.

I spent some time describing the solution process of linear regression, and many formulas appeared along the way. Like many readers, the author used to hate reading formulas; whenever he saw them his head would spin, and machine learning felt very difficult. However, if you calm down and read them carefully, you will find that these formulas only use relatively basic parts of calculus and linear algebra; no lofty knowledge is required, and readers with a science or engineering background should be able to understand them. In addition, reviewing matrices and derivatives will help us understand some of the mathematical principles of deep learning.

Gradient descent method

When minimizing the loss function, that is, solving the optimization problem of minimizing the loss function, a search method is often used: an initial point is selected as the starting point and the search continues from there, making the loss function gradually smaller; when the stopping condition of the search iteration is reached, that position is the final result of the search algorithm. Concretely, we make a random initial guess for $W$ and then keep adjusting the value of $W$ so that $L(W)$ gradually becomes smaller, ideally until we find the $W$ that minimizes $L(W)$.

Specifically, we can consider Gradient Descent: starting from some initial value of $W$, we gradually update the weights, each time overwriting the old value with the newly computed one:

$$W := W - \alpha \frac{\partial L(W)}{\partial W}$$

Here, $\alpha$ is known as the learning rate, and $\frac{\partial L(W)}{\partial W}$ is the gradient. From calculus we know that at any point, a function changes most rapidly along its gradient. Since we want to minimize the loss function $L(W)$, each update moves $W$ one step along the negative gradient, the direction in which $L(W)$ drops fastest.

Visually, the process of the loss function descending along the gradient is shown below. The iterative process eventually converges near the minimum, where the gradient or derivative approaches zero.

Returning to the learning rate $\alpha$: it expresses how much we trust the gradient at a given point. In general, $\alpha$ is a small positive number. A larger $\alpha$ means we want the loss function to decrease faster; a smaller $\alpha$ means we want it to decrease more slowly. If $\alpha$ is set improperly and each step is too large, the loss function may fail to converge quickly to the minimum. As shown below, the loss function may still struggle to converge to the minimum after a long time. In practice, $\alpha$ often changes with the number of iterations, for example larger at initialization and smaller later.
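
For example, one simple, purely illustrative decay schedule (not something prescribed by the article) looks like this:

```python
def decayed_learning_rate(alpha0, k, iteration):
    """Hypothetical schedule: start at alpha0 and shrink as iterations grow."""
    return alpha0 / (1.0 + k * iteration)

# alpha is larger at the beginning and smaller later
for t in range(0, 100, 20):
    print(t, decayed_learning_rate(0.1, 0.05, t))
```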

As mentioned earlier, $W$ is a vector; say it has $n+1$ dimensions. When updating $W$, we update all $n+1$ components at the same time, and the $j$-th component $W_j$ is updated using the weight-update formula above.

Next, we briefly derive the gradient formula. First, consider the case of only one training sample $(x, y)$, with loss $L(W) = \frac{1}{2}\left(f_W(x) - y\right)^2$, where $\frac{1}{2}$ is a constant factor that does not affect the location of the optimal solution and is introduced mainly for convenience in the derivation. We can get:

$$\frac{\partial}{\partial W_j} L(W) = \frac{\partial}{\partial W_j} \frac{1}{2}\left(f_W(x) - y\right)^2 = \left(f_W(x) - y\right) \cdot \frac{\partial}{\partial W_j}\left(f_W(x) - y\right) = \left(f_W(x) - y\right) x_j$$

For a single training sample, the gradient update rule is as follows:

$$W_j := W_j + \alpha \left(y - f_W(x)\right) x_j$$

This rule has several properties that seem natural and intuitive:

  • The size of the update is proportional to the error term $\left(y - f_W(x)\right)$.
  • When the predicted value $f_W(x)$ for a training sample is very close to the true value $y$, there is little need to modify the parameters; conversely, if the predicted value deviates greatly from the true value, the parameters need a larger adjustment. This is also consistent with the gradient descent animation shown above.
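
As a small illustrative sketch (the name sgd_step is made up here, not part of the implementation shown later), the single-sample update rule can be written as:

```python
import numpy as np

def sgd_step(W, x, y, alpha):
    """One LMS update for a single sample x (shape (n,)) with scalar target y."""
    error = y - x @ W                  # (y - f_W(x))
    return W + alpha * error * x       # W_j := W_j + alpha * (y - f_W(x)) * x_j

W = np.zeros(3)
x = np.array([1.0, 2.0, 1.0])          # the last component could be the constant-1 term
W = sgd_step(W, x, y=4.0, alpha=0.1)
print(W)                               # [0.4, 0.8, 0.4]
```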

Batch gradient descent method

When there is only one training sample, we derived the LMS (Least Mean Squares) rule. When a training set has $m$ training samples, the loss is

$$L(W) = \frac{1}{2}\sum_{i=1}^{m}\left(f_W(x^{(i)}) - y^{(i)}\right)^2$$

In the derivation, we only need to sum over the multiple training samples.

So we can get the derivative with respect to each $W_j$:

$$\frac{\partial L(W)}{\partial W_j} = \sum_{i=1}^{m}\left(f_W(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

Specifically, the algorithm is:

Repeat until convergence:
$$W_j := W_j + \alpha \sum_{i=1}^{m}\left(y^{(i)} - f_W(x^{(i)})\right) x_j^{(i)} \qquad (\text{for every } j)$$

This method, also known as Batch Gradient Descent (BGD), uses all samples of the entire training set to update the parameters at each iteration. The loss function of linear regression $L(W)$ is a convex quadratic function; the local minimum of a convex function is also its global minimum, so the optimization problem of linear regression has only one global solution. That is to say, provided the learning rate $\alpha$ is not set too large and the number of iterations is sufficient, gradient descent always converges to the global minimum.

Stochastic gradient descent

Batch gradient descent takes all samples into account when updating the parameters. When there is a large amount of data and many features, using the full data in every iteration is unrealistic. Moreover, the full data contains a lot of redundant information: the larger the dataset, the more redundancy, and that redundancy is not very helpful for finding the optimal solution. A compromise is to take only a random subset of samples each time the parameters are updated. An extreme case is to randomly select a single sample in each iteration and use only that sample to update the parameters. This algorithm is called Stochastic Gradient Descent (SGD), as shown below.

For a randomly selected sample $i$:

$$W_j := W_j + \alpha \left(y^{(i)} - f_W(x^{(i)})\right) x_j^{(i)} \qquad (\text{for every } j)$$

In addition, we can also randomly select a mini-batch of training data each time and update the iteration parameters with this batch of data. This algorithm is called Mini-Batch SGD. Mini-batch SGD is a compromise between BGD and SGD. Mini-batch SGD reduces the noise generated by randomness in SGD while being more efficient than BGD.
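
The implementation in the next section only covers 'bgd' and 'sgd'; as a hypothetical sketch, a mini-batch update for this linear model could look like the following (all names and values are illustrative):

```python
import numpy as np

def minibatch_sgd_step(W, X, y, alpha, batch_size, rng):
    """One hypothetical mini-batch update for the linear model f(X) = X W."""
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    X_batch, y_batch = X[idx], y[idx]
    diff = X_batch @ W - y_batch        # predictions minus targets on the mini-batch
    grad = X_batch.T @ diff             # gradient of 1/2 * ||X_batch W - y_batch||^2
    return W - alpha * grad

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([[1.0], [2.0], [-1.0], [0.5]])
W = np.zeros((4, 1))
for _ in range(200):
    W = minibatch_sgd_step(W, X, y, alpha=0.01, batch_size=32, rng=rng)
print(W.ravel())  # should approach [1.0, 2.0, -1.0, 0.5]
```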

The gradient descent method only approximates the optimal solution; it has a speed advantage when the amount of data is large, but it may not reach the exact optimum. In many practical applications, although the solution found by gradient descent is only near the optimal point, it is good enough to meet the requirements. For these reasons, gradient descent, and especially stochastic gradient descent, is widely used for solving machine learning models. Besides the variants described above, gradient descent has many other variations.

NumPy implementation of gradient descent method

Talk is cheap, show me the code. Next, we use NumPy to implement a linear regression model with batch gradient descent and stochastic gradient descent. During the implementation we will find engineering problems that the formula derivation does not mention, for example intermediate values in the computation growing too large to be represented even by float64. Engineering implementation is the difference between theory and practice, and in fact it is often these engineering details that determine the usability of machine learning frameworks.

```python
import numpy as np

class LinearRegression:

    def __init__(self):
        # the weight vector
        self.W = None

    def train(self, X, y, method='bgd', learning_rate=1e-2, num_iters=100, verbose=False):
        """ Train linear regression using batch gradient descent or stochastic gradient descent Parameters ---------- X: training data, shape (num_of_samples x num_of_features), num_of_samples rows of training sample, each training sample has num_of_features-dimension features. y: target, shape (num_of_samples, 1). method: (string) 'bgd' for Batch Gradient Descent or 'sgd' for Stochastic Gradient Descent learning_rate: (float) learning rate or alpha num_iters: (integer) number of steps to iterate for optimization verbose: (boolean) if True, print out the progress Returns ------- losses_history: (list) of losses at each training iteration """
        num_of_samples, num_of_features = X.shape

        if self.W is None:
            # initialize weights with small random values
            # shape (num_of_features, 1)
            self.W = np.random.randn(num_of_features, 1) * 0.001
        losses_history = []

        for i in range(num_iters):

            if method == 'sgd':
                # randomly choose a sample
                idx = np.random.choice(num_of_samples)
                loss, grad = self.loss_and_gradient(X[idx, np.newaxis], y[idx, np.newaxis])
            else:
                loss, grad = self.loss_and_gradient(X, y)
            losses_history.append(loss)

            # Update weights using matrix computing (vectorized)
            self.W -= learning_rate * grad

            if verbose and i % (num_iters // 10) == 0:
                print('iteration %d / %d : loss %f' % (i, num_iters, loss))
        return losses_history


    def predict(self, X):
        """ Predict value of y using trained weights Parameters ---------- X: predict data, shape (num_of_samples x num_of_features), each row is a sample with num_of_features-dimension features. Returns ------- pred_ys: predicted data, shape (num_of_samples, 1) """
        pred_ys = X.dot(self.W)
        return pred_ys


    def loss_and_gradient(self, X, y, vectorized=True):
        """ Compute the loss and gradients Parameters ---------- The same as self.train function Returns ------- tuple of two items (loss, gradient) loss: (float) gradient: (array) with respect to self.W """
        if vectorized:
            return linear_loss_grad_vectorized(self.W, X, y)
        else:
            return linear_loss_grad_for_loop(self.W, X, y)


def linear_loss_grad_vectorized(W, X, y):
    """ Compute the loss and gradients with weights, vectorized version """
    # vectorized implementation 
    num_of_samples = X.shape[0]
    # (num_of_samples, num_of_features) * (num_of_features, 1)
    f_mat = X.dot(W)

    # (num_of_samples, 1) - (num_of_samples, 1)
    diff = f_mat - y 
    loss = 1.0 / 2 * np.sum(diff * diff)
    
    # {(num_of_samples, 1).T dot (num_of_samples, num_of_features)}.T
    gradient = ((diff.T).dot(X)).T

    return (loss, gradient)


def linear_loss_grad_for_loop(W, X, y):
    """ Compute the loss and gradients with weights, for loop version """
    
    # num_of_samples rows of training data
    num_of_samples = X.shape[0]
    
    # num_of_samples columns of features
    num_of_features = X.shape[1]
    
    loss = 0
    
    # shape (num_of_features, 1) same with W
    gradient = np.zeros_like(W) 
    
    for i in range(num_of_samples):
        X_i = X[i, :] # i-th sample from training data
        f = 0
        for j in range(num_of_features):
            f += X_i[j] * W[j, 0]
        diff = f - y[i, 0]
        loss += np.power(diff, 2)
        for j in range(num_of_features):
            gradient[j, 0] += diff * X_i[j]
            
    loss = 1.0 / 2 * loss

    return (loss, gradient)
```
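
The snippet below is a quick, illustrative way to exercise the LinearRegression class above on synthetic data (all values are arbitrary):

```python
import numpy as np

# synthetic data: 200 samples, 2 features plus a constant-1 column for the bias
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.hstack([X, np.ones((200, 1))])
y = X @ np.array([[2.0], [-3.0], [0.5]]) + rng.normal(scale=0.1, size=(200, 1))

model = LinearRegression()
losses = model.train(X, y, method='bgd', learning_rate=1e-3, num_iters=500, verbose=True)
print(model.W.ravel())          # should be close to [2.0, -3.0, 0.5]
print(model.predict(X[:3]))     # predictions for the first three samples
```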

References

  1. Andrew Ng: CS229 Lecture Notes
  2. Zhou Zhihua: Machine Learning
  3. houxianxu.github.io/2015/03/31/…