Contents

Gradient descent scenario hypothesis

Gradient descent

Differential

The gradient

Mathematical explanation of gradient descent algorithm

Example of gradient descent algorithm

Gradient descent of a function of one variable

Gradient descent of a multivariable function

Implementation of gradient descent algorithm

Coding time

Summary

Further reading



This article starts from a mountain-descent scenario to introduce the basic idea of the gradient descent algorithm, then explains the principle of the algorithm mathematically, and finally implements a simple example of gradient descent!

Gradient descent scenario hypothesis

The basic idea of gradient descent can be illustrated by a mountain-descent scenario. Imagine a person stuck on a mountain who needs to get down (i.e., find the lowest point of the mountain, the valley). The fog on the mountain is very heavy, so visibility is poor and the path down is uncertain; he must use the information around him to find his way down. This is where a gradient-descent-like strategy helps: based on his current position, he finds the steepest downhill direction and takes a step that way. (Similarly, if the goal were to climb to the top, he would step in the steepest uphill direction.) Repeating this procedure step after step, he eventually reaches the valley.

 

image.png

 

We can further assume that the steepest direction is not immediately visible to the naked eye but must be measured with a sophisticated instrument, and that this person carries such an instrument. Each measurement takes a while, so to reach the bottom of the mountain before the sun goes down, he should measure as rarely as possible. This is a trade-off: measure too often and the descent direction stays absolutely correct but progress is very slow; measure too rarely and he risks going off track. He therefore needs to find the right measurement frequency, one that keeps the direction roughly correct without costing too much time!

Gradient descent

The basic process of gradient descent is very similar to the descent scene.


First of all, we have a differentiable function, which represents the mountain. Our goal is to find the minimum of this function, which is the bottom of the mountain. Following the scenario, the fastest way down is to find the steepest direction at the current position and step that way; for the function, this means computing the gradient at the current point and then moving in the direction opposite to the gradient, which makes the function value fall fastest. This works because the direction of the gradient is the direction in which the function increases most rapidly (explained in more detail later). Repeating this method, taking the gradient again and again, we finally reach a local minimum, just as in the descent scenario, where taking the gradient plays the role of measuring the steepest direction. So why does the gradient point in the steepest direction? Let's start with differentials.

Differential

There are different ways of looking at what differentials mean, and the two most common ones are:

  • The slope of the tangent line at a point on the graph of the function

  • The rate of change of a function

Here are several examples of differentiation:

     

    image.png

The above examples are all single-variable differentials. When a function has more than one variable, we have multi-variable differentials, i.e., we differentiate with respect to each variable separately.

 

image.png
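As a quick sketch of multi-variable differentiation, partial derivatives can also be estimated numerically with central differences; the function f(x, y) = x² + y² below is an assumed example (its true partials are 2x and 2y):

```python
import numpy as np

def partial_derivatives(f, point, h=1e-6):
    """Estimate each partial derivative of f at `point` with central differences."""
    point = np.asarray(point, dtype=float)
    grads = np.zeros_like(point)
    for i in range(point.size):
        step = np.zeros_like(point)
        step[i] = h
        grads[i] = (f(point + step) - f(point - step)) / (2 * h)
    return grads

# Assumed example function: f(x, y) = x^2 + y^2, so the partials are (2x, 2y).
f = lambda p: p[0]**2 + p[1]**2
print(partial_derivatives(f, [1.0, 3.0]))  # approximately [2., 6.]
```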

The gradient

The gradient is simply the generalization of the derivative to multiple variables. Here’s an example:

 

image.png

As we can see, the gradient is just the differential with respect to each variable, separated by commas, and the whole thing is enclosed in ⟨⟩ to show that the gradient is actually a vector.

The gradient is a very important concept in calculus. Summing up what we have said about it:

  • In a function of one variable, the gradient is simply the derivative of the function, representing the slope of the tangent line of the function at a given point
  • In a multivariable function, a gradient is a vector that has a direction, and the direction of the gradient indicates the direction in which the function can rise fastest at a given point

That’s why we go to the trouble of finding the gradient! To get to the bottom of the mountain we must look for the steepest descent direction at each step, and the gradient tells us exactly that direction: the gradient points in the direction in which the function rises fastest at a given point, so the opposite direction of the gradient is the direction in which the function falls fastest, which is exactly what we need. So all we have to do is follow the negative gradient all the way to a local minimum!
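We can check this claim numerically. For an assumed function f(x, y) = x² + y², no unit direction decreases f faster (for a small step) than the direction of the negative gradient:

```python
import numpy as np

# Assumed example function f(x, y) = x^2 + y^2 with gradient (2x, 2y).
def f(p):
    return p[0]**2 + p[1]**2

def grad_f(p):
    return np.array([2 * p[0], 2 * p[1]])

p = np.array([1.0, 2.0])
step = 0.01
g = grad_f(p)
unit_neg_grad = -g / np.linalg.norm(g)

# Compare the decrease along the negative gradient with decreases
# along 100 random unit directions: none should beat it.
best_drop = f(p) - f(p + step * unit_neg_grad)
rng = np.random.default_rng(0)
for _ in range(100):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    drop = f(p) - f(p + step * d)
    assert drop <= best_drop + 1e-12
print("negative gradient gives the largest decrease:", best_drop)
```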

 

image.png

Mathematical explanation of gradient descent algorithm

Above we spent considerable space introducing the basic idea and scenario of the gradient descent algorithm, as well as the concept of the gradient. Now we explain the process and ideas of the gradient descent algorithm mathematically!

 

Θ¹ = Θ⁰ − α∇J(Θ⁰)

 

The meaning of this formula: J is a function of Θ, and we are currently at the point Θ⁰. From this point we want to walk to the minimum of J, that is, the bottom of the mountain. First we determine the direction of travel, which is the reverse of the gradient; then we take a step of size α and arrive at the point Θ¹!

 

image.png

Here are a few common questions about this formula:

  • What does α mean? α is called the learning rate or step size in gradient descent algorithms. It controls how far we move at each step, ensuring that we neither step too far and miss the lowest point, nor walk so slowly that the sun goes down before we reach the bottom. So the choice of α is often important in gradient descent! α should be neither too large nor too small: too small, and reaching the lowest point may be very slow; too large, and we may keep overshooting the lowest point!

image.png

  • Why do we multiply the gradient by a minus sign? The minus sign in front of the gradient means going in the opposite direction of the gradient! As we mentioned earlier, the gradient points in the direction in which the function rises fastest at the current point, while we want to go in the direction of fastest descent, which is of course the direction of the negative gradient, so we have to put a minus sign here.
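The effect of α can be sketched on an assumed one-dimensional objective f(θ) = θ² (derivative 2θ); the concrete values of α here are illustrative assumptions:

```python
# Sketch of how the learning rate alpha affects convergence, using the
# assumed objective f(theta) = theta**2, whose derivative is 2*theta.
def run(alpha, theta=1.0, steps=10):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # theta <- theta - alpha * f'(theta)
    return theta

print(run(0.1))  # small alpha: slow but steady progress toward 0
print(run(0.4))  # moderate alpha: converges quickly
print(run(1.1))  # too large: every step overshoots, the iterates diverge
```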

Example of gradient descent algorithm

Now that we have a basic understanding of how gradient descent works, let’s look at a few small examples, starting with a function of one variable.

Gradient descent of a function of one variable

Let’s say I have a function of one variable

 

image.png

 

The differential of the function is:

 

image.png

We initialize at the starting point:

image.png

 

The learning rate is:

 

image.png

According to the formula for gradient descent

image.png

We started the iterative calculation process of gradient descent:

image.png

After four iterations, that is, four steps, we have basically reached the lowest point of the function, the bottom of the hill:

image.png
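The one-variable walk above can be sketched in runnable code. The concrete function, starting point, and learning rate here are illustrative assumptions (f(θ) = θ², θ⁰ = 1, α = 0.4), chosen so that four steps land near the minimum:

```python
# Assumed objective f(theta) = theta**2 with derivative 2*theta,
# assumed start theta0 = 1 and learning rate alpha = 0.4.
theta, alpha = 1.0, 0.4
trajectory = [theta]
for _ in range(4):
    theta = theta - alpha * 2 * theta  # one gradient descent step
    trajectory.append(theta)
print(trajectory)  # each step shrinks theta toward the minimum at 0
```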

Gradient descent of a multivariable function

Let’s say we have a target function

 

image.png

 

Now we’re going to minimize this function by gradient descent. By inspection, the minimum is at the point (0, 0), but let’s compute our way there step by step with the gradient descent algorithm! We assume the initial starting point is:

 

image.png

The initial learning rate is:

image.png

 

The gradient of the function is:

 

image.png

Do multiple iterations:

image.png

We find that we’re almost at the minimum point of the function

image.png
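The two-variable descent can be sketched the same way; the objective, start point, and learning rate below are illustrative assumptions (J(Θ) = θ₁² + θ₂², start (1, 3), α = 0.1, gradient (2θ₁, 2θ₂)):

```python
import numpy as np

# Assumed objective J(theta) = theta1**2 + theta2**2 with gradient 2*theta,
# assumed start (1, 3) and learning rate alpha = 0.1.
theta = np.array([1.0, 3.0])
alpha = 0.1
for _ in range(100):
    theta = theta - alpha * 2 * theta  # step against the gradient
print(theta)  # both components shrink toward the minimum at (0, 0)
```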

Implementation of gradient descent algorithm

Let’s implement a simple gradient descent algorithm in Python. The scenario is a simple example of linear regression: suppose we now have a series of points, as shown below

image.png

 

We’re going to use gradient descent to fit a line to these points!

First, we need to define a cost function; here we use the mean squared error cost function:

J(Θ) = 1/(2M) · Σᵢ (hΘ(xⁱ) − yⁱ)²,  i = 1, …, M

 

In this formula:

  • M is the number of points in the data set

  • ½ is a constant factor: when we take the gradient, the exponent 2 cancels the ½, so no stray constant remains, and it does not affect the result

  • y is the true y-coordinate of each point in the data set

  • h is our prediction function: given each input x, it computes the predicted y value according to Θ, i.e.

     

    hΘ(x) = Θ0 + Θ1 · x

We can see from the cost function that it has two variables, so this is a multi-variable gradient descent problem. To obtain the gradient of the cost function, we differentiate with respect to the two variables separately:

 

∇J(Θ) = ⟨ ∂J/∂Θ0, ∂J/∂Θ1 ⟩
∂J/∂Θ0 = 1/M · Σᵢ (hΘ(xⁱ) − yⁱ)
∂J/∂Θ1 = 1/M · Σᵢ (hΘ(xⁱ) − yⁱ) · xⁱ

We have now defined the cost function, its gradient, and the form of the prediction function, so we’re ready to start coding. Before we do, note that for convenience we will convert all formulas into matrix form. Matrix computation in Python is very easy, and the code will be very concise.

To convert to a matrix calculation, we first observe the form of the prediction function:

 

hΘ(x) = Θ0 + Θ1 · x

 

It has two variables. To put this formula in matrix form, we can add one dimension to each point x, fixed at 1; this dimension will be multiplied by Θ0, which lets us unify the matrix calculation:

 

hΘ(x) = Θ0 · x0 + Θ1 · x1  (with x0 = 1)

Then we convert the cost function and the gradient into matrix vector multiplication

 

J(Θ) = 1/(2M) · (XΘ − y)ᵀ(XΘ − y)
∇J(Θ) = 1/M · Xᵀ(XΘ − y)

Coding time

First, we need to define the data set and the learning rate

import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m + 1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8,
    12, 11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01

Next we define the cost function and the gradient of the cost function in terms of matrix vectors

def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1./(2*m)) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1./m) * np.dot(np.transpose(X), diff)

Finally, the core part of the algorithm: the gradient descent iteration

def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta

When every component of the gradient is smaller than 1e-5, we have entered a relatively flat region, similar to the valley floor. Further iterations change little at this point, so we can exit the loop!

The complete code is shown below

import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m + 1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8,
    12, 11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01

def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1./(2*m)) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1./m) * np.dot(np.transpose(X), diff)

def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta

optimal = gradient_descent(X, y, alpha)
print('optimal:', optimal)
print('error function:', error_function(optimal, X, y)[0, 0])

Run the code and the result is as follows

 

image.png

The line fitted is as follows

 

image.png

Summary

At this point, we have introduced the basic idea of gradient descent and the algorithm's flow, and used Python to implement a simple gradient descent case that fits a straight line! Finally, let's return to the scenario proposed at the beginning: the person descending the mountain represents the back-propagation algorithm; the path down the mountain represents the parameters Θ that the algorithm keeps searching for; the steepest direction at the current point on the mountain is the direction of the gradient of the cost function at that point; the instrument used to observe the steepest direction is differentiation; and the time before the next observation is the learning rate α in our algorithm. You can see that the scenario matches the gradient descent algorithm very well!

Further reading

  • Gradient Descent lecture notes from UD262 Udacity Georgia Tech ML Course.
  • An overview of gradient descent optimization algorithms.