Table of Contents

- Gradient descent scenario hypothesis
- Gradient descent
  - Differential
  - The gradient
- Mathematical explanation of gradient descent algorithm
- Example of gradient descent algorithm
  - Gradient descent of a function of one variable
  - Gradient descent of a multivariable function
- Implementation of gradient descent algorithm
  - Coding time
- Summary
- Further reading
This article starts from a downhill scenario to introduce the basic idea of the gradient descent algorithm, then explains the principle of the algorithm mathematically, and finally implements a simple example of gradient descent.
Gradient descent scenario hypothesis
The basic idea of gradient descent can be illustrated by a downhill scenario. Suppose a person is stuck on a mountain and needs to get down (i.e., find the lowest point of the mountain, the valley). However, the fog on the mountain is heavy and visibility is poor, so the path down is uncertain; he must use the information around him to find a way down, step by step. Here he can apply the idea of gradient descent: based on his current position, find the steepest downhill direction, and take a step that way. (Likewise, if the goal were to climb to the top, he would step in the steepest uphill direction.) Repeating this procedure over and over, he eventually reaches the valley.
image.png
We can further assume that the steepest direction is not immediately obvious to the naked eye but must be measured with a sophisticated instrument, which this person happens to carry. Each measurement takes a while, so to reach the bottom before sunset he should measure as rarely as possible. This is a trade-off: measuring frequently keeps the direction of descent accurate but costs a lot of time, while measuring too rarely risks drifting off course. He therefore needs to choose a measurement frequency that keeps the direction correct without wasting too much time!
Gradient descent
The basic process of gradient descent is very similar to the descent scene.
First, we have a differentiable function, which represents the mountain. Our goal is to find its minimum value, i.e., the bottom of the mountain. Following the scenario, the fastest way down is to find the steepest direction at the current position and step that way; for a function, this means finding the gradient at a given point and then moving in the direction opposite to the gradient, which makes the function value fall fastest. Because the direction of the gradient is the direction in which the function increases most rapidly (explained in more detail later), repeatedly taking the gradient and stepping against it eventually brings us to a local minimum, just like walking down the mountain. Taking the gradient plays the same role as measuring the direction in the scenario. So why does the gradient point in the steepest direction? Let's start with differentials.
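The process just described can be sketched in a few lines of Python (a generic sketch under assumed names: `grad_f` stands for any function that returns the gradient, and the example function f(x) = x² is chosen purely for illustration):

```python
def descend(grad_f, x0, alpha=0.1, steps=100):
    """Repeatedly step in the direction opposite to the gradient."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad_f(x)  # move against the gradient
    return x

# Example: f(x) = x**2, whose gradient is 2*x; the minimum is at 0.
print(descend(lambda x: 2 * x, x0=10.0))  # approaches 0
```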
Differential
There are different ways to interpret what a differential means; the two most common are:

- the slope of the tangent line at a point on the graph of the function;
- the rate of change of a function.

A few examples of differentiation:
image.png
The examples above are all single-variable differentials. When a function has more than one variable, it has multivariable differentials, i.e., we differentiate with respect to each variable separately.
image.png
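These definitions are easy to check numerically. Below is a small sketch (the helper `partial_diff` and the example function f(x, y) = x² + y² are illustrative, not from the article) that approximates each partial derivative with a central difference:

```python
def partial_diff(f, point, i, h=1e-6):
    """Central-difference approximation of the i-th partial derivative."""
    lo, hi = list(point), list(point)
    lo[i] -= h
    hi[i] += h
    return (f(*hi) - f(*lo)) / (2 * h)

f = lambda x, y: x**2 + y**2
# Analytically: df/dx = 2x and df/dy = 2y, so at (1, 2) we expect (2, 4).
print(partial_diff(f, (1.0, 2.0), 0), partial_diff(f, (1.0, 2.0), 1))
```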
The gradient
Gradient is really just a generalization of multivariable differentiation. Here’s an example:
image.png
As we can see, the gradient is just the differential with respect to each variable, separated by commas and enclosed in <> to indicate that the gradient is in fact a vector.
The gradient is a very important concept in calculus:
- In a function of one variable, the gradient is simply the derivative of the function, representing the slope of the tangent line of the function at a given point
- In a multivariable function, a gradient is a vector that has a direction, and the direction of the gradient indicates the direction in which the function can rise fastest at a given point
That is why we go to such lengths to compute the gradient! To reach the bottom of the mountain, we need to check the steepest direction at every step, and the gradient tells us exactly that. Since the direction of the gradient is the direction in which the function rises fastest at a given point, the opposite direction is the one in which the function falls fastest, which is exactly what we need. So we simply keep walking against the gradient until we arrive at a local minimum!
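This claim can be sanity-checked numerically: among small steps of equal length in many directions, the step opposite the gradient should lower the function the most. A sketch, using the assumed example function f(x, y) = x² + y²:

```python
import math

f = lambda x, y: x**2 + y**2          # an assumed example function
grad = lambda x, y: (2 * x, 2 * y)    # its analytic gradient

x, y = 3.0, 4.0
gx, gy = grad(x, y)
step = 0.1

# Try a small step in each of 360 directions and find the one
# that lowers f the most.
best = min(range(360),
           key=lambda d: f(x + step * math.cos(math.radians(d)),
                           y + step * math.sin(math.radians(d))))

# The direction exactly opposite the gradient, in degrees:
opposite = math.degrees(math.atan2(-gy, -gx)) % 360
print(best, round(opposite))  # the two directions coincide
```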
image.png
Mathematical explanation of gradient descent algorithm
Having spent considerable space on the basic idea and scenario hypothesis of gradient descent, and on the concept of the gradient, we now explain the process and ideas of the gradient descent algorithm mathematically!
$$\Theta^1 = \Theta^0 - \alpha \nabla J(\Theta^0)$$
The meaning of this formula: J is a function of Θ, and we are currently at the point Θ⁰; from this point we want to travel to the minimum of J, the bottom of the mountain. First we determine the direction of travel, which is the reverse of the gradient; then we take a step of size α and arrive at the point Θ¹!
image.png
Here are a few common questions about this formula:
- What does α mean? α is called the learning rate or step size in gradient descent. It controls the distance covered by each step, making sure we neither step so far that we miss the lowest point nor walk so slowly that the sun sets before we reach the bottom of the hill. The choice of α is therefore important: if it is too small, convergence to the lowest point may be very slow; if it is too large, we may overshoot the lowest point!
image.png
- Why multiply the gradient by a minus sign? The minus sign in front of the gradient means moving in the direction opposite to the gradient! As mentioned earlier, the direction of the gradient is the direction in which the function rises fastest at the current point, and since we want to move in the direction of fastest descent, we must go along the negative gradient, hence the minus sign.
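To make the role of α concrete, here is a small sketch applying the update θ ← θ − α·J′(θ) to the illustrative function J(θ) = θ² with three different step sizes (all values are assumptions chosen for demonstration):

```python
def run(alpha, theta=1.0, steps=10):
    """Apply theta <- theta - alpha * J'(theta) for J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # J'(theta) = 2 * theta
    return theta

print(run(0.01))   # too small: after 10 steps theta has barely moved from 1
print(run(0.4))    # well chosen: theta is already very close to 0
print(run(1.1))    # too large: every step overshoots and |theta| grows
```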
Example of gradient descent algorithm
Now that we have a basic understanding of how gradient descent works, let's walk through a few small examples, starting with a function of one variable.
Gradient descent of a function of one variable
Let’s say I have a function of one variable
image.png
Its differential is
image.png
We initialize at the starting point
image.png
The learning rate is
image.png
According to the formula for gradient descent
image.png
We started the iterative calculation process of gradient descent:
image.png
After four iterations, i.e., four steps, we have essentially reached the lowest point of the function, the bottom of the hill
image.png
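The iteration images are not reproduced here, but the process is easy to replay in code. As an illustration, assume J(θ) = θ² (so J′(θ) = 2θ), a starting point of θ = 1, and a learning rate of 0.4 (hypothetical values, since the figures above are not shown):

```python
# Replay of the iteration with assumed values: J(theta) = theta**2,
# J'(theta) = 2 * theta, starting point 1.0, learning rate 0.4.
theta, alpha = 1.0, 0.4
trajectory = [theta]
for _ in range(4):                       # four steps, as in the text
    theta = theta - alpha * 2 * theta    # theta <- theta - alpha * J'(theta)
    trajectory.append(theta)
print(trajectory)                        # theta shrinks toward 0 at each step
```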
Gradient descent of a multivariable function
Let’s say we have a target function
image.png
Now we minimize this function by gradient descent. By inspection, the minimum is at the point (0, 0), but let's compute it step by step with the gradient descent algorithm. We assume the initial starting point is:
image.png
The initial learning rate is:
image.png
The gradient of the function is:
image.png
Do multiple iterations:
image.png
We find that we’re almost at the minimum point of the function
image.png
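As with the single-variable case, the iteration can be replayed in code. A sketch assuming the target function J(Θ) = θ₁² + θ₂², a starting point of (1, 3), and a learning rate of 0.1 (hypothetical stand-ins for the values in the figures):

```python
import numpy as np

theta = np.array([1.0, 3.0])    # assumed starting point
alpha = 0.1                     # assumed learning rate
grad = lambda t: 2 * t          # gradient of J = theta1**2 + theta2**2

for _ in range(100):
    theta = theta - alpha * grad(theta)
print(theta)                    # both components approach the minimum (0, 0)
```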
Implementation of gradient descent algorithm
Let’s implement a simple gradient descent algorithm in Python. The scenario is a simple example of linear regression: suppose we now have a series of points, as shown below
image.png
We’re going to use gradient descent to fit this line!
First, we need to define a cost function, here we use the mean square error cost function
$$J(\Theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2$$
In this formula:

- m is the number of points in the data set;
- ½ is a constant included so that when we take the gradient, the exponent 2 cancels the ½, leaving no extra constant factor; it does not affect the result;
- y is the true y-coordinate of each point in the data set;
- h is our prediction function, which computes a predicted y value from each input x according to Θ, i.e.
$$h_\Theta(x^{(i)}) = \Theta_0 + \Theta_1 x^{(i)}$$
We can see that the cost function has two variables, Θ₀ and Θ₁, so this is a multivariable gradient descent problem. To find the gradient of the cost function, we differentiate with respect to the two variables separately:
$$\frac{\partial J}{\partial \Theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\Theta(x^{(i)}) - y^{(i)}\right), \qquad \frac{\partial J}{\partial \Theta_1} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\Theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
We have now defined the cost function, its gradient, and the form of the prediction function, and we are ready to start coding. Before that, note that for convenience we will convert all formulas to matrix form; matrix computation in Python is straightforward and keeps the code concise.
To translate into a matrix calculation, we observe the form of the prediction function
image.png
We have two variables. To express this formula as a matrix, we add an extra dimension to each point x, fixed at 1, which is multiplied by Θ₀. This lets us handle the computation as a single, uniform matrix operation:
$$h_\Theta(x^{(i)}) = \Theta_0 x_0^{(i)} + \Theta_1 x_1^{(i)} = \Theta^T x^{(i)}, \quad x_0^{(i)} = 1$$
Then we convert the cost function and the gradient into matrix vector multiplication
$$J(\Theta) = \frac{1}{2m}(X\Theta - y)^T(X\Theta - y), \qquad \nabla J(\Theta) = \frac{1}{m}X^T(X\Theta - y)$$
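Before coding, it is worth verifying that the matrix form of the gradient really equals the per-parameter sums. A quick sketch on random data (the array names mirror those used below, but the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
X = np.hstack((np.ones((m, 1)), rng.random((m, 1))))  # x0 fixed at 1
y = rng.random((m, 1))
theta = np.array([[0.5], [2.0]])

diff = X @ theta - y
matrix_grad = (1.0 / m) * X.T @ diff                  # (1/m) X^T (X theta - y)
loop_grad = np.array([[np.mean(diff[:, 0])],
                      [np.mean(diff[:, 0] * X[:, 1])]])
print(np.allclose(matrix_grad, loop_grad))            # the two forms agree
```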
Coding time
First, we need to define the data set and the learning rate
import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m + 1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate.
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8, 12,
    11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01
Next we define the cost function and the gradient of the cost function in terms of matrix vectors
def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / (2 * m)) * np.dot(np.transpose(diff), diff)
def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / m) * np.dot(np.transpose(X), diff)
Finally, the core part of the algorithm is gradient descent iteration
def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta
When every component of the gradient is smaller than 1e-5 in absolute value, the function has entered a relatively flat region, analogous to the valley floor. Further iterations change little at that point, so we can exit the loop!
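In practice it is also common to cap the number of iterations, in case a poorly chosen α keeps the gradient from ever shrinking below the threshold. A hedged variant of the loop (shown on a one-dimensional example function J(θ) = θ², an assumption for illustration):

```python
import numpy as np

def gradient_descent_capped(grad_fn, theta, alpha, tol=1e-5, max_iter=100000):
    """Stop when every gradient component is small OR after max_iter steps."""
    for _ in range(max_iter):
        gradient = grad_fn(theta)
        if np.all(np.absolute(gradient) <= tol):
            break
        theta = theta - alpha * gradient
    return theta

# Illustrative run on J(theta) = theta**2, whose gradient is 2 * theta.
print(gradient_descent_capped(lambda t: 2 * t, np.array([1.0]), alpha=0.4))
```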
The complete code is shown below
import numpy as np

# Size of the points dataset.
m = 20

# Points x-coordinate and dummy value (x0, x1).
X0 = np.ones((m, 1))
X1 = np.arange(1, m + 1).reshape(m, 1)
X = np.hstack((X0, X1))

# Points y-coordinate.
y = np.array([
    3, 4, 5, 5, 2, 4, 7, 8, 11, 8, 12,
    11, 13, 13, 16, 17, 18, 17, 19, 21
]).reshape(m, 1)

# The Learning Rate alpha.
alpha = 0.01

def error_function(theta, X, y):
    '''Error function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / (2 * m)) * np.dot(np.transpose(diff), diff)

def gradient_function(theta, X, y):
    '''Gradient of the function J definition.'''
    diff = np.dot(X, theta) - y
    return (1. / m) * np.dot(np.transpose(X), diff)

def gradient_descent(X, y, alpha):
    '''Perform gradient descent.'''
    theta = np.array([1, 1]).reshape(2, 1)
    gradient = gradient_function(theta, X, y)
    while not np.all(np.absolute(gradient) <= 1e-5):
        theta = theta - alpha * gradient
        gradient = gradient_function(theta, X, y)
    return theta

optimal = gradient_descent(X, y, alpha)
print('optimal:', optimal)
print('error function:', error_function(optimal, X, y)[0, 0])
Run the code and the result is as follows
image.png
The line fitted is as follows
image.png
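The figures themselves are not reproduced here, but a plot like the fitted line above could be generated with matplotlib along these lines (a sketch on synthetic stand-in data shaped like the article's X, y, and optimal arrays):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')            # render without a display
import matplotlib.pyplot as plt

# Synthetic stand-ins shaped like the article's arrays (assumptions).
m = 20
X = np.hstack((np.ones((m, 1)), np.arange(1, m + 1).reshape(m, 1)))
optimal = np.array([[0.5], [1.0]])            # [[intercept], [slope]]
rng = np.random.default_rng(0)
y = X @ optimal + rng.normal(scale=1.0, size=(m, 1))

plt.scatter(X[:, 1], y[:, 0], label='data points')
plt.plot(X[:, 1], (X @ optimal)[:, 0], 'r', label='fitted line')
plt.legend()
plt.savefig('fit.png')
```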
Summary
At this point, we have introduced the basic idea of gradient descent and its algorithm flow, and used Python to implement a simple gradient descent case that fits a straight line! Finally, let us return to the scenario proposed at the beginning: the person going down the mountain represents the back-propagation algorithm; the path down the mountain represents the parameters Θ that the algorithm keeps searching for; the steepest direction at the current point on the mountain is the direction of the gradient of the cost function at that point; the instrument used to observe the steepest direction is differentiation; and the time before the next observation is determined by the learning rate α in our algorithm. As you can see, the scenario hypothesis and the gradient descent algorithm match very well!
Further reading
- Gradient Descent lecture notes from UD262 Udacity Georgia Tech ML Course.
- An overview of gradient descent optimization algorithms.