Red Stone’s personal website: Redstonewill.com
What is a gradient?
We are all familiar with the gradient descent algorithm: whether in linear regression, logistic regression, or neural networks, gradient descent is used. Let's start with an intuitive explanation of the algorithm:
Suppose we are standing somewhere on the slopes of Huangshan Mountain, surrounded by endless hills, with no obvious way down. So we decide to take it one step at a time: from the current position, take one step in the direction of steepest descent, then from the new position take another step in its direction of steepest descent, and so on, step by step, until we feel we have reached the foot of the mountain. The direction of steepest descent at each position is the negative direction of the gradient.
First of all, what is a gradient? In plain language, the gradient of a function at a point is the direction along which the directional derivative attains its maximum value; in other words, it is the derivative of the function at the current position:

$\nabla = \frac{df(\theta)}{d\theta}$

In the formula above, $\theta$ is the independent variable, $f(\theta)$ is a function of $\theta$, and $\nabla$ denotes the gradient.
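As a quick illustration (a sketch of my own, not part of the original article), the gradient at a point can be estimated numerically with central finite differences; the test function and the step size `eps` below are arbitrary choices:

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-6):
    """Estimate the gradient of f at theta using central finite differences."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Example: f(theta) = theta_0^2 + 3 * theta_1^2 has true gradient (2*theta_0, 6*theta_1).
f = lambda t: t[0] ** 2 + 3 * t[1] ** 2
print(numerical_gradient(f, [1.0, 2.0]))  # approximately [ 2. 12.]
```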
Gradient descent algorithm
If the function $f(\theta)$ is convex, it can be minimized with the gradient descent algorithm. The familiar update formula for gradient descent is:

$\theta = \theta_0 - \eta\nabla f(\theta_0)$

Here, $\theta_0$ is the current value of the parameter, i.e., the current position on the way down the hill; $\eta$ is the learning rate, i.e., the length of each small step downhill; and $\theta$ is the updated parameter, i.e., the position after taking that small step.
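To make the update rule concrete, here is a minimal sketch of iterating $\theta = \theta_0 - \eta\nabla f(\theta_0)$; the quadratic objective, the learning rate of 0.1, and the iteration count are illustrative assumptions, not part of the original derivation:

```python
import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, num_steps=100):
    """Repeatedly apply theta <- theta - eta * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_f(theta)  # move one small step against the gradient
    return theta

# Example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta (minimum at the origin).
grad_f = lambda t: 2 * t
print(gradient_descent(grad_f, [3.0, -4.0]))  # close to [0. 0.]
```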
The gradient descent formula is very simple! But why is "walking opposite to the gradient" the steepest way down? Why is the direction of fastest local descent the negative gradient direction? Many readers may not be entirely clear on this. That's fine; below I explain the mathematical derivation of the gradient descent algorithm in plain language.
First-order Taylor expansion
Here we need a little math: the Taylor expansion. In a nutshell, the Taylor expansion rests on the idea of locally approximating a function by a straight line. Take the first-order Taylor expansion as an example:

$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)\cdot\nabla f(\theta_0)$
Don’t understand the formula? It doesn’t matter. Let me illustrate it with the picture below.
A short segment of the convex function $f(\theta)$ over $[\theta_0, \theta]$ is shown by the black curve above. Its value can be found using the idea of linear approximation, shown by the red line above, whose slope equals the derivative of $f(\theta)$ at $\theta_0$. From the equation of that line, the approximate expression for $f(\theta)$ follows easily:

$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)\cdot\nabla f(\theta_0)$
This is precisely the first-order Taylor expansion; the key mathematical idea is approximating a curved function by a straight line.
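A small numerical check (my own illustration, with $f(\theta) = \theta^3$ chosen arbitrarily) shows that the linear approximation is accurate close to $\theta_0$ and degrades as $\theta$ moves away:

```python
# Compare f(theta) with its first-order Taylor approximation around theta0.
f = lambda t: t ** 3
df = lambda t: 3 * t ** 2  # derivative of f

theta0 = 1.0
for delta in (0.01, 0.1, 0.5):
    theta = theta0 + delta
    exact = f(theta)
    approx = f(theta0) + (theta - theta0) * df(theta0)  # linear approximation
    print(f"delta={delta}: exact={exact:.4f}, approx={approx:.4f}, error={exact - approx:.4f}")
```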
Mathematical principles of gradient descent
Now that we know the first-order Taylor expansion, here comes the key part: let's see how the gradient descent algorithm is derived from it.
First, write down the first-order Taylor expansion:

$f(\theta) \approx f(\theta_0) + (\theta - \theta_0)\cdot\nabla f(\theta_0)$
Here, $\theta - \theta_0$ is a tiny vector whose magnitude is the step length $\eta$ discussed earlier, analogous to one small step taken on the way down the mountain. $\eta$ is a scalar, and the unit vector along $\theta - \theta_0$ is denoted by $v$. Then $\theta - \theta_0$ can be written as:

$\theta - \theta_0 = \eta v$
Note in particular that $\theta - \theta_0$ must not be too large; otherwise the linear approximation is no longer accurate and the first-order Taylor expansion breaks down. After substitution, $f(\theta)$ becomes:

$f(\theta) \approx f(\theta_0) + \eta v\cdot\nabla f(\theta_0)$
Here is the key point: the purpose of local descent is that every update of $\theta$ makes the function value $f(\theta)$ smaller. In other words, in the equation above we want $f(\theta) < f(\theta_0)$. Therefore:

$f(\theta) - f(\theta_0) \approx \eta v\cdot\nabla f(\theta_0) < 0$
Since $\eta$ is a scalar and is generally set to a positive value, it can be dropped, and the inequality becomes:

$v\cdot\nabla f(\theta_0) < 0$
This inequality is very important! Both $v$ and $\nabla f(\theta_0)$ are vectors: $\nabla f(\theta_0)$ is the gradient direction at the current position, and $v$ is the unit vector of the next step, which is what we need to solve for. Once $v$ is known, it gives us the basis for determining $\theta$ via $\theta - \theta_0 = \eta v$.
We want the dot product of these two vectors to be less than zero. Let's first recall what the dot product of two vectors involves:
$A$ and $B$ are vectors, and $\alpha$ is the angle between them. Their dot product is:

$A\cdot B = \|A\|\cdot\|B\|\cdot\cos(\alpha)$
$\|A\|$ and $\|B\|$ are scalars. With $\|A\|$ and $\|B\|$ fixed, the dot product of $A$ and $B$ is minimized (most negative) as long as $\cos(\alpha) = -1$, i.e., when $A$ and $B$ point in exactly opposite directions.
Accordingly, when $v$ and $\nabla f(\theta_0)$ are opposite to each other, i.e., $v$ points in the negative direction of the current gradient, $v\cdot\nabla f(\theta_0)$ is made as small as possible, so $v$ is the direction of fastest local descent.
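This claim is easy to verify numerically: among unit vectors $v$, the dot product $v\cdot\nabla f(\theta_0)$ is smallest when $v$ is the negative of the normalized gradient. The sketch below (my own illustration, using an arbitrary gradient vector) compares random unit directions with the negative gradient direction:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([2.0, -1.0, 0.5])  # an arbitrary gradient vector for illustration

# The negative normalized gradient: the direction claimed to minimize the dot product.
v_best = -grad / np.linalg.norm(grad)
print("negative gradient direction:", v_best @ grad)  # equals -||grad||

# Random unit vectors never do better than -||grad||.
for _ in range(5):
    v = rng.normal(size=3)
    v = v / np.linalg.norm(v)
    print("random unit direction:    ", v @ grad)
```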
Knowing that $v$ points in the direction opposite to $\nabla f(\theta_0)$, we can write directly:

$v = -\frac{\nabla f(\theta_0)}{\|\nabla f(\theta_0)\|}$
We divide by the norm $\|\nabla f(\theta_0)\|$ because $v$ is a unit vector.
Having found the optimal solution $v$, substitute it into $\theta - \theta_0 = \eta v$ to obtain:

$\theta = \theta_0 - \eta\frac{\nabla f(\theta_0)}{\|\nabla f(\theta_0)\|}$
In general, since $\|\nabla f(\theta_0)\|$ is a scalar, it can be absorbed into the step factor $\eta$, which simplifies the update to:

$\theta = \theta_0 - \eta\nabla f(\theta_0)$
Thus we have derived the update expression for $\theta$ in the gradient descent algorithm.
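Putting the two forms side by side, the sketch below (my own illustration, with an arbitrary quadratic objective and hyperparameters) runs both the normalized update $\theta = \theta_0 - \eta\nabla f(\theta_0)/\|\nabla f(\theta_0)\|$ and the simplified update $\theta = \theta_0 - \eta\nabla f(\theta_0)$:

```python
import numpy as np

grad_f = lambda t: 2 * t            # gradient of f(theta) = ||theta||^2
eta = 0.1
theta_norm = np.array([1.0, 2.0])   # start for the normalized update
theta_plain = np.array([1.0, 2.0])  # start for the simplified update

for _ in range(30):
    g = grad_f(theta_norm)
    theta_norm = theta_norm - eta * g / np.linalg.norm(g)  # fixed step length eta
    theta_plain = theta_plain - eta * grad_f(theta_plain)  # step length shrinks with the gradient

# The normalized form takes fixed-length steps, so it ends up hovering within one
# step length of the minimum; the simplified form converges smoothly toward it.
print("normalized update:", theta_norm)
print("simplified update:", theta_plain)
```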
Conclusion
We have explained the mathematical principle behind the gradient descent algorithm using the first-order Taylor expansion, linear approximation, and minimizing a vector dot product. You may use gradient descent all the time, yet not have known exactly where it comes from. Did you gain anything from reading this article?
The simple gradient descent algorithm: do you really understand it?
For more machine learning resources, please follow AI Youdao (ID: Redstonewill)