“This is the first day of my participation in the November Gengwen Challenge. Check out the event details: The Last Gengwen Challenge of 2021”

First imagine a scenario with me 🎞

There is a man 👨🦱 stuck on a mountain who wants to get down (that is, to find the lowest point of the mountain, the valley). However, the mountain is covered in heavy fog, so visibility is poor and the path down is uncertain. He must use the information around him to find his way down, one step at a time 🔎.

🎇 At this point, he can use the gradient descent algorithm to get down the mountain. Specifically, at each position he looks for the steepest downhill direction from where he stands and takes a small step that way; then, at the new position, he again finds the steepest direction and takes another small step. Step by small step, he walks safely to the foot of the mountain 🎉.

☝ That’s an intuitive idea of gradient descent.

As you can see, the key to a successful descent lies in finding the steepest direction at your current position and deciding how far each step should be. In gradient descent, these two quantities have more formal names: the gradient and the step size.

Next, let's study them together ~🧐

Concepts and algorithms

The basic process of gradient descent is very similar to the descent scene.

First of all, we have a differentiable function, which represents the mountain. Our goal is to find the minimum of this function, i.e., the bottom of the mountain. In the scenario above, the fastest way down is to find the steepest direction at the current location and then descend along it. For the function, this means finding the gradient at the given point and then moving in the opposite direction of the gradient (wondering why the opposite direction? Read on ~), which makes the function value drop fastest! After reaching the next point, we repeat the process of finding the steepest direction (computing the gradient), and finally we reach a local minimum.

It is just like going down the mountain: taking the gradient determines the steepest direction, which plays the role of judging direction in the scene.

1. The gradient

The gradient is the direction along which the function changes at the greatest rate, i.e., the steepest direction at the current position.

How do you calculate the gradient?

For a function of one variable, the gradient is just the derivative with respect to that variable. The gradient is a function of position, so each position has its own gradient.

For a function of several variables, take the partial derivative with respect to each variable separately, then combine them into a vector.
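As a quick sketch (the function name and example are my own, not from any particular library), each partial derivative can be approximated numerically with a central finite difference and the results combined into the gradient vector:

```python
def gradient(f, point, h=1e-6):
    """Approximate the gradient of f at `point`: one central
    finite-difference partial derivative per variable."""
    grad = []
    for i in range(len(point)):
        forward, backward = list(point), list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

# f(x, y) = x^2 + 3y has partial derivatives 2x and 3
f = lambda p: p[0] ** 2 + 3 * p[1]
print(gradient(f, [1.0, 2.0]))  # approximately [2.0, 3.0]
```

In practice one usually derives the partial derivatives analytically; the numerical version above is just to make the "one partial derivative per variable, combined into a vector" idea concrete.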

We need to get to the bottom of the mountain, so at each step we look for the steepest direction, and the gradient tells us exactly that. The direction of the gradient is the direction in which the function rises fastest at a given point, so the opposite direction of the gradient is the direction in which the function falls fastest, which is exactly what we need. This explains why we put a minus sign in front of the gradient in the formula.

This is an iterative formula:

$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla f(\theta^{(k)})$$

The meaning of this formula is: $f(\theta)$ is a function of $\theta$, our current position is the point $\theta^{(k)}$, and we want to go from this point to the minimum point of $f(\theta)$, which is the bottom of the mountain.

So first we determine the direction of travel, which is the reverse of the gradient, and then walk a step of length $\eta$ to reach the point $\theta^{(k+1)}$!

2. Step length

Step size is the variable that controls how far each step travels. Choosing the step size is very important in gradient descent! The step size can be neither too large nor too small: too small, and progress toward the lowest point is very slow; too large, and you may overshoot the lowest point!

Let's see: if $f(x) = x^2$, then $f'(x) = 2x$ and the iteration formula is

$$x^{(k+1)} = x^{(k)} - \eta \cdot 2x^{(k)}$$

Here is the same starting position, same downhill target, different step length 👇
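The effect can be sketched in a few lines of Python (the starting point, step counts, and step sizes below are illustrative choices):

```python
def descend(eta, x=5.0, steps=20):
    """Iterate x <- x - eta * f'(x) for f(x) = x^2, where f'(x) = 2x."""
    for _ in range(steps):
        x = x - eta * 2 * x  # equivalently x *= (1 - 2 * eta)
    return x

print(descend(0.01))  # too small: still far from 0 after 20 steps
print(descend(0.3))   # moderate: very close to the minimum at 0
print(descend(1.05))  # too large: overshoots and diverges
```

With a moderate step size the iterate shrinks geometrically toward 0; with a step size above 1, each step overshoots the bottom by more than the previous distance, so the iterates grow without bound.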

3. Detailed algorithm

Input: objective function $f(\theta)$, step size $\eta$, computation accuracy $\epsilon$;

Output: the minimum point $\theta^*$ of $f(\theta)$.

  • (1) Select an initial value $\theta^{(0)}$ and set $k=0$;
  • (2) Compute $f(\theta^{(k)})$;
  • (3) Compute the gradient $\nabla f(\theta^{(k)})$;
  • (4) Update the parameters with the iterative formula $\theta^{(k+1)} = \theta^{(k)} - \eta \nabla f(\theta^{(k)})$;

If $\|f(\theta^{(k+1)}) - f(\theta^{(k)})\| < \epsilon$, stop the iteration, let $\theta^* = \theta^{(k+1)}$, and output the result.

  • (5) Otherwise, let $k = k + 1$ and go back to step (3).

That is, continue iterating, updating parameters until the termination condition is satisfied.
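Steps (1)-(5) can be sketched roughly as follows (the helper name and the example objective $f(x) = (x-3)^2$ are my own illustrations, not a standard API):

```python
def gradient_descent(f, grad_f, theta0, eta=0.1, eps=1e-8, max_iter=10000):
    """Start at theta0, move against the gradient, and stop when
    |f(theta_new) - f(theta)| < eps, as in steps (1)-(5)."""
    theta = theta0
    for _ in range(max_iter):
        theta_new = theta - eta * grad_f(theta)
        if abs(f(theta_new) - f(theta)) < eps:
            return theta_new  # termination condition satisfied
        theta = theta_new
    return theta

# f(x) = (x - 3)^2 has its minimum at x = 3
theta_star = gradient_descent(lambda x: (x - 3) ** 2,
                              lambda x: 2 * (x - 3),
                              theta0=0.0)
print(theta_star)  # close to 3
```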

Gradient descent of a function of one variable

Gradient descent of a multivariable function

We find that we’re almost at the minimum point of the function
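A minimal multivariable sketch (the function $f(x, y) = x^2 + y^2$, whose gradient is $(2x, 2y)$ and whose minimum sits at the origin, is my own illustrative choice):

```python
def gradient_descent_2d(eta=0.1, steps=100):
    """Gradient descent on f(x, y) = x^2 + y^2: each iteration
    moves both coordinates against their partial derivatives."""
    x, y = 3.0, 4.0
    for _ in range(steps):
        x, y = x - eta * 2 * x, y - eta * 2 * y
    return x, y

print(gradient_descent_2d())  # both coordinates very close to 0
```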

What is gradient descent all about?

In machine learning, neural networks, and various mathematical modeling problems, there is always a loss function/cost function/error function 🤔

What gradient descent does is find the value of the independent variable at which the function attains its minimum (the value of x at the lowest point of the curve). Note: it finds the independent variable, not the function value ❕ An algorithm with different parameters produces different fitted curves, which means different errors. Gradient descent is the most commonly used method to optimize the loss function: it finds the algorithm parameters that minimize the error.

Gradient descent family

A few concepts to drill in first:

  • Hypothesis function: in supervised learning, the function used to fit the input samples.
  • Objective function/loss function/cost function/error function: a function that measures how well the model fits the data. Minimizing the loss function means the fit is best, and the corresponding model parameters are the optimal parameters.

For ease of understanding, take linear regression with only one feature as an example. Assume the hypothesis function of the linear regression is ($i = 1, 2, \ldots, m$, where $m$ is the number of samples):

$$h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$$

In linear regression, the loss function is usually the squared difference between the sample output and the hypothesis function. The corresponding mean square error loss function here is:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

1. Batch Gradient Descent (BGD)

Batch gradient descent is the most common form of gradient descent. Its specific approach is to use all the samples to update the parameters in each iteration.

(1) Take the partial derivatives of the loss function $J(\theta_0, \theta_1)$ (with $x_0^{(i)} = 1$):

$$\frac{\partial J(\theta_0, \theta_1)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

(2) Update the parameters in each iteration:

$$\theta_j := \theta_j - \eta \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

❗ As can be seen from the formula, the update contains a summation term: every iteration step uses all the data in the training set and processes every sample.
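To make the "sum over all $m$ samples per update" concrete, here is a toy sketch of batch gradient descent for one-feature linear regression (the data and hyperparameters are made up for illustration):

```python
# Toy data generated from y = 1 + 2x, so the fit should recover
# theta0 ≈ 1 and theta1 ≈ 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
m = len(xs)

theta0, theta1 = 0.0, 0.0
eta = 0.05
for _ in range(5000):
    # The sums run over ALL m samples: this is what makes it "batch".
    grad0 = sum((theta0 + theta1 * x - y) for x, y in zip(xs, ys)) / m
    grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
    theta0, theta1 = theta0 - eta * grad0, theta1 - eta * grad1

print(theta0, theta1)  # approximately 1.0 and 2.0
```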

Advantages

(1) One iteration computes over all samples, so matrix operations can be used, enabling parallelism. (2) The direction determined by the full data set better represents the sample population, so the iterates move more accurately toward the extremum. When the objective function is convex, the global optimum can be obtained.

Disadvantages

When the number of samples is large, each iteration must compute the gradient over all samples, so training is slow.

2. Stochastic Gradient Descent (SGD)

Unlike batch gradient descent, stochastic gradient descent uses a single sample to update the parameters in each iteration, which speeds up training.

The first steps of stochastic gradient descent are the same as in batch gradient descent: take the partial derivatives of the loss function $J(\theta_0, \theta_1)$ (with $x_0^{(i)} = 1$).

💥 The difference is in the parameter update at each iteration:

$$\theta_j := \theta_j - \eta \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

where $i$ is a single randomly chosen sample.

❗ As can be seen from the formula, there is no summation term in the update: each iteration step uses only one sample.
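A corresponding toy sketch of stochastic gradient descent on the same kind of data (all values illustrative); note that each update touches exactly one randomly chosen sample:

```python
import random

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # y = 1 + 2x

theta0, theta1 = 0.0, 0.0
eta = 0.02
random.seed(0)
for _ in range(20000):
    # No sum: each update uses a single randomly chosen sample.
    i = random.randrange(len(xs))
    err = theta0 + theta1 * xs[i] - ys[i]
    theta0, theta1 = theta0 - eta * err, theta1 - eta * err * xs[i]

print(theta0, theta1)  # hovers around 1.0 and 2.0
```

Each individual update is cheap but noisy; on real (noisy) data the iterates jitter around the optimum rather than settling exactly on it.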

Advantages

Because each iteration optimizes the loss on a single randomly chosen training example instead of all the training data, each parameter update is much faster.

Disadvantages

(1) Reduced accuracy: stochastic gradient descent cannot achieve linear convergence even when the objective function is strongly convex. (2) It may converge to a local optimum, because a single sample cannot represent the trend of all samples. (3) It is not easy to parallelize.

3. Mini-batch Gradient Descent (MBGD)

Mini-batch gradient descent is a compromise between stochastic gradient descent and batch gradient descent. The idea is to update the parameters with batch_size samples in each iteration. For example, set batch_size = 10 with sample count m = 1000; of course, the batch_size value can be adjusted according to the sample data.

The corresponding update formula is:

$$\theta_j := \theta_j - \eta \frac{1}{b} \sum_{i \in B} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

where $B$ is a mini-batch of $b = $ batch_size samples.
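And a toy sketch of the mini-batch version (the data, batch_size, and step size are illustrative); each update averages over batch_size samples rather than one or all:

```python
import random

xs = [float(x) for x in range(20)]   # y = 1 + 2x again, 20 samples
ys = [1.0 + 2.0 * x for x in xs]

theta0, theta1 = 0.0, 0.0
eta, batch_size = 0.002, 5
random.seed(0)
for _ in range(20000):
    batch = random.sample(range(len(xs)), batch_size)
    # The sums run over batch_size samples, not all 20 of them.
    grad0 = sum((theta0 + theta1 * xs[i] - ys[i]) for i in batch) / batch_size
    grad1 = sum((theta0 + theta1 * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
    theta0, theta1 = theta0 - eta * grad0, theta1 - eta * grad1

print(theta0, theta1)  # approximately 1.0 and 2.0
```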

Advantages

(1) Through matrix operations, optimizing the parameters on one batch at a time is not much slower than on a single sample. (2) Using a batch each time greatly reduces the number of iterations required for convergence, and the result is closer to that of full batch gradient descent. (3) It can be parallelized.

Disadvantages

(1) An improper choice of batch_size may cause problems.

References

Gradient descent and its implementation – Six Foot Tent

Summary of gradient descent – Liu Jianping Pinard