Optimization algorithms made simple: from gradient descent and stochastic gradient descent to Newton's method and conjugate gradient


1 What is gradient descent

Gradient descent is an algorithm that is often seen in optimization problems in machine learning. What exactly is gradient descent?

Wikipedia defines gradient descent as a first-order optimization algorithm, commonly known as the steepest descent method. To find a local minimum of a function with gradient descent, one iteratively steps from the current point in the direction opposite to the gradient (or an approximate gradient), using a specified step size. Stepping in the positive gradient direction instead approaches a local maximum of the function; that process is called gradient ascent.

So what, then, is a gradient? Avoiding the formalities: for a real-valued function of one variable, the gradient is simply the derivative; for a linear function, it is the slope of the line.

1.1 Example of gradient descent method

For example, suppose we want to build a system that estimates the value of a house. What factors determine or influence that value? The area, the layout (how many rooms and halls), the lot, the orientation, and so on; these variables that affect the house value are called features. Here, for simplicity, we assume the value depends on only one variable: the area of the house.

Suppose we have the following housing sales data:

| Area (m²) | Selling price (ten thousand yuan) |
| --- | --- |
| 123 | 250 |
| 150 | 320 |
| 87 | 160 |
| 102 | 220 |
| … | … |

As an aside: five years ago this housing price data might have bought a house near the 5th Ring Road of the imperial capital, but today it would only buy a house in a second-tier city.

We can draw a graph with the area of the house on the X-axis and its selling price on the Y-axis, as follows:

What if a new house area comes along, one that is not in our records of sale prices? What do we do?

We can use a curve to fit the data as accurately as possible, and then if there is a new input area, we can return the value corresponding to this point on the curve. If a straight line was used to fit the housing price data, it might look something like this:

And the green dots are the ones we want to predict.
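As a concrete sketch of such a straight-line fit, here is a minimal example (using numpy's least-squares `polyfit`; the variable names and the degree-1 polynomial are illustrative choices, not from the original) on the four sample rows from the table above:

```python
import numpy as np

# Sample data from the table above (area in m^2, price in ten thousand yuan)
areas = np.array([123.0, 150.0, 87.0, 102.0])
prices = np.array([250.0, 320.0, 160.0, 220.0])

# Fit a straight line price ~ slope * area + intercept by least squares
slope, intercept = np.polyfit(areas, prices, deg=1)

def predict(area):
    """Return the price the fitted line predicts for a new area."""
    return slope * area + intercept

print(predict(110.0))  # estimated price for a 110 m^2 house
```

For a new area that is not in the records, `predict` returns the value of the corresponding point on the fitted line, exactly as described above.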

For mathematical modeling, some concepts and common symbols are given first.

  • Housing sales records – the training set (training data), the input to our process, usually denoted X
  • Housing sale prices – the output data, usually denoted Y
  • The fitted function (also called the hypothesis or model) is usually written y = h(x)
  • A training example consists of a pair of input data and output data
  • The dimension of the input data (the number of features, #features) is n

A typical machine learning process then looks like this: given the input data, our algorithm runs through a series of steps to obtain an estimated function; this function can produce estimates for new data it has never seen before. This is also known as building a model.

We use x1, x2, …, xn to denote the components of the feature vector, for example x1 = area of the house, x2 = orientation of the house, and so on. We can then form an estimation function:

h(x) = θ0 + θ1·x1 + θ2·x2 + … + θn·xn

θ is called a parameter here, where it means to adjust the influence of each component in the feature, that is, whether the area of the house or the lot of the house is more important.

If we set x0 = 1, we can write this compactly in vector form: h(x) = θᵀx.

Our program also needs a mechanism to evaluate whether a given θ is good, so we need to evaluate our h function. This evaluation function is generally called the loss function; it describes how poorly h fits the data. Here we call it the J function.

In other words, we take as the loss function the sum of the squared differences between our estimate h(x^(i)) and the true value y^(i), with a coefficient of 1/2 for convenience when taking derivatives (the coefficient cancels out during differentiation):

J(θ) = (1/2) Σᵢ (h(x^(i)) − y^(i))²

There are many ways to adjust θ so that J(θ) is minimized, including least squares, which is a purely analytical method, and gradient descent.
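As a sketch of the definitions so far (the function and variable names here are illustrative, not from the original), the hypothesis h and the loss J can be written in a few lines of numpy:

```python
import numpy as np

def hypothesis(theta, X):
    # h(x) = theta^T x, computed for every sample row of X at once
    return X @ theta

def loss(theta, X, y):
    # J(theta) = 1/2 * sum of squared errors; the 1/2 cancels when differentiating
    errors = hypothesis(theta, X) - y
    return 0.5 * np.sum(errors ** 2)

# Tiny example: X carries a leading column of ones, playing the role of x0 = 1
X = np.array([[1.0, 123.0], [1.0, 150.0]])
y = np.array([250.0, 320.0])
theta = np.zeros(2)
print(loss(theta, X, y))  # 0.5 * (250^2 + 320^2) = 82450.0
```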

1.2 Gradient descent algorithm process

The algorithm flow of gradient descent method is as follows:

1) First assign θ an initial value, either at random or as the all-zero vector.

2) Change the value of θ so that J(θ) decreases in the direction of gradient descent.

To make the description clearer, the following figure is given:

This is a surface plot of the error function J(θ): red marks high values of J(θ), and we want to drive J(θ) as low as possible, down to the dark blue. θ0 and θ1 represent the two dimensions of the vector θ.

The first step of gradient descent, as mentioned above, is to give θ an initial value; suppose the randomly chosen initial value corresponds to the point marked with a cross on the graph.

Then we adjust θ in accordance with the direction of gradient descent, which will make J(θ) change to a lower direction, as shown in the figure below. The algorithm will end with θ falling until it can no longer fall.

Of course, the final point of possible gradient descent is not the global minimum, that is, it may also be a local minimum, as shown in the figure below:

The figure above shows a local minimum reached by re-selecting the initial point. It appears that whether our algorithm falls into a local minimum depends to a large extent on the choice of the initial point.


Let me describe the gradient descent process with an example, by taking the partial derivative of our function J(θ):

∂J(θ)/∂θj = Σᵢ (h(x^(i)) − y^(i)) · xj^(i)

Here is the update rule, in which θj moves along the direction of steepest descent:

θj := θj − α · ∂J(θ)/∂θj

The θj on the right-hand side is the value before the update, the subtracted term is the decrease along the gradient direction, and α is the step size (learning rate), which controls how far each update moves along the gradient.

One important point deserves attention: the gradient direction. For a vector θ, each component θj yields one partial derivative, and together they determine an overall direction. By changing θ along the direction of fastest decrease at each update, we can reach a minimum point, whether local or global.

In simpler mathematical terms, the description looks like this:
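The two steps above (initialize θ, then repeatedly move against the gradient) can be sketched as a small batch gradient descent loop; the learning rate and iteration count below are arbitrary illustrative choices:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Minimize J(theta) = 1/2 * sum((X @ theta - y) ** 2) by gradient descent."""
    theta = np.zeros(X.shape[1])      # step 1: start from the all-zero vector
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)  # gradient of J at the current theta
        theta -= alpha * grad         # step 2: move against the gradient
    return theta

# Fit y = 2x on a toy data set; theta should approach [0, 2]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(X, y, alpha=0.05, iters=2000)
print(theta)  # close to [0, 2]
```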

1.3 Is the gradient descent method the fastest?

The negative gradient is not necessarily the fastest descent direction overall; it is only the fastest descent direction of the objective function within the tangent plane at the current point (for higher-dimensional problems, "plane" is loose language). In practice, the Newton direction (which takes the Hessian matrix into account) is generally considered the fastest descent direction and can achieve superlinear convergence, whereas the convergence rate of gradient descent is generally linear, or even sublinear for some problems with complex constraints. Gradient descent is usually explained with a downhill analogy. Suppose you are at the top of a mountain and must reach the lake at the bottom of the valley (its lowest point). The trouble is, you are blindfolded and cannot see where you are going; in other words, you can no longer tell at a glance which path descends fastest, as shown below:

The best you can do is take one step at a time. Probe the terrain around you with your foot, and feel which direction drops the most. In other words, at each position, compute the gradient at that point and take a step in the negative gradient direction (the steepest descent from where you stand). Step by step, each move chooses the steepest, fastest downhill direction from the current position, until you feel you have reached the foot of the mountain. Of course, you may not actually reach the foot of the mountain, only a local low point; that is, gradient descent may not find a global optimum, only a local one. If the loss function is convex, however, the solution obtained by gradient descent is guaranteed to be globally optimal.

In short, the idea of gradient descent is to use the negative gradient direction at the current position as the search direction, since that is the fastest descent direction at the current position; hence it is also known as the "steepest descent method". As the steepest descent method approaches the target value, the steps become smaller and progress slows. The search iterations of gradient descent are shown in the figure below:

Because gradient descent converges noticeably more slowly in the region close to the optimal solution, it requires many iterations. In machine learning, two variants have been developed from the basic method: stochastic gradient descent and batch gradient descent.


2 Stochastic gradient descent

The basic gradient descent algorithm traverses the whole data set when updating the regression coefficients; it is a batch method. When the training data set is extremely large, this can cause the following problems:

  1. Convergence can be very slow;
  2. If there are multiple local minima on the error surface, there is no guarantee that the process will find a global minimum.

2.1 Stochastic gradient descent

To solve this problem, in practice we use a variant of gradient descent called stochastic gradient descent.

The error in the formula above is computed over all training samples, while the idea of stochastic gradient descent is to update the weights from each individual training sample, so the gradient formula above becomes the one for a single sample (x^(i), y^(i)).

After the derivation, we obtain the final weight update formula:

θj := θj + α (y^(i) − h(x^(i))) · xj^(i)

With this update formula, we can feed in a large number of sample instances and continually adjust the weights toward the results we expect, eventually obtaining a set of weights with which the algorithm produces a correct, or arbitrarily close, result for a new sample input.

Here’s a comparison

Let the cost function be

J(θ) = (1/2) Σᵢ₌₁…m (h(x^(i)) − y^(i))²

For stochastic gradient descent, the parameters are updated as:

θj := θj + α (y^(i) − h(x^(i))) · xj^(i)

Here i is the sample index, j is the feature (dimension) index, m is the number of samples, and n is the number of features. Updating one θj therefore requires only a single sample.
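A minimal sketch of this single-sample update (the sample ordering, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sgd(X, y, alpha=0.05, epochs=500, seed=0):
    """Stochastic gradient descent: update theta from one sample at a time."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit the samples in random order
            error = y[i] - X[i] @ theta     # residual for this single sample
            theta += alpha * error * X[i]   # update uses only sample i
    return theta

# Toy data set with y = 2x; theta should approach [0, 2]
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = sgd(X, y)
print(theta)  # close to [0, 2]
```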

The following two animations vividly compare the various optimization methods (image source: sebastianruder.com/optimizing-…

Performance of SGD optimization methods on loss surface

As can be seen in the figure above, Adagrad, Adadelta and RMSprop immediately head in the correct direction on the loss surface and converge quickly, while Momentum and NAG initially stray off track. NAG, however, quickly corrects its course after the deviation, because its responsiveness is improved by a look-ahead gradient correction.

Performance of SGD optimization methods at saddle point of loss surface

2.2 Batch gradient descent

The parameters are updated as:

θj := θj + α Σᵢ₌₁…m (y^(i) − h(x^(i))) · xj^(i)

Here i is the sample index, j is the feature (dimension) index, m is the number of samples, and n is the number of features. Updating one θj therefore requires traversing the entire sample set.

In addition, how can the stochastic gradient method itself be optimized further? For details, see: Paper open course, part I: a detailed analysis of gradient descent and other optimization algorithms (with video and PPT downloads).


3 Newton’s method

3.1 Newton’s method

Newton's method is a method for finding roots of equations over the real and complex numbers. It uses the first few terms of the Taylor series of a function f(x) to find roots of the equation f(x) = 0. The greatest strength of Newton's method is that it converges quickly.

Specific steps

First, choose a point x0 close to a zero of the function f(x), and compute the corresponding f(x0) and the slope of the tangent line f′(x0) (where f′ denotes the derivative of f).

Then compute the x-coordinate of the point where the line through (x0, f(x0)) with slope f′(x0) crosses the X-axis, that is, solve the following equation:

f(x0) + f′(x0) · (x − x0) = 0

We call the new point x1; in general, x1 will be closer than x0 to the solution of f(x) = 0, so we can start the next iteration from x1. The iteration formula can be written as:

x_{n+1} = x_n − f(x_n) / f′(x_n)

It has been shown that if f′ is continuous and the zero x* being sought is isolated, then there is a neighborhood of x* such that Newton's method converges whenever the initial value x0 lies in that neighborhood. Moreover, if f′(x*) is not zero, Newton's method converges quadratically; roughly speaking, this means that the number of significant digits of the result doubles with each iteration.
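The iteration x_{n+1} = x_n − f(x_n)/f′(x_n) is easy to try out. The example below (the tolerance and starting point are arbitrary illustrative choices) finds √2 as the root of f(x) = x² − 2 and reaches full double precision in a handful of iterations, consistent with the digit-doubling behavior described above:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Find a root of f by Newton's method: x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        if abs(step) < tol:   # stop once the Newton step is negligible
            break
        x -= step
    return x

# Root of f(x) = x^2 - 2, starting near the zero
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.5)
print(root)  # 1.4142135623730951
```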

Because Newton’s method is based on the tangent line of the current position to determine the next position, Newton’s method is also vividly called the “tangent line method.” The search path of Newton’s method (two-dimensional case) is shown in the figure below:

Comparison of efficiency between Newton method and gradient descent method:

  1. In terms of convergence speed, Newton's method converges at second order, while gradient descent converges at first order. Newton's method is still a local algorithm, but it uses more local information: the gradient method considers only the direction, whereas Newton's method considers both the direction and the size of the step.
  2. Geometrically, as the wiki explanation puts it, Newton's method fits a quadric surface to the local surface at the current position, while gradient descent fits a plane. The quadric surface usually fits better than the plane, so the descent path chosen by Newton's method better matches the true optimal path.

Note: the red iteration path is Newton's method; the green iteration path is gradient descent.

Summary of advantages and disadvantages of Newton method:

  • Advantages: second order convergence, fast convergence;
  • Disadvantages: Newton's method is an iterative algorithm in which every step must solve for the inverse of the Hessian matrix of the objective function, so the computation is complex.

Quasi-Newton methods, among the most effective methods for solving nonlinear optimization problems, were proposed in the 1950s by W. C. Davidon, a physicist at Argonne National Laboratory in the United States. At the time, Davidon's algorithm was considered one of the most creative inventions in nonlinear optimization. Before long, R. Fletcher and M. J. D. Powell proved that the new algorithm was far faster and more reliable than other methods, and the discipline of nonlinear optimization leapt forward overnight.

Wikipedia explains the quasi-Newton method this way:

A quasi-Newton method is a class of algorithms, based on Newton's method, for finding zeros of nonlinear systems of equations or the maxima and minima of continuous optimization problems. Quasi-Newton methods come in handy when the Hessian matrix required by Newton's method is difficult or even impossible to compute.

The essential idea of quasi-Newton methods is to remedy Newton's method's defect of having to solve for the inverse of a complex Hessian matrix at every step: a positive-definite matrix is used to approximate the inverse of the Hessian, which simplifies the computation.

Like the steepest descent method, a quasi-Newton method only requires the gradient of the objective function at each iteration. By measuring the changes in the gradient, it constructs a model of the objective function good enough to produce superlinear convergence. Such methods are vastly superior to the steepest descent method, especially on difficult problems. Moreover, because it does not require second-derivative information, a quasi-Newton method is sometimes more efficient than Newton's method. Today, optimization software includes a large number of quasi-Newton algorithms for unconstrained, constrained, and large-scale optimization problems.

The basic idea of the quasi-Newton method is as follows.

Just like Newton's method, the quasi-Newton method approximates the objective function f with a quadratic function: the second-order Taylor expansion of f around the current iterate x_k is

f(x_k + Δx) ≈ f(x_k) + ∇f(x_k)ᵀ Δx + (1/2) Δxᵀ B Δx

where ∇f(x_k) denotes the gradient of f at x_k and B denotes (an approximation of) the Hessian matrix. The gradient can be further approximated in the following form:

∇f(x_k + Δx) ≈ ∇f(x_k) + B Δx

Setting this equal to zero yields the Newton step:

Δx = −B⁻¹ ∇f(x_k)

We then construct an approximation B_{k+1} that satisfies

∇f(x_{k+1}) − ∇f(x_k) = B_{k+1} (x_{k+1} − x_k)

This is called the secant equation. When f is a function defined on a multidimensional space, computing B_{k+1} from this formula is an *underdetermined* problem (there are more unknowns than equations), so the various quasi-Newton methods differ in how they pick a solution of the secant equation. Most methods additionally require B_{k+1} to be symmetric and to be as close as possible to B_k in some norm: that is, for some positive-definite choice of norm, B_{k+1} minimizes ‖B_{k+1} − B_k‖ subject to the secant equation; the classical updates (DFP, BFGS, SR1) arise in exactly this way.

The approximate Hessian matrix is generally initialized with the identity matrix. The solution of the optimization problem is then updated by computing the Newton step from the current approximation.

The following is a summary of one iteration of the algorithm:

  • Compute the gradient ∇f(x_{k+1}) at the new iterate
  • Set y_k = ∇f(x_{k+1}) − ∇f(x_k) and s_k = x_{k+1} − x_k
  • Use y_k and s_k to directly update an approximation of the *inverse* of the Hessian matrix
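The heart of the update step can be shown in isolation. Below is a sketch (names illustrative) of the BFGS update of the inverse-Hessian approximation H from one secant pair (s, y); for a quadratic objective the gradient change is exactly y = A·s, and after the update H satisfies the secant equation H·y = s:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H from the pair (s, y).

    After the update, H satisfies the secant equation H @ y = s.
    """
    rho = 1.0 / (s @ y)
    I = np.eye(len(s))
    return (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
        + rho * np.outer(s, s)

# Quadratic objective f(x) = 1/2 x^T A x, so the gradient change is y = A s
A = np.array([[2.0, 0.0], [0.0, 20.0]])
H = np.eye(2)              # initial approximation: the identity matrix
s = np.array([1.0, 0.5])   # a step taken by the algorithm
y = A @ s                  # corresponding change in the gradient
H = bfgs_update(H, s, y)
print(H @ y)  # equals s, so the secant equation holds
```

In a full quasi-Newton method, this update runs once per iteration, and the quasi-Newton step is then Δx = −H·∇f(x), with no Hessian inversion anywhere.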


4 Conjugate gradient

The conjugate gradient method lies between gradient descent (the steepest descent method) and Newton's method. It uses only first-derivative information, yet it overcomes the slow convergence of gradient descent while avoiding Newton's method's need to store and compute the Hessian matrix and its inverse. The conjugate gradient method is not only one of the most useful methods for solving large systems of linear equations; it is also one of the most efficient algorithms for solving large nonlinear optimization problems.

Among the various optimization algorithms, the conjugate gradient method is very important. Its advantages are small storage requirements, convergence in a finite number of steps (for quadratic problems), high stability, and no need for any external parameters.
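For the linear-equation use mentioned above, here is a sketch of the classic conjugate gradient iteration for solving A·x = b with A symmetric positive definite (the variable names are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive definite A by conjugate gradients."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x  # residual; the negative gradient of 1/2 x^T A x - b^T x
    p = r.copy()   # the first search direction is the steepest descent direction
    rs = r @ r
    for _ in range(n):             # at most n steps in exact arithmetic
        Ap = A @ p
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p  # next direction, A-conjugate to the previous ones
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
print(x)  # converges to the solution [1/11, 7/11] in at most 2 steps
```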

The following figure shows the path comparison between conjugate gradient method and gradient descent method for searching optimal solutions: