Gradient descent
With the basics and matrix operations behind us, let’s go back to the linear regression code from the first section:
```python
import tensorflow as tf
import numpy as np

trX = np.linspace(-1, 1, 101)
trY = 2 * trX + np.random.randn(*trX.shape) * 0.33  # create some noisy values around a line

X = tf.placeholder("float")
Y = tf.placeholder("float")

def model(X, w):
    return tf.multiply(X, w)  # X * w

w = tf.Variable(0.0, name="weights")
y_model = model(X, w)

cost = tf.square(Y - y_model)  # squared error as the optimization goal
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)  # gradient descent optimization

# Launch the graph in a session
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    for i in range(100):
        for (x, y) in zip(trX, trY):
            sess.run(train_op, feed_dict={X: x, Y: y})
    print(sess.run(w))
```
By now you should be able to make sense of all of it, except for this one line:

```python
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
```

In this section we will focus on that line we don't yet understand: the gradient descent optimizer.
Let’s start by drawing the graph of the function
There is actually no mystery to gradient descent: it is simply the problem of finding an extreme value of a function. And one advantage functions have over matrices is that you can draw them.
So let’s learn how to graph a function:
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-10, 10, 1000)
y = x ** 2 - 2 * x + 1

plt.plot(x, y)
plt.title("matplotlib")
plt.ylabel("width")
# plt.legend(["x", "y"], loc="upper right")
plt.grid(True)
plt.show()
```
This plots the values of $y = x^2 - 2x + 1$.
The picture looks something like this:
Minimize the function
Now we want to find the lowest point on this curve. The domain of the function is infinite, so we cannot try every value from minus infinity to infinity and keep the smallest. But we can start at any point, compare it with the neighboring point on each side, and see which is smaller. We then move to the smaller point and repeat the process.
Say we start at $x = -5$, where $y = (-5)^2 - 2 \times (-5) + 1 = 36$. So our starting point is $(-5, 36)$.
With a step size of 0.1, we check the $y$ values at $-5.1$ and $-4.9$ and find $(-5.1, 37.21)$ and $(-4.9, 34.81)$. The value at $-4.9$ is smaller, so $-4.9$ becomes the new iterate. From $(-4.9, 34.81)$ we move to $(-4.8, 33.64)$, and so on, all the way down to the minimum at $(1.0, 0)$, where moving either left or right makes the value larger again; we have found the extreme value.
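Here is a minimal sketch of that probe-and-step search in Python (the function, the 0.1 step, and the $-5$ starting point all match the walkthrough above):

```python
def f(x):
    return x ** 2 - 2 * x + 1

x = -5.0     # starting point
step = 0.1   # fixed step size

while True:
    here, left, right = f(x), f(x - step), f(x + step)
    if left >= here and right >= here:
        break                                   # both neighbors are higher: we found the bottom
    x = x - step if left < right else x + step  # move toward the smaller neighbor

print(x, f(x))  # ends up at roughly (1.0, 0.0)
```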
The function whose extreme value we are seeking is called the loss function, cost function, or error function; the names are used interchangeably.
The minimizing point is written $x^* = \arg\min f(x)$.
This method works. The problem is that it is slow. We could speed it up by taking larger steps where the function is falling fast, and smaller steps where it is falling slowly so that we don't run past the minimum.
So how do we measure the rate of decline? We have all learned about the derivative: it is the slope of the tangent line. Once we have the derivative, we can simply move in the direction opposite to it, which is downhill.
This technique of reducing the value of $f(x)$ by moving a small step in the direction opposite to the derivative at $x$ is called gradient descent.
The $(n+1)$-th value is the $n$-th value minus the derivative times an adjustable coefficient.
That is
$$x_{n+1} = x_n - \eta \frac{df(x)}{dx}$$
The adjustable coefficient $\eta$ is called the learning rate. The larger the learning rate, the faster the descent, but it may also overshoot the minimum and cause oscillation.
The usual way to choose a learning rate is to take a small constant. You can also compute the step size at which the derivative would vanish, or try several learning rates and keep whichever works best.
The derivative of $f(x) = x^2 - 2x + 1$ is $f'(x) = 2x - 2$.
We again choose the point $(-5, 36)$ as the starting point, with a learning rate of 0.1.
Then the next point is $x_2 = -5 - 0.1 \times (2 \times (-5) - 2) = -3.8$.
The second point drops to $(-3.8, 23.04)$. Compare this with the first algorithm: there we moved 0.1 to the right, here we moved 1.2 to the right, 12 times as far in one step.
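The same idea written as a loop, a minimal sketch of one-dimensional gradient descent on our function:

```python
def df(x):
    return 2 * x - 2        # derivative of x**2 - 2*x + 1

x = -5.0                    # starting point
eta = 0.1                   # learning rate

for _ in range(30):
    x = x - eta * df(x)     # first step: -5 - 0.1 * (-12) = -3.8, as computed above

print(x)  # approaches the minimum at x = 1
```

For this function, a learning rate above 1.0 would make the iterates jump back and forth across the minimum instead of converging, which is the oscillation mentioned earlier.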
Into the three-dimensional world
Let’s look at the loss function we use for linear regression:
```python
cost = tf.square(Y - y_model)
```
Written out over the whole training set, it is $\sum_{i=1}^{n} \big(y_i - (w x_i + b)\big)^2$, where the $y_i$ and $x_i$ are the given training values. There are only two unknowns, $w$ and $b$.
So gradient descent now has to find the minimum of a quadratic function of two variables.
Let's first see what the graph of a quadratic function of two variables looks like. Take $f(x, y) = x^2 + y^2 + x + y + 1$ as an example.
First, let's learn how to draw it:
```python
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

# Create the 3D figure object
fig = plt.figure()
ax = Axes3D(fig)

x3 = np.linspace(-10, 10, 100)
y3 = np.linspace(-10, 10, 100)
X3, Y3 = np.meshgrid(x3, y3)
Z3 = X3 * X3 + Y3 * Y3 + X3 + Y3 + 1

ax.plot_surface(X3, Y3, Z3, cmap=plt.cm.winter)
plt.show()
```
The graph looks something like this:
The curve has become a bowl-shaped surface, but there is still a minimum point.
Now there is not just one direction $x$ but two, $x$ and $y$. We can do in each direction exactly what we did before: take partial derivatives in the $x$ and $y$ directions.
$$x_{n+1} = x_n - \eta \frac{\partial f(x, y)}{\partial x}, \qquad y_{n+1} = y_n - \eta \frac{\partial f(x, y)}{\partial y}$$
The $d$ has become $\partial$, but the rules of differentiation are the same: when differentiating with respect to $x$, you simply treat $y$ as a constant.
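For the bowl function above, for example, treating the other variable as a constant gives:

$$\frac{\partial f(x, y)}{\partial x} = 2x + 1, \qquad \frac{\partial f(x, y)}{\partial y} = 2y + 1$$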
The vector of partial derivatives is called the gradient: $\nabla f = \left( \frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y} \right)$. The inverted triangle $\nabla$ is pronounced "nabla".
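Putting this together, here is a sketch of gradient descent on $f(x, y) = x^2 + y^2 + x + y + 1$, using the partial derivatives just computed:

```python
x, y = -5.0, -5.0   # starting point
eta = 0.1           # learning rate

for _ in range(50):
    dx, dy = 2 * x + 1, 2 * y + 1        # the gradient
    x, y = x - eta * dx, y - eta * dy    # step against the gradient in both directions

print(x, y)  # approaches the minimum at (-0.5, -0.5)
```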
A basic operation like this is naturally something TensorFlow takes care of for us; we don't need to worry about implementing gradient descent, we only need to call tf.train.GradientDescentOptimizer.
Let's review the example we saw earlier:
```python
cost = tf.square(Y - y_model)  # squared error as the optimization goal
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)  # gradient descent optimization
```
The 0.01 passed to GradientDescentOptimizer is our $\eta$, the learning rate.
The evolution of adaptive gradients
If you have been following the logic closely, you will have noticed that when we upgraded from the two-dimensional derivative to the three-dimensional gradient, one parameter did not expand with us: the learning rate $\eta$.
For example, take $f(x, y) = 8x^2 + y^2 + x + y + 1$:
```python
Z3 = 8 * X3 * X3 + Y3 * Y3 + X3 + Y3 + 1
```
The image looks like this:
The rates of descent along the $\frac{\partial f(x, y)}{\partial x}$ direction and the $\frac{\partial f(x, y)}{\partial y}$ direction are now obviously different.
In higher dimensions the problem can be even more serious: it becomes hard to find a single learning rate that suits every dimension.
The simplest fix would be to give each dimension its own learning rate, but that is too much trouble. Instead, John C. Duchi, then a PhD student at UC Berkeley, proposed Adaptive Gradient, or AdaGrad for short. In TensorFlow we can use it through tf.train.AdagradOptimizer.
AdaGrad's update rule is

$$(x_{t+1})_i = (x_t)_i - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} \left(\nabla f(x_\tau)\right)_i^2}} \left(\nabla f(x_t)\right)_i$$
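To make the formula concrete, here is a toy sketch of AdaGrad in NumPy on the elongated bowl $f(x, y) = 8x^2 + y^2 + x + y + 1$ from above (the constants are illustrative, not from any paper):

```python
import numpy as np

def grad(p):                             # gradient of 8*x**2 + y**2 + x + y + 1
    return np.array([16 * p[0] + 1, 2 * p[1] + 1])

p = np.array([-5.0, -5.0])
g2 = np.zeros(2)                         # per-dimension sum of squared gradients
eta, eps = 1.0, 1e-8                     # eps only guards against division by zero

for _ in range(1000):
    g = grad(p)
    g2 += g ** 2
    p -= eta / (np.sqrt(g2) + eps) * g   # steep dimensions get smaller effective steps

print(p)  # approaches the minimum at (-1/16, -1/2)
```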
But AdaGrad has problems of its own. The step size adapts per dimension, but $\eta$ still matters: if it is too large, the optimization is still unstable. And because the sum of squared gradients only ever grows, the effective learning rate gets smaller and smaller as you optimize, so the iteration may grind to a halt before it ever reaches the minimum.
So Matthew Zeiler, then an intern at Google, made two improvements: first, he introduced a time window, accumulating gradients over a decaying window instead of the whole history; second, he removed the need to manually specify a learning rate at all. The improved algorithm is called AdaDelta.
The RMS used in the formula refers to Root Mean Square.
The formula of AdaDelta is

$$(\Delta x_t)_i = -\frac{\left(RMS[\Delta x]_{t-1}\right)_i}{\left(RMS[g]_t\right)_i} \left(\nabla f(x_t)\right)_i$$
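As a toy sketch of these formulas in NumPy (decaying averages stand in for the time window; rho = 0.95 and eps = 1e-6 are the kinds of values used in the paper's experiments, and note there is no eta anywhere):

```python
import numpy as np

def grad(p):                       # gradient of 8*x**2 + y**2 + x + y + 1
    return np.array([16 * p[0] + 1, 2 * p[1] + 1])

p = np.array([-5.0, -5.0])
Eg2 = np.zeros(2)                  # decaying average of squared gradients
Edx2 = np.zeros(2)                 # decaying average of squared updates
rho, eps = 0.95, 1e-6

for _ in range(5000):
    g = grad(p)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g   # RMS[dx] / RMS[g] * gradient
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    p += dx

print(p)  # settles near the minimum at (-1/16, -1/2)
```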
See Zeiler's paper for the details: arxiv.org/pdf/1212.57…
In TensorFlow, we just use the tf.train.AdadeltaOptimizer wrapper.
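Swapping it into the earlier regression example is a one-line change (a sketch; note that TensorFlow's wrapper still exposes a learning_rate argument that scales the update, so 1.0 corresponds to the pure rule above):

```python
train_op = tf.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95).minimize(cost)
```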
AdaDelta is not the only improvement along these lines; there are also RMSProp, Adam, and others, which we can invoke through tf.train.RMSPropOptimizer and tf.train.AdamOptimizer.
Remember the convolutional neural network from section 1? There we used RMSProp instead of plain gradient descent:
```python
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=py_x, labels=Y))
train_op = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cost)
predict_op = tf.argmax(py_x, 1)
```
Inspiration from physics
Although there are all kinds of adaptive gradient descent methods, the complex ones are not necessarily the best. The derivative has a physical interpretation: velocity. Borrowing the concept of momentum from physics, we can give the descent inertia: if it runs into a small pothole, momentum carries it out, where other methods might need many iterations to escape. In TensorFlow, the momentum method is available through tf.train.MomentumOptimizer.
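A sketch of the momentum update, again on the elongated bowl (v is the velocity; the form matches TensorFlow's accumulator update, and 0.9 is a common choice for the momentum coefficient):

```python
import numpy as np

def grad(p):                  # gradient of 8*x**2 + y**2 + x + y + 1
    return np.array([16 * p[0] + 1, 2 * p[1] + 1])

p = np.array([-5.0, -5.0])
v = np.zeros(2)               # velocity: a decaying memory of past gradients
eta, mu = 0.01, 0.9           # learning rate and momentum coefficient

for _ in range(500):
    v = mu * v + grad(p)      # inertia: the old velocity persists, the gradient pushes it
    p -= eta * v

print(p)  # approaches the minimum at (-1/16, -1/2)
```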
Take a look at Matthew Zeiler's paper for a comparison of these approaches:
As you can see, the momentum method is not as stable as AdaGrad in the early stages, but it performs best in the end.