What is an Optimizer?

The more complex a neural network is and the more data it has to process, the longer it takes to train, simply because there is far more computation to do. Yet complex problems often demand complex architectures and large datasets, so we need ways to make training both smart and fast. Methods that speed up neural network training are called optimizers.

Common optimizers

TensorFlow Optimizer API:

tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdagradOptimizer
tf.train.AdagradDAOptimizer
tf.train.MomentumOptimizer
tf.train.AdamOptimizer
tf.train.FtrlOptimizer
tf.train.ProximalGradientDescentOptimizer
tf.train.ProximalAdagradOptimizer
tf.train.RMSPropOptimizer
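
All of these classes share the same basic usage pattern: construct the optimizer, then call its minimize() method on the model's loss tensor to get a training op. Below is a minimal sketch using the TensorFlow 1.x-style API listed above; the toy data, model, learning rate, and step count are illustrative assumptions, not values from this article.

import tensorflow as tf

# Toy data roughly following y = 2x + 1 (made up for illustration).
x = tf.constant([[0.0], [1.0], [2.0], [3.0]])
y = tf.constant([[1.0], [3.1], [4.9], [7.2]])

# A one-weight linear model and a mean-squared-error loss.
w = tf.Variable([[0.0]])
b = tf.Variable([0.0])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

# Any optimizer from the list above can be swapped in here;
# AdamOptimizer is used purely as an example.
train_op = tf.train.AdamOptimizer(learning_rate=0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op)          # one call = one parameter update
    print(sess.run([w, b, loss]))   # w should end up near 2 and b near 1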

A visual comparison of several optimizers

Example 1

(1) In terms of descent speed

  • The two momentum optimizers, Momentum and NAG, were the fastest, followed by the three adaptive optimizers AdaGrad, AdaDelta and RMSProp, with SGD the slowest.

(2) In terms of convergence trajectory

  • The two momentum optimizers, although fast, took a long detour in the early and middle stages.
  • Among the three adaptive optimizers, Adagrad initially went off course but quickly corrected itself; even so, it travelled the longest path of the three. AdaDelta and RMSProp followed similar trajectories, but RMSProp wobbled noticeably as it approached the target.
  • SGD took the shortest and most direct path of all the optimizers; a sketch for reproducing this kind of comparison follows below.
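
The figure this example describes is not reproduced here, but the same kind of comparison is easy to set up: run each optimizer on a simple two-dimensional loss surface from the same starting point and record its (x, y) trajectory. The sketch below uses an elongated quadratic bowl; the surface, starting point, learning rates, and step count are all illustrative assumptions, so the exact trajectories will not match the figure.

import tensorflow as tf

def trajectory(make_opt, steps=100):
    """Run one optimizer on a simple 2-D surface and record its path."""
    tf.reset_default_graph()
    x = tf.Variable(-4.0)
    y = tf.Variable(3.0)
    loss = 0.5 * x ** 2 + 5.0 * y ** 2      # a valley stretched along the x axis
    train_op = make_opt().minimize(loss)
    path = []
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(steps):
            path.append(sess.run([x, y]))
            sess.run(train_op)
    return path

optimizers = {
    "SGD":      lambda: tf.train.GradientDescentOptimizer(0.05),
    "Momentum": lambda: tf.train.MomentumOptimizer(0.05, momentum=0.9),
    "NAG":      lambda: tf.train.MomentumOptimizer(0.05, momentum=0.9, use_nesterov=True),
    "AdaGrad":  lambda: tf.train.AdagradOptimizer(0.5),
    "AdaDelta": lambda: tf.train.AdadeltaOptimizer(1.0),
    "RMSProp":  lambda: tf.train.RMSPropOptimizer(0.05),
}

# Each entry in paths is a list of (x, y) points that can be plotted
# to compare convergence trajectories as in the example above.
paths = {name: trajectory(fn) for name, fn in optimizers.items()}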

Example 2

  • The three adaptive-learning-rate optimizers were not trapped by the saddle point; among them, AdaDelta descended the fastest, while Adagrad and RMSProp were neck and neck.
  • The two momentum optimizers, Momentum and NAG, as well as SGD, all entered the saddle point. The momentum optimizers wobbled around it for a while, then escaped and descended quickly, overtaking Adagrad and RMSProp.
  • Unfortunately, SGD entered the saddle point, stayed there, and never descended further; a small saddle-point experiment is sketched below.
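
A saddle point can be mimicked with the surface z = x^2 - y^2: near y = 0 the gradient along y is almost zero, so an optimizer that relies purely on the current gradient barely moves in that direction. The sketch below starts each optimizer just off the ridge and reports how far it has moved along y after a fixed number of steps; the surface, starting point, and hyper-parameters are illustrative assumptions, so the exact ranking may differ from the figure described above.

import tensorflow as tf

def escape_distance(make_opt, steps=200):
    """How far an optimizer moves along y, starting almost on the saddle ridge."""
    tf.reset_default_graph()
    x = tf.Variable(1.0)
    y = tf.Variable(1e-3)                  # a tiny nudge off the ridge
    loss = x ** 2 - y ** 2                 # saddle surface z = x^2 - y^2
    train_op = make_opt().minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(steps):
            sess.run(train_op)
        return abs(sess.run(y))

for name, make_opt in [
    ("SGD",      lambda: tf.train.GradientDescentOptimizer(0.01)),
    ("Momentum", lambda: tf.train.MomentumOptimizer(0.01, momentum=0.9)),
    ("RMSProp",  lambda: tf.train.RMSPropOptimizer(0.01)),
]:
    print(name, escape_distance(make_opt))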

How to pick the right optimizer

In fact, the two visualizations above show that SGD is the slowest, but that does not mean it should not be used in practice. After all, speed is not the only deciding factor in real use; accuracy matters as well. That said:

  1. While studying and debugging a neural network, use one of the faster optimizers, such as Adagrad or RMSProp.
  2. Once the research is done and the model is built, if you need the most accurate results, for example to publish a paper, it is worth trying every optimizer, because you cannot know in advance which one will give the best results for your network; a sketch of such a sweep follows below.
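
Point 2 amounts to a simple sweep: train the same model once per candidate optimizer and compare the results. A minimal sketch is below; the tiny stand-in loss, learning rates, and step count are illustrative assumptions, and in a real experiment you would compare a validation metric rather than the training loss.

import tensorflow as tf

candidates = {
    "SGD":      lambda: tf.train.GradientDescentOptimizer(0.1),
    "Momentum": lambda: tf.train.MomentumOptimizer(0.1, momentum=0.9),
    "AdaGrad":  lambda: tf.train.AdagradOptimizer(0.1),
    "RMSProp":  lambda: tf.train.RMSPropOptimizer(0.01),
    "Adam":     lambda: tf.train.AdamOptimizer(0.01),
}

results = {}
for name, make_opt in candidates.items():
    tf.reset_default_graph()
    # Stand-in for building your own network and loss.
    w = tf.Variable([3.0, -2.0])
    loss = tf.reduce_sum(tf.square(w))     # minimum at w = [0, 0]
    train_op = make_opt().minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(300):
            sess.run(train_op)
        results[name] = sess.run(loss)

# Compare the candidates and keep the one that performs best on your data.
print(results)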