Article source | Turbine Cloud Community (an AI/deep learning cloud GPU training platform; see the official site to try it out)

Original article | NAG optimizer

Original author | Angle of Ash


The community is full of talented people. Today the editor came across a real treasure of a moderator, 'Angle of Ash'.

The editor loves to share. Seeing such a good article from such a good author, how could I not share it with you? So follow along for a quick look at what it says!

The original article begins below.

While reading the fairseq source code recently, I noticed that its implementation of the NAG optimizer (Nesterov Accelerated Gradient) differs slightly from the one in torch, so I decided to look into it.

Recall Momentum: gradient descent with momentum introduces the notion of velocity and uses the coefficient μ to take an exponentially weighted moving average of the historical gradients. The oldest gradients decay the fastest and have little impact on the current update; conversely, the more recent a gradient is, the greater its influence on the update. The formula is:
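In one common formulation (assuming the convention where the velocity accumulates the negative, learning-rate-scaled gradient):

V_{t+1} = μ·V_t − lr·g(θ_t)
θ_{t+1} = θ_t + V_{t+1}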

Here Vt, g(θt) and θt denote the velocity, the gradient and the model parameters at time t, μ is the momentum coefficient and lr is the learning rate. The idea of this method is to smooth the updates to the network parameters so that the gradient does not swing too widely.

NAG is similar to Momentum in that it also uses historical gradients to update the parameters. The difference is that NAG first uses μVt to partially update θt, obtaining the look-ahead point θt+μVt, and then uses the gradient g(θt+μVt) at that point to obtain θt+1. The formula is as follows:
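In the same notation (again one common formulation; the only change is that the gradient is evaluated at the look-ahead point θt+μVt):

V_{t+1} = μ·V_t − ε·g(θ_t + μ·V_t)
θ_{t+1} = θ_t + V_{t+1}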

Here ε is the learning rate.

The figure below illustrates Momentum and NAG:

Momentum uses the velocity Vt and the gradient g(θt) at the current position A to update directly to the destination C; NAG, on the other hand, first takes a small step from A to B along the inertial direction (B is already close to C), and then uses the gradient g(θt+μVt) at B to complete the update to C.

The paper argues that this lets the velocity V be corrected more quickly, which makes NAG more stable than Momentum and better suited to scenarios with large learning rates. In addition, if the inertial step μVt moves the parameters to a poor point B, the gradient g(θt+μVt) at B is larger than Momentum's gradient g(θt), so NAG can pull the update back towards the starting point A more quickly.

Without further ado, here are the update formula and code of torch.optim.SGD [1]:
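In simplified form (ignoring weight decay and dampening, and with hypothetical variable names), the torch-style Nesterov step per parameter looks roughly like the following sketch, not the actual library code:

def torch_style_nesterov_step(param, grad, buf, lr, momentum):
    """Sketch of one torch-style SGD step with nesterov=True.

    Assumptions: no weight decay, no dampening; buf is the momentum
    buffer carried between steps. Not the actual library code.
    """
    buf = momentum * buf + grad          # V_{t+1} = mu * V_t + g_t
    step_dir = grad + momentum * buf     # Nesterov correction: g_t + mu * V_{t+1}
    param = param - lr * step_dir        # theta_{t+1} = theta_t - lr * step_dir
    return param, buf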

And here are the formula and code of fairseq.optim.nag [2]:
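Stripped of weight decay and fairseq's learning-rate-correction bookkeeping, the fairseq-style update amounts roughly to the following sketch (same symbols as above; note that here the stored parameter is the look-ahead point θt + μVt):

def fairseq_style_nag_step(param, grad, buf, lr, momentum):
    """Sketch of one fairseq-style NAG step in the reparameterized form.

    Assumptions: param stores theta'_t = theta_t + mu * V_t, buf stores
    the velocity V_t; weight decay and the lr-correction factor are omitted.
    """
    param = param + momentum**2 * buf - (1 + momentum) * lr * grad
    buf = momentum * buf - lr * grad     # V_{t+1} = mu * V_t - lr * g(theta'_t)
    return param, buf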

Comparing the two, they are indeed somewhat different, and fairseq's NAG is essentially consistent with the formula in the paper, as given in [3]:

Here β is the momentum coefficient, written μ earlier in this post. Substituting θt+βVt gives θt', and θt' is then used as the parameter actually being updated; in other words, what gets updated is always θt+βVt. This is also illustrated in [4].
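To see why the two coincide, substitute θ'_t = θ_t + β·V_t into the paper's update (a short derivation in the notation above, with ε the learning rate):

V_{t+1} = β·V_t − ε·g(θ'_t)
θ_{t+1} = θ_t + V_{t+1}
θ'_{t+1} = θ_{t+1} + β·V_{t+1} = θ'_t − β·V_t + (1+β)·V_{t+1} = θ'_t + β²·V_t − (1+β)·ε·g(θ'_t)

This is exactly the form sketched above for fairseq's implementation: the quantity actually stored and updated is the look-ahead point θt + βVt.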

In summary, the NAG optimizer accelerates convergence and remains stable even at very large learning rates. No wonder fairseq's ConvS2S can use a learning rate of 0.5.

[1] torch SGD

[2] fairseq NAG

[3] An optimization method for deep learning

[4] CS231n Convolutional Neural Networks for Visual Recognition