Editor’s note: the paper on Adam’s convergence by Google’s Reddi et al., which recently won a Best Paper award at ICLR 2018, proposes AMSGrad, a variant of Adam. So, in practice, can AMSGrad replace Adam, one of the most popular optimization methods in deep learning? Let’s take a look at an experiment conducted by Filip Korzeniowski of Johannes Kepler University Linz (JKU) in Austria.

In the ICLR 2018 best paper On the Convergence of Adam and Beyond, Google’s Reddi et al. point out a flaw in Adam’s convergence proof and propose AMSGrad, a variant of the Adam algorithm. The paper demonstrates the advantages of AMSGrad on a synthetic task and a few experiments. However, it only uses small networks (a single-layer MLP on MNIST, a small convolutional network on CIFAR-10) and does not report test accuracy (and obviously, we care more about accuracy than about cross-entropy loss). In terms of training and test loss, the convolutional network they trained on CIFAR-10 is also far from state-of-the-art results (and we do not know its accuracy).

Since I generally use Adam, I decided to evaluate AMSGrad on more realistic networks. Note that the models I train here are also small and not state-of-the-art, but they are definitely more realistic than the ones used in the paper.

I modified Lasagne’s Adam implementation to implement AMSGrad. I also added an option to turn off Adam’s bias correction.


     
import numpy as np
import theano
import theano.tensor as T
from collections import OrderedDict

from lasagne import utils
from lasagne.updates import get_or_compute_grads


def amsgrad(loss_or_grads, params, learning_rate=0.001, beta1=0.9,
            beta2=0.999, epsilon=1e-8, bias_correction=True):
    all_grads = get_or_compute_grads(loss_or_grads, params)
    t_prev = theano.shared(utils.floatX(0.))
    updates = OrderedDict()

    # Using theano constant to prevent upcasting of float32
    one = T.constant(1)

    t = t_prev + 1
    if bias_correction:
        a_t = learning_rate * T.sqrt(one - beta2**t) / (one - beta1**t)
    else:
        a_t = learning_rate

    for param, g_t in zip(params, all_grads):
        value = param.get_value(borrow=True)
        m_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)
        v_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)
        v_hat_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                   broadcastable=param.broadcastable)

        # exponential moving averages of the gradient and its square
        m_t = beta1 * m_prev + (one - beta1) * g_t
        v_t = beta2 * v_prev + (one - beta2) * g_t**2
        # AMSGrad: use the running maximum of the second-moment estimate
        v_hat_t = T.maximum(v_hat_prev, v_t)
        step = a_t * m_t / (T.sqrt(v_hat_t) + epsilon)

        updates[m_prev] = m_t
        updates[v_prev] = v_t
        updates[v_hat_prev] = v_hat_t
        updates[param] = param - step

    updates[t_prev] = t
    return updates

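Written out, the updates computed by this function are (with the step size a_t including the bias correction when it is enabled):

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{v}_t &= \max(\hat{v}_{t-1},\, v_t) \\
\theta_t &= \theta_{t-1} - a_t \,\frac{m_t}{\sqrt{\hat{v}_t} + \epsilon},
\qquad a_t = \alpha\,\frac{\sqrt{1-\beta_2^{\,t}}}{1-\beta_1^{\,t}}
\end{aligned}
$$

The only difference to Adam is the maximum in the third line: Adam divides by the square root of v_t directly, whereas AMSGrad never lets the denominator decrease.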

First, let’s check that the implementation is correct. I ran the synthetic experiment outlined in the paper (in its stochastic setting). The figure below shows the learning curve.

This is quite close to the figure in the paper and to the results of YCMario’s re-implementation. Adam incorrectly converges to x = 1, while the optimal value is x = -1.
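To make the setup concrete, here is a minimal NumPy sketch of the stochastic synthetic problem as I read it from the paper: the gradient is 1010 with probability 0.01 and -10 otherwise, and x is projected back onto [-1, 1] after each update. These constants are my reading of the paper, and bias correction is omitted for brevity.

import numpy as np

rng = np.random.RandomState(0)

def grad(x):
    # stochastic setting (assumed constants): gradient is 1010
    # with probability 0.01, otherwise -10
    return 1010.0 if rng.rand() < 0.01 else -10.0

def run(use_amsgrad, steps=500000, lr=0.001, beta1=0.9, beta2=0.99, eps=1e-8):
    # bias correction omitted for brevity
    x, m, v, v_hat = 0.0, 0.0, 0.0, 0.0
    for t in range(steps):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        v_hat = max(v_hat, v) if use_amsgrad else v
        x -= lr * m / (np.sqrt(v_hat) + eps)
        x = float(np.clip(x, -1.0, 1.0))  # project back onto [-1, 1]
    return x

print(run(use_amsgrad=False))  # Adam-style update: x tends to drift towards +1
print(run(use_amsgrad=True))   # AMSGrad: x should stay near the optimum, -1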

I then experimented with four configurations:

  • Logistic regression on MNIST.

  • CifarNet on CIFAR-10 (the CNN described in the paper).

  • SmallVgg on CIFAR-10 (a small VGG-style network).

  • Vgg on CIFAR-10 (a larger VGG-style network).

In each configuration, I ran training five times for every combination of the following hyperparameters:

  • Beta2 ∈ {0.99, 0.999}

  • Learning rate ∈ {0.01, 0.001, 0.0001}

  • Bias correction ∈ {on, off}

The batch size is 128 (as in the original paper). I trained for 150 epochs and decayed the learning rate linearly from its initial value down to 0 at the 150th epoch. For the CIFAR-10 experiments, I also used standard left-right flip data augmentation. The corresponding code is available on GitHub (fdlm/amsgrad_experiments).
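To illustrate the schedule, here is a minimal sketch (the helper below is hypothetical and not part of the linked code):

def linear_decay(initial_lr, epoch, n_epochs=150):
    # decays linearly from initial_lr at epoch 0 to 0 at epoch n_epochs
    return initial_lr * max(0.0, 1.0 - float(epoch) / n_epochs)

# in the training loop, the Theano shared variable holding the learning rate
# would be updated after every epoch, e.g.:
#     learning_rate.set_value(utils.floatX(linear_decay(0.001, epoch)))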

Let’s focus first on the Vgg configuration. While each of these configurations has its quirks, the conclusions we can draw from the results are similar.

Vgg on CIFAR-10

The convolutional network structure I use here is as follows:

Each convolutional layer is followed by batch normalization and a rectifier (ReLU) activation. Dotted lines mark dropout, with the corresponding probability indicated above each line. See the code for details.
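For a rough idea of what such a VGG-style block looks like in Lasagne, here is a sketch; the filter counts and dropout probabilities are placeholders rather than the values of the actual model (see the linked code for the real architecture):

from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DropoutLayer, DenseLayer, batch_norm)
from lasagne.nonlinearities import rectify, softmax


def vgg_block(net, num_filters, p_drop):
    # three 3x3 convolutions, each followed by batch normalization and ReLU
    for _ in range(3):
        net = batch_norm(Conv2DLayer(net, num_filters, 3, pad='same',
                                     nonlinearity=rectify))
    net = MaxPool2DLayer(net, 2)
    return DropoutLayer(net, p=p_drop)


net = InputLayer((None, 3, 32, 32))
net = vgg_block(net, 64, 0.25)    # placeholder filter counts and dropout rates
net = vgg_block(net, 128, 0.25)
net = vgg_block(net, 256, 0.25)
net = DenseLayer(DropoutLayer(net, p=0.5), 10, nonlinearity=softmax)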

In the plots below, Adam’s results are shown in blue and AMSGrad’s in red. Lighter colors indicate that bias correction is turned off. Each line represents one training run.

Training loss

In my experiments, the training loss gets very close to zero, whereas in the paper it stays around 0.3. This is of course because I use a much larger model. We also see that AMSGrad seems to converge faster in the final training phase if the learning rate is not too low, and that bias correction helps convergence in most cases. In the end, however, all variants converge to similar final losses for most hyperparameter configurations (except for the lowest learning rate). If we look closely, we find that Adam even overtakes AMSGrad in the last epochs. This is no big deal, though; it is only the training loss. However, these results are quite different from those reported for the CifarNet model in the paper, where the training loss of AMSGrad is much lower than Adam’s.

Validation loss

The validation loss behaves differently. Here, AMSGrad consistently outperforms Adam, especially in the later epochs. Both algorithms reach a similar minimum validation loss (around epochs 20-25), but after that Adam seems to overfit more, at least in terms of cross-entropy loss. So can AMSGrad make up for the observation that Adam often generalizes worse than standard SGD (stochastic gradient descent)? Answering that would require some runs with plain SGD.

Training accuracy

Let’s look at accuracy. In classification problems, accuracy is a much more relevant measure than cross-entropy loss; after all, what we care about is how many samples the model classifies correctly. First, the training accuracy. On CIFAR-10, most reasonably powerful models can reach close to 100% training accuracy if trained properly (in fact, they can do so even with random labels).

As we can see, training accuracy behaves much like the training loss: AMSGrad converges faster than Adam, but the end results are similar. As long as the hyperparameters are not too far off (again, the learning rate must not be too low), we reach almost 100% accuracy regardless of which algorithm we use.

Finally, the most interesting part: validation accuracy.

Validation accuracy

This is where the disappointment sets in. Although AMSGrad achieves a lower validation loss for all hyperparameter configurations, it is basically a tie in terms of validation accuracy. Adam performs better in some settings (lr = 0.01, β2 = 0.999), AMSGrad in others (lr = 0.001, β2 = 0.999). Adam achieves the best result of all configurations (lr = 0.001, β2 = 0.99), but I do not think the difference is significant. It seems that, given appropriate hyperparameter settings, both algorithms perform about equally well.

Discussion

The paper points out a fatal flaw in Adam’s convergence proof and presents AMSGrad, a variant of Adam that comes with a convergence proof. However, I find the paper’s experimental assessment of AMSGrad’s practical impact limited. In the paper, the authors claim (emphasis mine) that “AMSGrad performs much better than Adam in terms of training loss and accuracy. In addition, this performance improvement also translates into the test loss.” Unfortunately, with the models and training schemes I used, my experiments could not confirm these claims:

  • Both algorithms performed similarly on the training set, in terms of both loss and accuracy.

  • AMSGrad does achieve a lower test (here, validation) loss. However,

  • The improvement in test loss did not translate into better test accuracy (to be fair, the authors never claimed this).

  • The discrepancy between test loss and test accuracy raises a key question: how well suited is categorical cross-entropy as a training objective for classification networks? Can we do better?

For the record: I really appreciate the paper’s authors pointing out Adam’s weaknesses. Although I have not verified the proof myself (and apparently, until this paper, nobody had scrutinized Adam’s original proof closely enough), I tend to believe their conclusions. The synthetic examples also show that Adam does fail under certain conditions. I think it is a good paper.

However, its practical impact needs more empirical exploration. The experiments I describe in this post do not show a big practical difference between Adam and AMSGrad.


Appendix

Here I show the results of the other setups.

Logistic regression on MNIST

This setup is similar to an experiment in the paper. The only difference is the learning-rate schedule: the paper uses α/√t, where t is the iteration number, whereas I use the per-epoch linear decay described earlier. Let’s look at the results.
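In symbols, the two schedules are roughly:

$$
\alpha_t = \frac{\alpha}{\sqrt{t}} \quad \text{(paper, per iteration } t\text{)},
\qquad
\alpha_e = \alpha\left(1 - \frac{e}{150}\right) \quad \text{(this post, per epoch } e\text{)}
$$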

Training loss

The first thing we see is that the linear learning-rate decay leads to much lower training losses. In the paper, the training curve appears to flatten out at about 0.25 after more than 5000 iterations (batch size 128, roughly 13 epochs). In my runs, it goes down to 0.2, depending on the hyperparameters. Another interesting observation is that Adam achieves the lowest final training loss of all configurations. This contradicts the results in the paper, especially since the paper claims that AMSGrad is more robust to hyperparameter changes than Adam.

Validation loss

Things look different here. AMSGrad generally gives the lowest validation loss (unless we use too low a learning rate). However, Adam achieves the lowest validation loss of all at lr = 0.002 and β2 = 0.99 (the difference is negligible), but then diverges.

Training accuracy

Training accuracy closely mirrors the training loss, and the same holds for validation accuracy with respect to validation loss.

Validation accuracy

According to these results, AMSGrad has no clear practical advantage over Adam. The outcome depends on the choice of learning rate and β2. Again, the results do not confirm the claim that AMSGrad is more robust to hyperparameter changes.

CifarNet on CIFAR-10

I tried to reproduce the convolutional neural network described in the paper. However, I was not able to reproduce their training results. Depending on the learning rate and β2, I either got a better training loss (but a worse validation loss), or both were worse than in the paper. Since the paper does not provide all the details of the model (for example, we do not know the initialization scheme or whether L2 regularization was used), it is difficult to find out why. Anyway, here is what I got.

Training loss

We see that neither algorithm converges with the higher learning rates. In terms of training loss, a learning rate of 0.001 seems to work best. Although Adam achieves the lowest training loss, training with AMSGrad seems to be more stable (less variance between runs), but I think more runs would be needed to confirm this.

Validation loss

The validation loss diverges badly in the configurations that achieve the best training loss. Both algorithms are similarly affected. The models trained with lower learning rates appear to generalize much better. To spoil it already: the picture changes considerably once we look at accuracy.

Training accuracy

Training accuracy behaves similarly to the training loss.

Validation accuracy

This is the most interesting part. The models whose validation loss diverged badly actually achieve the best validation accuracy, while the models that seemed to generalize better in terms of validation loss generalize worse in terms of accuracy. This is important to keep in mind.

Depending on the setup, either Adam or AMSGrad comes out ahead by a tiny margin. More importantly, regardless of the optimization algorithm, this model performs considerably worse than current state-of-the-art models (only 78% accuracy).

SmallVgg on CIFAR-10

This is a smaller version of the model I used in the main part of this article: each layer has half the number of filters, and each block has only two convolutional layers. I omit the interpretation of these results, because they add nothing new.

Training loss

Validation loss

Training accuracy

Validation accuracy

Original post: https://fdlm.github.io/post/amsgrad/