Adam vs SGD + LR schedules?

Dear all,

I just stumbled on some discussions and a paper today arguing that SGD yields better generalization than Adam/RMSProp and similar adaptive optimizers, and I was wondering what everyone’s opinion on that is. What do you use? So far I’ve been using Adam, and I’ve noticed several times that restarting it (even keeping the LR fixed) improves performance. Usually I train models until convergence and then restart with a decreased LR. This has served me well so far.
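For reference, the restart routine I mean is basically this (a rough sketch only; `model`, `train_loader` and `train_one_epoch` are placeholder names, not code from my project):

```python
import torch

# Rough sketch of the restart scheme: train until the loss levels out, then re-create
# the Adam optimizer (fresh moment estimates) with a lower LR and keep training.
# `model`, `train_loader` and `train_one_epoch` are placeholders.
lr = 1e-3
for restart in range(3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(25):                      # train this round until convergence
        train_one_epoch(model, train_loader, optimizer)
    lr /= 10                                     # decrease LR before the next restart
```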

To sum up: 1) Have you done the test, Adam vs SGD with momentum, with both finely tuned? 2) What is your preferred LR scheduler? I’ll do some tests over the following days and report results.

I found this discussion very interesting. See also this.

Cheers

Hi @feevos,

When working on models for CIFAR-10 I found a similar effect to that suggested by the paper. The training loss with Adam was comparable to that of SGD with momentum, but the validation/test loss was slightly higher. With regard to the schedule, I also reduced my learning rate in steps (once the learning curve appeared to level out), usually by a factor of 10. I found a learning-rate warm-up (e.g. increasing it linearly over 5 epochs) helped quite a lot, especially for large batch sizes, as suggested in this paper.
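For concreteness, something along these lines (not my exact code; the base LR and milestone epochs are illustrative):

```python
import torch

# Linear warm-up over 5 epochs, then step decay by a factor of 10 at fixed milestones,
# expressed as a LambdaLR multiplier on the base learning rate.
model = torch.nn.Linear(10, 2)                       # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

warmup_epochs = 5
milestones = [30, 60]                                # illustrative decay points

def lr_lambda(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs           # linear warm-up
    return 0.1 ** sum(epoch >= m for m in milestones)  # divide by 10 at each milestone

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(80):
    # ... one epoch of training would go here ...
    scheduler.step()
```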

Recently I came across an interesting method where the learning rate is increased and decreased cyclically (see here), and I’m just in the process of trying it out! Will let you know the results.
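In case it helps, recent PyTorch versions include a built-in CyclicLR scheduler; a minimal sketch (the base/max LR and step size are illustrative and would ideally come from an LR range test):

```python
import torch

# Triangular cyclical learning rate with PyTorch's built-in scheduler.
model = torch.nn.Linear(10, 2)                       # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-3, max_lr=6e-3,
    step_size_up=2000,                               # iterations spent rising to max_lr
    mode='triangular')

for iteration in range(4000):
    # ... forward/backward/optimizer.step() on one mini-batch would go here ...
    scheduler.step()                                 # stepped per iteration, not per epoch
```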


Hi, a quick update on this for anyone who is interested. I’ve done some tests on the Adam vs SGD question, for a semantic segmentation problem using the ISPRS Potsdam dataset. To cut the (rather long) story short, I am using a cyclical learning rate that looks like the following image (orange line):
[Figure CycleLR_tests: the cyclical learning rate (orange line) and its overall envelope (blue line)]

where the peak, the frequency, and the geometric shape of the overall envelope (blue line) are fine-tuned via hyperparameter optimization: the peak can be displaced left/right, and the slopes with which the LR reaches its maximum and returns to its minimum are hyperparameters (the CLR can even coincide with the blue line if that proves better for the optimization). LRmin/LRmax were estimated as described in the fast.ai library (an LR range test, updating the LR every iteration).
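To give a flavour of what I mean by a parameterised shape, here is a simplified, hypothetical piecewise-linear version (not the exact schedule I used; the peak position within the cycle is the kind of thing that goes into the hyperparameter search):

```python
# Simplified cyclical LR whose shape is parameterised: peak_frac controls where within
# the cycle the LR peaks, so the rise and fall slopes become tunable hyperparameters.
def cyclic_lr(iteration, lr_min, lr_max, cycle_len, peak_frac=0.3):
    t = (iteration % cycle_len) / cycle_len          # position within the cycle, in [0, 1)
    if t < peak_frac:
        return lr_min + (lr_max - lr_min) * t / peak_frac                    # rise to peak
    return lr_max - (lr_max - lr_min) * (t - peak_frac) / (1.0 - peak_frac)  # decay back

# Example: a few points of a 1000-iteration cycle between 1e-4 and 1e-2
print([round(cyclic_lr(i, 1e-4, 1e-2, 1000), 5) for i in (0, 300, 650, 999)])
```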

Now, I’ve run both SGD and Adam in this scheme, training for 25 epochs, on the assumption that the best result at that point would, with continued training, also give the best validation loss. My findings are summarized in the following two plots. Each line corresponds to an independent run with a different set of hyperparameters. I’ve used GPyOpt for hyperparameter tuning.
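The GPyOpt side is roughly the following (bounds and parameter names are illustrative, and `train_and_evaluate` is a placeholder that trains for 25 epochs and returns the negative of the validation score):

```python
import numpy as np
import GPyOpt

# Bayesian optimisation over the CLR parameters plus momentum / beta1.
domain = [
    {'name': 'lr_max',    'type': 'continuous', 'domain': (1e-4, 1e-1)},
    {'name': 'peak_frac', 'type': 'continuous', 'domain': (0.1, 0.9)},
    {'name': 'momentum',  'type': 'continuous', 'domain': (0.80, 0.99)},  # beta1 for Adam
]

def objective(x):
    # GPyOpt passes a 2D array of candidate points; return one objective value per row.
    return np.array([[train_and_evaluate(*row)] for row in x])

opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
opt.run_optimization(max_iter=20)
print(opt.x_opt, opt.fx_opt)        # best hyperparameters and objective value found
```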

[Figures training_32_batch_Adam and training_32_batch_SGD: validation curves for the Adam and SGD runs at batch size 32]

The vertical y-axis is something like average IoU over all classes (not exactly, but you get the idea; the higher the better). The hyperparameters I optimized for both SGD and Adam were the CLR parameters (common to both), plus momentum for SGD and beta1 for Adam. This is for a fixed batch size (32).
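By “something like average IoU” I mean a metric in the spirit of this simplified mean-IoU computation, not the exact score I plot:

```python
import numpy as np

# Per-class intersection-over-union, averaged over the classes present in the images.
def mean_iou(pred, target, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny example: a 2x3 segmentation map with 3 classes
pred   = np.array([[0, 1, 2], [0, 1, 1]])
target = np.array([[0, 1, 2], [0, 2, 1]])
print(mean_iou(pred, target, num_classes=3))     # ~0.72
```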

Adam performed better, giving a roughly 2% higher “score” (something like average IoU). So my understanding so far (not a conclusive result) is that the outcome of SGD vs Adam for a fixed batch size (no weight decay; I’m using data augmentation for regularization) depends on the dataset.

I am doing more tests on this; I’ll update this post if anything new appears.
edit: forgot to mention, the plots are for the validation-data “IoU”, i.e. data not seen by the model during training.