Adam vs SGD + LR schedules?


Dear all,

I just stumbled on some discussions and a paper today claiming that SGD yields better generalization than Adam/RMSProp etc., and I was wondering what everyone’s opinion on that is. What do you use? So far I’ve been using Adam, and I’ve noticed several times that restarting it (even with the LR kept fixed) improves performance. Usually I train models until convergence and then restart with a lower LR. This has served me well so far.

To sum up: 1) Have you done the test, Adam vs SGD with momentum, both finely tuned? 2) What is your preferred LR scheduler? I’ll do some tests over the next few days and report the results.
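
For reference when setting up such a comparison, here is a minimal sketch of the two update rules for a single scalar parameter. The function names and default hyperparameters are illustrative choices, not taken from the paper or from any particular library, and the momentum form shown (velocity accumulates raw gradients) is only one of a couple of common conventions.

```python
import math


def sgd_momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update for a scalar parameter (illustrative)."""
    # Velocity is a decaying sum of past gradients.
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

One thing the sketch makes visible: on the first step Adam moves the parameter by roughly `lr` whatever the gradient magnitude, whereas the SGD step scales with the gradient, so the two learning rates are not directly comparable and each needs its own tuning.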

I found this discussion very interesting. See also this.



Hi @feevos,

When working on models for CIFAR-10 I found a similar effect to the one suggested by the paper. The training loss with Adam was comparable to that of SGD with momentum, but the validation/test loss was slightly higher. With regard to the schedule, I also reduced my learning rate in steps (once the learning curve appeared to level out), usually by a factor of 10. I found that a learning rate warm-up (e.g. increasing linearly over 5 epochs) helped quite a lot, especially for large batch sizes, as suggested in this paper.
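
A rough sketch of that kind of schedule (linear warm-up, then drops by a factor of 10) could look like the following. The function name, the base rate, and the fixed `milestones` are hypothetical: in practice the drop points described above were picked by eye when the curve flattened, not fixed in advance.

```python
def lr_at_epoch(epoch, base_lr=0.1, warmup_epochs=5,
                milestones=(30, 60), factor=0.1):
    """Learning rate for a 0-indexed epoch: linear warm-up, then step decay.

    All default values are illustrative assumptions.
    """
    if epoch < warmup_epochs:
        # Ramp up linearly so the last warm-up epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Count how many drop points have been reached so far.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Print the rate at a few epochs around the warm-up and the drop points.
for epoch in (0, 4, 5, 29, 30, 60):
    print(epoch, lr_at_epoch(epoch))
```

The returned value would then be assigned to the optimizer at the start of each epoch, however the framework in use exposes its learning rate.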

Recently I came across an interesting method where the learning rate is increased and decreased cyclically (see here), and I’m just in the process of trying it out! Will let you know the results.
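
For anyone who wants to experiment in the meantime, one simple cyclical shape is a triangular wave between a lower and an upper bound. This is only a sketch: whether the linked method uses exactly this shape is an assumption, and the function name, bounds, and cycle length are placeholders.

```python
def triangular_lr(step, min_lr=1e-4, max_lr=1e-2, half_cycle=1000):
    """Triangular cyclical learning rate for a 0-indexed training step.

    The rate rises linearly from min_lr to max_lr over half_cycle steps,
    falls back over the next half_cycle steps, and then repeats.
    """
    # Position within the current full cycle, in [0, 2 * half_cycle).
    cycle_pos = step % (2 * half_cycle)
    # Fraction of the way from min_lr to max_lr: 0 at the ends, 1 at the peak.
    frac = 1 - abs(cycle_pos / half_cycle - 1)
    return min_lr + (max_lr - min_lr) * frac
```

The bounds matter more than the exact shape here: too high an upper bound can make training diverge at the peak of each cycle, so they are usually chosen from a short range test on the model in question.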