I just stumbled on some discussions and a paper today claiming that SGD yields better generalization than Adam/RMSProp etc., and I was wondering what anyone's opinion on that is. What do you use? So far I've been using Adam, and I've noticed several times that restarting it (even with the LR kept fixed) improves performance. Usually I train the model until convergence, then restart with a decreased LR. That has served me well so far.
To sum up: 1) Have you run the comparison yourself — Adam vs. SGD with momentum, with both finely tuned? 2) What is your preferred LR scheduler? I'll run some tests over the next few days and report the results.
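For concreteness, here's roughly what I mean by "restart with a decreased LR", as a minimal pure-Python sketch on a toy 1-D quadratic (no framework, hand-rolled Adam and SGD-with-momentum updates; all hyperparameters and function names here are just illustrative, not from any paper):

```python
import math

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

def sgd_momentum(w, lr=0.1, mu=0.9, steps=200):
    # classic heavy-ball SGD: velocity accumulates past gradients
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)
        w += v
    return w

def adam(w, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    # standard Adam update with bias correction
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# the "restart" schedule from the post: run Adam to (rough) convergence,
# then restart the optimizer state with a decayed LR and continue
w = 0.0
for restart in range(3):
    w = adam(w, lr=0.1 * (0.1 ** restart), steps=200)
```

The key point of the restart is that Adam's moment estimates (`m`, `v`) are reinitialized each round, which is exactly what re-creating the optimizer does in a real framework; whether that alone (LR fixed) or the LR decay is doing the work is part of what I want to test.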