Hi, a quick update on this for anyone who is interested. I’ve done some tests on the Adam vs. SGD question for a semantic segmentation problem using the ISPRS Potsdam dataset. To cut a (rather long) story short, I am using a cyclical learning rate that looks like the following image (orange line):
where the peak, the frequency, and the geometric shape of the overall variation (blue line) are fine-tuned via hyperparameter optimization: the peak can be displaced left/right, the slopes controlling how fast the rate reaches the maximum and falls back to the minimum are hyperparameters, and the CLR can even coincide with the blue line if that proves better for the optimization. LR_min/LR_max were estimated as described in the fast.ai library (the learning rate is updated at every iteration, etc.).
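For concreteness, here is a minimal sketch of what such a parameterized schedule could look like for a single cycle (not my actual code; `peak_frac`, `rise_pow` and `fall_pow` are just illustrative names for the peak position and the two slope parameters):

```python
import numpy as np

def clr_schedule(step, total_steps, lr_min, lr_max,
                 peak_frac=0.3, rise_pow=1.0, fall_pow=1.0):
    """One cycle of a parameterized cyclical learning rate.

    peak_frac shifts the peak left/right within the cycle;
    rise_pow / fall_pow control how fast the LR climbs to lr_max
    and decays back to lr_min (1.0 = linear, >1 = slower start).
    All three are treated as tunable hyperparameters.
    """
    t = step / max(total_steps - 1, 1)            # progress in [0, 1]
    if t <= peak_frac:                            # rising phase
        frac = (t / peak_frac) ** rise_pow
    else:                                         # falling phase
        frac = (1.0 - (t - peak_frac) / (1.0 - peak_frac)) ** fall_pow
    return lr_min + (lr_max - lr_min) * frac

# e.g. the LR for every iteration of a 25-epoch run at 100 steps/epoch
lrs = [clr_schedule(s, 25 * 100, 1e-4, 1e-2) for s in range(25 * 100)]
```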
Now, I’ve run both SGD and Adam under this scheme, training for 25 epochs each, with the hope that the best configuration will continue to improve the validation loss if trained further. My findings are summarized in the following two plots. Each line corresponds to an independent run with a different set of hyperparameters. I’ve used GPyOpt for hyperparameter tuning.
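In case it is useful, this is roughly how such a search can be set up with GPyOpt (a sketch only: `train_and_evaluate` is a placeholder for the actual 25-epoch training run, and the parameter names and ranges are made up for illustration):

```python
import GPyOpt

# Hypothetical search space: CLR shape parameters shared by both optimizers,
# plus the optimizer-specific parameter (momentum for SGD, beta1 for Adam).
domain = [
    {'name': 'peak_frac', 'type': 'continuous', 'domain': (0.1, 0.9)},
    {'name': 'rise_pow',  'type': 'continuous', 'domain': (0.5, 3.0)},
    {'name': 'fall_pow',  'type': 'continuous', 'domain': (0.5, 3.0)},
    {'name': 'momentum',  'type': 'continuous', 'domain': (0.80, 0.99)},
]

def train_and_evaluate(**params):
    """Placeholder: train for 25 epochs with these settings and return the
    validation score. Replace with the real training loop."""
    raise NotImplementedError

def objective(x):
    # GPyOpt passes a 2D array and minimizes, so negate the score.
    params = dict(zip([d['name'] for d in domain], x[0]))
    return -train_and_evaluate(**params)

opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
opt.run_optimization(max_iter=20)
print(opt.x_opt, -opt.fx_opt)   # best hyperparameters and best score
```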
The y-axis is something like average IoU over all classes (not exactly, but you get the idea - the higher the better). The hyperparameters I optimized for both SGD and Adam were the CLR parameters (common to both); in addition, I optimized momentum for SGD and beta1 for Adam. This is for a fixed batch size (32).
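For reference, a plain per-class mean IoU (which the plotted score only approximates) can be computed on integer label maps like this:

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average IoU over classes; classes absent from both the prediction
    and the ground truth are skipped so they do not distort the mean."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```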
Adam performed better, giving a roughly 2% higher “score” (something like average IoU). So my understanding so far (not a conclusive result) is that the SGD vs. Adam question, for a fixed batch size (no weight decay; I’m using data augmentation for regularization), depends on the dataset.
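To make the two configurations explicit, the only optimizer-specific hyperparameter differs between the runs (PyTorch is assumed here purely for illustration; the framework is not the point):

```python
import torch

def make_optimizer(model, kind, lr, momentum=0.9, beta1=0.9):
    # Batch size is fixed at 32 elsewhere; no weight decay in either case.
    if kind == 'sgd':
        # momentum is the extra hyperparameter tuned for SGD
        return torch.optim.SGD(model.parameters(), lr=lr,
                               momentum=momentum, weight_decay=0.0)
    # beta1 is the extra hyperparameter tuned for Adam
    return torch.optim.Adam(model.parameters(), lr=lr,
                            betas=(beta1, 0.999), weight_decay=0.0)
```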
I am doing more tests on this; I’ll update this post if anything new comes up.
edit: forgot to mention, the plots show the “IoU” on validation data, which was not seen by the model during training.