Hi, I’m facing an issue when training a denoising model with L1Loss. The larger the batch size I set, the slower the convergence I get. How can I deal with this problem?
The relation between convergence speed and batch size is an open research topic with plenty of debate. Take a look at this Twitter thread - https://twitter.com/ylecun/status/989610208497360896?lang=en Note that the author is Yann LeCun. There are multiple good papers recommended in that thread, and I haven’t read all of them, but this one looks quite interesting: https://arxiv.org/pdf/1711.00489.pdf
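A related heuristic from that line of work (an addition on my part, not something the thread itself prescribes) is the linear scaling rule: when you grow the batch size by a factor k, grow the learning rate by the same factor, since each step now averages gradients over more samples. A minimal sketch, with all numbers hypothetical:

```python
# Linear learning-rate scaling sketch. The base LR and batch sizes below
# are made-up examples, not values from this thread.

def scaled_lr(base_lr, base_batch_size, batch_size):
    """Scale the learning rate linearly with the batch size."""
    return base_lr * batch_size / base_batch_size

# Suppose the LR was tuned at batch size 64, and training moves to
# 8 GPUs with 64 images each (effective batch size 512):
lr = scaled_lr(1e-4, 64, 64 * 8)
print(lr)  # 8x the base learning rate
```

This rule tends to break down for very large batches, so treat it as a starting point for tuning rather than a guarantee.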
Thanks a lot! I’m already aware of this batch size problem. But with the same model and the same settings, my PyTorch version converges really fast; the only problems with that version are its speed and memory consumption. The MXNet version is much faster and uses less than half the memory, but its training phase seems stuck. When I test it with a small batch size and a small portion of my dataset, everything is normal.
Could you please send a minimal reproducible example to show the problem?
It turns out it was my own silly mistake. To test the training phase, I used a batch size of 64 for only 50 iterations, so the loss didn’t have a chance to decrease. When I trained on the whole dataset of 500k images on 8 V100 GPUs with a batch size of 64 x 8, the training phase progressed normally, as I expected.
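For anyone hitting the same confusion, some quick arithmetic on the numbers above shows why such a short test run can look "stuck": 50 iterations at batch size 64 touch only a tiny slice of a 500k-image dataset, far less than one epoch.

```python
# How much of the dataset does the short sanity run actually see?
# Numbers taken from the thread above.

dataset_size = 500_000   # total training images
batch_size = 64          # per-iteration batch in the quick test
iterations = 50          # length of the quick test

images_seen = batch_size * iterations            # 3,200 images
fraction_of_epoch = images_seen / dataset_size   # well under 1% of an epoch

print(images_seen, fraction_of_epoch)
```

A more reliable sanity check is to overfit a tiny fixed subset for many iterations; if the loss still refuses to drop there, the training loop itself is broken.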