How to choose learning rate for multi-node training


#1

Hi All,

I am using MXNet to train ResNet-50 for 100 epochs on the ImageNet 2012 dataset. When I use 8 nodes, each with 4 V100 GPUs, and the default learning rate of 0.1, training makes no progress: the top-1 train accuracy stays at ~0.1% and the top-5 train accuracy at ~0.5%. I also tried a larger learning rate of 0.4, but got the same result. For this run I set --kv-store=dist_device_sync.

Then I used 4 V100 GPUs within a single node, still with the default learning rate of 0.1. This time I got 89.45% top-1 train accuracy and 97.39% top-5 train accuracy. For this run I set --kv-store=device.

So how should I choose the learning rate for multi-node training? Has anyone run into the same issue and found a solution? Thanks.

Regards.
Rengan


#2

Hi, are you modifying the batch size when running on a single node? In other words, how many total data samples go into each update, and is that number the same as in distributed training?

If you do change Nbatch (e.g. by keeping all other parameters the same but reducing the number of nodes), it can affect the training process. See this paper on the ratio of learning_rate/batch_size: https://research.fb.com/publications/accurate-large-minibatch-sgd-training-imagenet-in-1-hour/
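
To make the learning_rate/batch_size ratio concrete, here is a minimal sketch of the linear scaling rule with gradual warmup that the linked paper describes: scale the learning rate in proportion to the global batch size, and ramp up to it linearly over the first 5 epochs. The per-GPU batch size of 32 is an assumption (the original post does not state it), and scaled_lr and lr_at_epoch are hypothetical helper names, not MXNet functions. Treat the resulting number as a starting point to tune, not a guaranteed fix.

```python
def scaled_lr(ref_lr, ref_batch, per_gpu_batch, gpus_per_node, num_nodes):
    """Linear scaling rule: keep learning_rate / global_batch_size constant."""
    global_batch = per_gpu_batch * gpus_per_node * num_nodes
    return ref_lr * global_batch / ref_batch, global_batch


def lr_at_epoch(epoch, ref_lr, target_lr, warmup_epochs=5):
    """Gradual warmup: ramp linearly from ref_lr to target_lr, then hold."""
    if epoch < warmup_epochs:
        return ref_lr + (target_lr - ref_lr) * epoch / warmup_epochs
    return target_lr


# Assumed per-GPU batch size; not given in the original post.
PER_GPU_BATCH = 32

# Reference: the single-node run (4 GPUs) that converged with lr = 0.1.
ref_lr = 0.1
ref_batch = PER_GPU_BATCH * 4

# Target: 8 nodes x 4 GPUs with the same per-GPU batch size.
target_lr, global_batch = scaled_lr(ref_lr, ref_batch, PER_GPU_BATCH, 4, 8)

if __name__ == "__main__":
    print("global batch:", global_batch, "target lr:", target_lr)
    for epoch in range(7):
        print(epoch, lr_at_epoch(epoch, ref_lr, target_lr))
```

With these assumptions the global batch grows 8x from one node to eight, so the scaled learning rate is 8x the single-node value, reached only after the warmup epochs. The step decays of the regular schedule would then apply on top of the scaled value.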