Single-machine multi-GPU training is not getting faster


#1

Hello everyone!

I recently learned how to do multi-GPU training with MXNet and found that adding more GPUs does not speed training up.

I am running this example from GitHub:
source

I changed its kv_store type to “device”, but running on eight GPUs is nowhere near twice as fast as running on four. Why is that?

4 GPU run time
Epoch 0: Test_acc 0.528500 time 24.068163
Epoch 1: Test_acc 0.600200 time 18.431711
Epoch 2: Test_acc 0.622900 time 19.140585
Epoch 3: Test_acc 0.637000 time 18.778502
Epoch 4: Test_acc 0.670600 time 18.383272

8 GPU run time
Epoch 0: Test_acc 0.515800 time 22.551225
Epoch 1: Test_acc 0.574200 time 19.231086
Epoch 2: Test_acc 0.603300 time 16.836740
Epoch 3: Test_acc 0.557800 time 18.368619
Epoch 4: Test_acc 0.629300 time 17.656158

P.S.
I now want to modify the all_reduce part of MXNet to run some experiments.

  1. Can I distribute the data evenly across the 8 GPUs in advance, so that at each iteration every GPU randomly draws from its own subset?
    How should I implement the data sampler for this?

  2. If I modify the model Parameter directly, as below, will it be slow? Is there a more elegant way to do this in MXNet?

# param.list_data() returns one copy of the parameter per device
for ctx_param in param.list_data():
    ctx_param[:] = ctx_param / self.worker_num
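
For question 1, one possible shape for such a sampler is sketched below in plain Python. ShardedRandomSampler is a hypothetical name, not an MXNet class. It only provides `__iter__` and `__len__`, which is the interface a Gluon sampler is expected to have; wiring one instance per GPU into its own DataLoader through the `sampler` argument is an assumption here and has not been tested against MXNet.

```python
import random


class ShardedRandomSampler:
    """Yields random indices drawn only from one worker's fixed shard.

    Hypothetical helper for illustration; not part of MXNet.
    """

    def __init__(self, dataset_len, num_shards, shard_id, seed=0):
        if not 0 <= shard_id < num_shards:
            raise ValueError("shard_id must be in [0, num_shards)")
        # Strided split: shard k owns indices k, k + num_shards, k + 2 * num_shards, ...
        self._indices = list(range(shard_id, dataset_len, num_shards))
        # Separate random stream per shard so workers do not shuffle identically.
        self._rng = random.Random(seed + shard_id)

    def __iter__(self):
        # Reshuffle this shard's own indices on every pass (epoch).
        order = self._indices[:]
        self._rng.shuffle(order)
        return iter(order)

    def __len__(self):
        return len(self._indices)


# One sampler per GPU; 50000 is the CIFAR-10 training set size.
samplers = [ShardedRandomSampler(50000, num_shards=8, shard_id=k) for k in range(8)]
```

With this split every GPU keeps the same 6250 indices for the whole run and only the order changes between epochs.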

#2

When I switched to a different example, it did not speed up either.
The example is train_cifar10.py.

Command and output with 2 GPUs:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1
INFO:root:Epoch[1] Batch [20] Speed: 1501.31 samples/sec accuracy=0.522693
INFO:root:Epoch[1] Batch [40] Speed: 1385.53 samples/sec accuracy=0.513281
INFO:root:Epoch[1] Batch [60] Speed: 1344.49 samples/sec accuracy=0.526953
INFO:root:Epoch[1] Batch [80] Speed: 1372.30 samples/sec accuracy=0.507031
INFO:root:Epoch[1] Batch [100] Speed: 1422.98 samples/sec accuracy=0.555078
INFO:root:Epoch[1] Batch [120] Speed: 1280.82 samples/sec accuracy=0.564844
INFO:root:Epoch[1] Batch [140] Speed: 1341.84 samples/sec accuracy=0.560547
INFO:root:Epoch[1] Batch [160] Speed: 1274.24 samples/sec accuracy=0.577344
INFO:root:Epoch[1] Batch [180] Speed: 1504.30 samples/sec accuracy=0.591406
INFO:root:Epoch[1] Batch [200] Speed: 1402.82 samples/sec accuracy=0.572656
INFO:root:Epoch[1] Batch [220] Speed: 1384.64 samples/sec accuracy=0.586328
INFO:root:Epoch[1] Batch [240] Speed: 1290.16 samples/sec accuracy=0.601172
INFO:root:Epoch[1] Batch [260] Speed: 1384.75 samples/sec accuracy=0.608203
INFO:root:Epoch[1] Batch [280] Speed: 1322.50 samples/sec accuracy=0.612500
INFO:root:Epoch[1] Batch [300] Speed: 1333.54 samples/sec accuracy=0.625391
INFO:root:Epoch[1] Batch [320] Speed: 1409.00 samples/sec accuracy=0.633203
INFO:root:Epoch[1] Batch [340] Speed: 1482.85 samples/sec accuracy=0.630469
INFO:root:Epoch[1] Batch [360] Speed: 1444.52 samples/sec accuracy=0.657813
INFO:root:Epoch[1] Batch [380] Speed: 1309.33 samples/sec accuracy=0.643359
INFO:root:Epoch[1] Train-accuracy=0.647656
INFO:root:Epoch[1] Time cost=36.435
INFO:root:Epoch[1] Validation-accuracy=0.635617

Command and output with 8 GPUs:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3,4,5,6,7
INFO:root:Epoch[1] Batch [20] Speed: 1454.82 samples/sec accuracy=0.529018
INFO:root:Epoch[1] Batch [40] Speed: 1378.88 samples/sec accuracy=0.509766
INFO:root:Epoch[1] Batch [60] Speed: 1353.56 samples/sec accuracy=0.521094
INFO:root:Epoch[1] Batch [80] Speed: 1359.70 samples/sec accuracy=0.528906
INFO:root:Epoch[1] Batch [100] Speed: 1393.93 samples/sec accuracy=0.538672
INFO:root:Epoch[1] Batch [120] Speed: 1409.83 samples/sec accuracy=0.551562
INFO:root:Epoch[1] Batch [140] Speed: 1184.57 samples/sec accuracy=0.552344
INFO:root:Epoch[1] Batch [160] Speed: 1145.63 samples/sec accuracy=0.567187
INFO:root:Epoch[1] Batch [180] Speed: 1114.10 samples/sec accuracy=0.572656
INFO:root:Epoch[1] Batch [200] Speed: 1186.54 samples/sec accuracy=0.557031
INFO:root:Epoch[1] Batch [220] Speed: 1080.14 samples/sec accuracy=0.590234
INFO:root:Epoch[1] Batch [240] Speed: 1357.69 samples/sec accuracy=0.587109
INFO:root:Epoch[1] Batch [260] Speed: 1195.39 samples/sec accuracy=0.591016
INFO:root:Epoch[1] Batch [280] Speed: 1134.49 samples/sec accuracy=0.607422
INFO:root:Epoch[1] Batch [300] Speed: 1062.23 samples/sec accuracy=0.590625
INFO:root:Epoch[1] Batch [320] Speed: 1220.36 samples/sec accuracy=0.601562
INFO:root:Epoch[1] Batch [340] Speed: 1124.81 samples/sec accuracy=0.621484
INFO:root:Epoch[1] Batch [360] Speed: 1222.74 samples/sec accuracy=0.634375
INFO:root:Epoch[1] Batch [380] Speed: 1441.69 samples/sec accuracy=0.633984
INFO:root:Epoch[1] Train-accuracy=0.622656
INFO:root:Epoch[1] Time cost=40.206
INFO:root:Epoch[1] Validation-accuracy=0.619892

But when I run multi-GPU training with PyTorch, I do see a speedup.


#3

You need to scale the batch size with the number of GPUs you are using. --batch-size is the total batch size across all GPUs, not the per-GPU batch size. That is why you see no speedup (and actually a small slowdown): you are doing the same number of iterations with 8 GPUs as with 2 GPUs, just with less work per GPU in each one. To compare head to head with your previous 2-GPU run, try running with --batch-size 512 and let me know how that goes.

PS: if you are doing actual training this way, don’t forget to increase your learning rate as well. Your updates will be more confident because each one aggregates gradients from more examples, so you can raise the learning rate to learn faster.
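
To make the arithmetic concrete, here is a small sketch. It assumes the total batch is split evenly across the GPUs and that the CIFAR-10 training set has 50000 images; the two helper functions are made up for illustration and are not part of train_cifar10.py.

```python
def per_gpu_batch(total_batch_size, num_gpus):
    """Samples each GPU processes per iteration when the total batch is split evenly."""
    return total_batch_size // num_gpus


def iterations_per_epoch(num_samples, total_batch_size):
    """Number of parameter updates needed to see the whole dataset once."""
    return -(-num_samples // total_batch_size)  # ceiling division


NUM_TRAIN = 50000  # CIFAR-10 training set size

for gpus, batch in [(2, 128), (8, 128), (8, 512)]:
    print(f"{gpus} GPUs, --batch-size {batch}: "
          f"{per_gpu_batch(batch, gpus)} samples per GPU per iteration, "
          f"{iterations_per_epoch(NUM_TRAIN, batch)} iterations per epoch")
```

With --batch-size 128 the 8-GPU run still needs 391 iterations per epoch, but each GPU only gets 16 samples at a time. With --batch-size 512 each GPU gets 64 samples again, as in the 2-GPU run, and the epoch shrinks to 98 iterations.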

For me:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 256 --gpus 0,1
INFO:root:Epoch[0] Batch [0-20]     Speed: 3475.35 samples/sec	accuracy=0.163876
INFO:root:Epoch[0] Batch [20-40]	Speed: 3622.00 samples/sec	accuracy=0.275781
INFO:root:Epoch[0] Batch [40-60]	Speed: 3748.47 samples/sec	accuracy=0.339648
INFO:root:Epoch[0] Batch [60-80]	Speed: 3726.51 samples/sec	accuracy=0.375391
INFO:root:Epoch[0] Batch [80-100]	Speed: 3626.41 samples/sec	accuracy=0.398242
INFO:root:Epoch[0] Batch [100-120]	Speed: 3709.01 samples/sec	accuracy=0.409180
INFO:root:Epoch[0] Batch [120-140]	Speed: 3696.16 samples/sec	accuracy=0.435742
INFO:root:Epoch[0] Batch [140-160]	Speed: 3733.66 samples/sec	accuracy=0.458008
INFO:root:Epoch[0] Batch [160-180]	Speed: 3703.81 samples/sec	accuracy=0.477734

Doubling the number of GPUs and the batch size:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 512 --gpus 0,1,2,3
INFO:root:Epoch[0] Batch [0-20]     Speed: 6259.34 samples/sec	accuracy=0.166667
INFO:root:Epoch[0] Batch [20-40]	Speed: 6329.66 samples/sec	accuracy=0.264355
INFO:root:Epoch[0] Batch [40-60]	Speed: 6358.90 samples/sec	accuracy=0.335254
INFO:root:Epoch[0] Batch [60-80]	Speed: 6174.45 samples/sec	accuracy=0.379980
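
A quick way to put a number on the scaling is to average the Speed values from the two logs above. This is only a rough sketch: it uses the few batches shown here, so treat the result as approximate.

```python
from statistics import mean

# Speeds (samples/sec) copied from the two logs above.
speeds_2gpu = [3475.35, 3622.00, 3748.47, 3726.51, 3626.41,
               3709.01, 3696.16, 3733.66, 3703.81]
speeds_4gpu = [6259.34, 6329.66, 6358.90, 6174.45]

# Throughput ratio between the 4-GPU and the 2-GPU run.
speedup = mean(speeds_4gpu) / mean(speeds_2gpu)
# Fraction of the ideal 2x gained by doubling the GPU count.
efficiency = speedup / (4 / 2)

print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

That comes out at roughly 1.7x the throughput for twice the GPUs, so not perfectly linear, but a clear speedup.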

#4

Thanks.
Increasing batch_size does speed up the second example, but in the first one I had already increased batch_size. Something is wrong, but I don’t know what.


#5

Sorry, I am not sure what you mean. Can you rephrase?


#6

Sorry.
Increasing batch_size speeds up the second example, train_cifar10.py (from /example/image-classification).
But it has no effect on the first example, cifar10_dist.py (from /example/distributed_training).