Single-machine multi-GPU training, time is not speeding up

zerolxf · November 15, 2018, 10:58am

Hello everyone!

I recently started using MXNet for multi-GPU training and found that adding more GPUs does not speed training up.

I am running the example code from GitHub:
source

Here I changed its kv_store type to “device”, but running on eight GPUs is nowhere near twice as fast as running on four GPUs. I wonder why?

4 GPU run time
Epoch 0: Test_acc 0.528500 time 24.068163
Epoch 1: Test_acc 0.600200 time 18.431711
Epoch 2: Test_acc 0.622900 time 19.140585
Epoch 3: Test_acc 0.637000 time 18.778502
Epoch 4: Test_acc 0.670600 time 18.383272

8 GPU run time
Epoch 0: Test_acc 0.515800 time 22.551225
Epoch 1: Test_acc 0.574200 time 19.231086
Epoch 2: Test_acc 0.603300 time 16.836740
Epoch 3: Test_acc 0.557800 time 18.368619
Epoch 4: Test_acc 0.629300 time 17.656158

PS:
I now want to modify the all_reduce part of MXNet to run some experiments.

Can we distribute the data evenly across the 8 GPUs in advance, so that at each iteration every GPU randomly fetches data from its own subset?
How should the data sampler be implemented for this?
If I modify the model Parameter directly, as below, will it be slow? Is there a more elegant way to do this in MXNet?

# Divide the copy of this parameter on every device by the number of workers.
for ctx_param in param.list_data():
    ctx_param[:] = ctx_param / self.worker_num
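
For the sampler question above, here is a minimal sketch of one way to give each GPU a fixed, non-overlapping slice of the dataset and shuffle only within that slice. The class name `ShardedRandomSampler` and its parameters are hypothetical, and the strided partition is just one possible choice. It is plain Python with no MXNet dependency; it only follows the usual sampler protocol (an iterable of indices plus `__len__`).

```python
import random


class ShardedRandomSampler:
    """Hypothetical sampler: yields shuffled indices from one fixed shard only."""

    def __init__(self, dataset_len, num_shards, shard_id, seed=0):
        if not 0 <= shard_id < num_shards:
            raise ValueError("shard_id must be in [0, num_shards)")
        # Fixed strided partition: shard k owns indices k, k + num_shards, ...
        self._indices = list(range(shard_id, dataset_len, num_shards))
        # Separate generator per shard so the shards shuffle independently.
        self._rng = random.Random(seed + shard_id)

    def __iter__(self):
        # Reshuffle on every pass, but never leave the shard.
        order = list(self._indices)
        self._rng.shuffle(order)
        return iter(order)

    def __len__(self):
        return len(self._indices)


# One sampler per GPU for a 50,000-sample dataset split over 8 GPUs.
samplers = [ShardedRandomSampler(50000, 8, gpu_id) for gpu_id in range(8)]
shard_sizes = [len(s) for s in samplers]
```

Assuming the Gluon `DataLoader` accepts a custom `sampler` argument in your MXNet version, an object like this could be passed to one loader per GPU; that wiring is not shown here and should be checked against the documentation.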

zerolxf · November 15, 2018, 2:27pm

When I switched to a different example, I found that it did not speed up either.
The example is train_cifar10.py.

Run command and result:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1
INFO:root:Epoch[1] Batch [20] Speed: 1501.31 samples/sec accuracy=0.522693
INFO:root:Epoch[1] Batch [40] Speed: 1385.53 samples/sec accuracy=0.513281
INFO:root:Epoch[1] Batch [60] Speed: 1344.49 samples/sec accuracy=0.526953
INFO:root:Epoch[1] Batch [80] Speed: 1372.30 samples/sec accuracy=0.507031
INFO:root:Epoch[1] Batch [100] Speed: 1422.98 samples/sec accuracy=0.555078
INFO:root:Epoch[1] Batch [120] Speed: 1280.82 samples/sec accuracy=0.564844
INFO:root:Epoch[1] Batch [140] Speed: 1341.84 samples/sec accuracy=0.560547
INFO:root:Epoch[1] Batch [160] Speed: 1274.24 samples/sec accuracy=0.577344
INFO:root:Epoch[1] Batch [180] Speed: 1504.30 samples/sec accuracy=0.591406
INFO:root:Epoch[1] Batch [200] Speed: 1402.82 samples/sec accuracy=0.572656
INFO:root:Epoch[1] Batch [220] Speed: 1384.64 samples/sec accuracy=0.586328
INFO:root:Epoch[1] Batch [240] Speed: 1290.16 samples/sec accuracy=0.601172
INFO:root:Epoch[1] Batch [260] Speed: 1384.75 samples/sec accuracy=0.608203
INFO:root:Epoch[1] Batch [280] Speed: 1322.50 samples/sec accuracy=0.612500
INFO:root:Epoch[1] Batch [300] Speed: 1333.54 samples/sec accuracy=0.625391
INFO:root:Epoch[1] Batch [320] Speed: 1409.00 samples/sec accuracy=0.633203
INFO:root:Epoch[1] Batch [340] Speed: 1482.85 samples/sec accuracy=0.630469
INFO:root:Epoch[1] Batch [360] Speed: 1444.52 samples/sec accuracy=0.657813
INFO:root:Epoch[1] Batch [380] Speed: 1309.33 samples/sec accuracy=0.643359
INFO:root:Epoch[1] Train-accuracy=0.647656
INFO:root:Epoch[1] Time cost=36.435
INFO:root:Epoch[1] Validation-accuracy=0.635617

Run command and result:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 128 --gpus 0,1,2,3,4,5,6,7
INFO:root:Epoch[1] Batch [20] Speed: 1454.82 samples/sec accuracy=0.529018
INFO:root:Epoch[1] Batch [40] Speed: 1378.88 samples/sec accuracy=0.509766
INFO:root:Epoch[1] Batch [60] Speed: 1353.56 samples/sec accuracy=0.521094
INFO:root:Epoch[1] Batch [80] Speed: 1359.70 samples/sec accuracy=0.528906
INFO:root:Epoch[1] Batch [100] Speed: 1393.93 samples/sec accuracy=0.538672
INFO:root:Epoch[1] Batch [120] Speed: 1409.83 samples/sec accuracy=0.551562
INFO:root:Epoch[1] Batch [140] Speed: 1184.57 samples/sec accuracy=0.552344
INFO:root:Epoch[1] Batch [160] Speed: 1145.63 samples/sec accuracy=0.567187
INFO:root:Epoch[1] Batch [180] Speed: 1114.10 samples/sec accuracy=0.572656
INFO:root:Epoch[1] Batch [200] Speed: 1186.54 samples/sec accuracy=0.557031
INFO:root:Epoch[1] Batch [220] Speed: 1080.14 samples/sec accuracy=0.590234
INFO:root:Epoch[1] Batch [240] Speed: 1357.69 samples/sec accuracy=0.587109
INFO:root:Epoch[1] Batch [260] Speed: 1195.39 samples/sec accuracy=0.591016
INFO:root:Epoch[1] Batch [280] Speed: 1134.49 samples/sec accuracy=0.607422
INFO:root:Epoch[1] Batch [300] Speed: 1062.23 samples/sec accuracy=0.590625
INFO:root:Epoch[1] Batch [320] Speed: 1220.36 samples/sec accuracy=0.601562
INFO:root:Epoch[1] Batch [340] Speed: 1124.81 samples/sec accuracy=0.621484
INFO:root:Epoch[1] Batch [360] Speed: 1222.74 samples/sec accuracy=0.634375
INFO:root:Epoch[1] Batch [380] Speed: 1441.69 samples/sec accuracy=0.633984
INFO:root:Epoch[1] Train-accuracy=0.622656
INFO:root:Epoch[1] Time cost=40.206
INFO:root:Epoch[1] Validation-accuracy=0.619892

But when I run multi-GPU training with PyTorch, I do see a speedup.

ThomasDelteil · November 15, 2018, 8:26pm

You need to scale the batch size with the number of GPUs you are using. The --batch-size value is the total batch size across all your GPUs, not the per-GPU batch size. That explains the lack of speedup (actually a small slowdown): you are doing the same number of iterations with 8 GPUs as with 2 GPUs, and each GPU just gets a smaller slice of every batch. To compare head to head with your previous 2-GPU run, try running with --batch-size 512 and let me know how that goes. (PS: if you are doing this for actual training, don’t forget to increase your learning rate as well. Your updates will be more confident since they aggregate gradients from more examples, so you can increase your learning rate to learn faster.)

For me:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 256 --gpus 0,1
INFO:root:Epoch[0] Batch [0-20]     Speed: 3475.35 samples/sec	accuracy=0.163876
INFO:root:Epoch[0] Batch [20-40]	Speed: 3622.00 samples/sec	accuracy=0.275781
INFO:root:Epoch[0] Batch [40-60]	Speed: 3748.47 samples/sec	accuracy=0.339648
INFO:root:Epoch[0] Batch [60-80]	Speed: 3726.51 samples/sec	accuracy=0.375391
INFO:root:Epoch[0] Batch [80-100]	Speed: 3626.41 samples/sec	accuracy=0.398242
INFO:root:Epoch[0] Batch [100-120]	Speed: 3709.01 samples/sec	accuracy=0.409180
INFO:root:Epoch[0] Batch [120-140]	Speed: 3696.16 samples/sec	accuracy=0.435742
INFO:root:Epoch[0] Batch [140-160]	Speed: 3733.66 samples/sec	accuracy=0.458008
INFO:root:Epoch[0] Batch [160-180]	Speed: 3703.81 samples/sec	accuracy=0.477734

Doubling the number of GPUs and the batch size:

python train_cifar10.py --network resnet --num-layers 110 --batch-size 512 --gpus 0,1,2,3
INFO:root:Epoch[0] Batch [0-20]     Speed: 6259.34 samples/sec	accuracy=0.166667
INFO:root:Epoch[0] Batch [20-40]	Speed: 6329.66 samples/sec	accuracy=0.264355
INFO:root:Epoch[0] Batch [40-60]	Speed: 6358.90 samples/sec	accuracy=0.335254
INFO:root:Epoch[0] Batch [60-80]	Speed: 6174.45 samples/sec	accuracy=0.379980
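
The batch-size arithmetic behind this reply can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the CIFAR-10 training set size of 50,000 is the only outside fact used, and the linear learning-rate scaling shown is a common heuristic rather than anything the example script is known to apply. The base learning rate of 0.05 is likewise a placeholder.

```python
import math

CIFAR10_TRAIN_SIZE = 50000  # number of training images in CIFAR-10


def per_gpu_batch(total_batch_size, num_gpus):
    """Samples each GPU processes per iteration when --batch-size is the total."""
    return total_batch_size // num_gpus


def iterations_per_epoch(total_batch_size, dataset_size=CIFAR10_TRAIN_SIZE):
    """Number of parameter updates needed to see the whole dataset once."""
    return math.ceil(dataset_size / total_batch_size)


def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Heuristic (assumption): grow the learning rate in proportion to the batch."""
    return base_lr * new_batch_size / base_batch_size


for num_gpus, batch_size in [(2, 128), (8, 128), (8, 512)]:
    print(
        f"{num_gpus} GPUs, --batch-size {batch_size}: "
        f"{per_gpu_batch(batch_size, num_gpus)} samples per GPU, "
        f"{iterations_per_epoch(batch_size)} iterations per epoch"
    )

# Placeholder base learning rate of 0.05 at batch size 128, scaled to 512.
print("scaled lr:", linearly_scaled_lr(0.05, 128, 512))
```

With --batch-size 128, going from 2 to 8 GPUs leaves the iteration count at 391 per epoch and shrinks each GPU's share from 64 to 16 samples, so there is little work to parallelize. With --batch-size 512 on 8 GPUs, each GPU is back at 64 samples and the epoch takes 98 iterations.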

zerolxf · November 16, 2018, 2:07am

Thanks.
Increasing the batch size does speed up the second example, but in the first example I had already increased the batch size. Something is wrong, but I don’t know what.

ThomasDelteil · November 16, 2018, 4:44am

I am not sure what you mean, sorry. Can you rephrase?

zerolxf · November 16, 2018, 4:54am

Sorry.
Increasing the batch size speeds up the second example, train_cifar10.py (from /example/image-classification).
But it has no effect on the first example, cifar10_dist.py (from /example/distributed_training).
