Understanding MXNet multi-GPU performance


#1

So, I'm trying to understand how distributed training in MXNet really works (in order to optimize my code):

  • I understand that Mu Li's paper is a good reference, but I couldn't find an answer to this question:

I'm training two networks that differ by a single hyperparameter (a regularization coefficient). The first one is training on the first 8 GPUs, and the second one on all 16 GPUs. The samples-per-second numbers are in the same ballpark, but the second one is clearly slower:

First network:

INFO:root:Epoch[37] Batch [30256]	Speed: 197771.84 samples/sec	QLSumMetric=729.808672
INFO:root:Epoch[37] Batch [30744]	Speed: 206046.96 samples/sec	QLSumMetric=729.015916
INFO:root:Epoch[37] Batch [31232]	Speed: 178224.72 samples/sec	QLSumMetric=727.661732
INFO:root:Epoch[37] Batch [31720]	Speed: 167553.20 samples/sec	QLSumMetric=724.259217
INFO:root:Epoch[37] Batch [32208]	Speed: 208376.83 samples/sec	QLSumMetric=732.663074
INFO:root:Epoch[37] Batch [32696]	Speed: 206057.11 samples/sec	QLSumMetric=716.422353
INFO:root:Epoch[37] Batch [33184]	Speed: 187960.58 samples/sec	QLSumMetric=733.493551
INFO:root:Epoch[37] Batch [33672]	Speed: 208166.71 samples/sec	QLSumMetric=733.055588
INFO:root:Epoch[37] Batch [34160]	Speed: 192618.61 samples/sec	QLSumMetric=723.640843

Second network:

INFO:root:Epoch[13] Batch [11224]	Speed: 101530.47 samples/sec	QLSumMetric=736.330289
INFO:root:Epoch[13] Batch [11712]	Speed: 105602.55 samples/sec	QLSumMetric=734.239894
INFO:root:Epoch[13] Batch [12200]	Speed: 104586.31 samples/sec	QLSumMetric=744.775742
INFO:root:Epoch[13] Batch [12688]	Speed: 107612.01 samples/sec	QLSumMetric=743.912667
INFO:root:Epoch[13] Batch [13176]	Speed: 106278.79 samples/sec	QLSumMetric=738.141423
INFO:root:Epoch[13] Batch [13664]	Speed: 105420.35 samples/sec	QLSumMetric=736.574530
INFO:root:Epoch[13] Batch [14152]	Speed: 101914.44 samples/sec	QLSumMetric=741.377023
INFO:root:Epoch[13] Batch [14640]	Speed: 106754.43 samples/sec	QLSumMetric=744.095754
INFO:root:Epoch[13] Batch [15128]	Speed: 106558.59 samples/sec	QLSumMetric=744.974015
INFO:root:Epoch[13] Batch [15616]	Speed: 104182.89 samples/sec	QLSumMetric=736.113241

The second network runs at about half the speed (the speeds themselves are fine for my purposes; I just want to understand why). gpustat output is below:

[0] Tesla K80        | 65'C,  52 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[1] Tesla K80        | 56'C,  52 % |   321 / 11439 MB | ubuntu(161M) ubuntu(153M)
[2] Tesla K80        | 76'C,  54 % |   328 / 11439 MB | ubuntu(168M) ubuntu(153M)
[3] Tesla K80        | 63'C,  56 % |   330 / 11439 MB | ubuntu(170M) ubuntu(153M)
[4] Tesla K80        | 67'C,  55 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[5] Tesla K80        | 54'C,  55 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[6] Tesla K80        | 69'C,  49 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[7] Tesla K80        | 58'C,  51 % |   326 / 11439 MB | ubuntu(157M) ubuntu(162M)
[8] Tesla K80        | 56'C,  33 % |   176 / 11439 MB | ubuntu(172M)
[9] Tesla K80        | 47'C,  37 % |   179 / 11439 MB | ubuntu(175M)
[10] Tesla K80        | 63'C,  15 % |   157 / 11439 MB | ubuntu(153M)
[11] Tesla K80        | 52'C,  16 % |   157 / 11439 MB | ubuntu(153M)
[12] Tesla K80        | 59'C,  16 % |   159 / 11439 MB | ubuntu(153M)
[13] Tesla K80        | 49'C,  16 % |   157 / 11439 MB | ubuntu(153M)
[14] Tesla K80        | 64'C,  15 % |   157 / 11439 MB | ubuntu(153M)
[15] Tesla K80        | 54'C,  16 % |   157 / 11439 MB | ubuntu(153M)
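
For reference, here is a simplified sketch of how each job is launched (the real network, data pipeline and the custom QLSumMetric are omitted; the synthetic data, the toy symbol and the 'wd' optimizer parameter below are just stand-ins). Apart from the regularization coefficient, the only difference between the two jobs is the context list:

import mxnet as mx
import numpy as np

batch_size = 2048

# Synthetic stand-in data; the real iterator is omitted.
X = np.random.uniform(size=(batch_size * 8, 215)).astype('float32')
y = np.random.uniform(size=(batch_size * 8,)).astype('float32')
train_iter = mx.io.NDArrayIter(X, y, batch_size=batch_size, shuffle=True)

# Toy stand-in symbol; the real network is omitted.
data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')        # NDArrayIter's default label name
fc1 = mx.sym.FullyConnected(data, num_hidden=512)
out = mx.sym.FullyConnected(fc1, num_hidden=1)
net = mx.sym.LinearRegressionOutput(out, label=label)

ctx = [mx.gpu(i) for i in range(8)]             # second job: range(16)

mod = mx.mod.Module(symbol=net, context=ctx)
mod.fit(train_iter,
        optimizer='sgd',
        # 'wd' stands in for the regularization coefficient that differs
        # between the two jobs.
        optimizer_params={'learning_rate': 0.01, 'wd': 1e-4},
        eval_metric='mse',
        num_epoch=2,
        # Speedometer prints the "Speed: N samples/sec" lines shown above;
        # the 488-batch interval matches the logs.
        batch_end_callback=mx.callback.Speedometer(batch_size, 488))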

#2

What's the batch size of the first and second networks, and what's your model?


#3

200K samples per second suggests your network is way too small to be trained on multiple GPUs.

You need a network at least as big as AlexNet to benefit from multi-GPU training.


#4

Both networks have a batch size of 2048; it's a 215-input × 512 × 512 × 570 MLP.
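
In MXNet symbol terms the network looks roughly like the sketch below (the ReLU activations and the regression output are placeholders so the symbol is complete; only the 215 → 512 → 512 → 570 layer sizes are the point):

import mxnet as mx

data = mx.sym.Variable('data')                                 # (batch, 215) input
h1 = mx.sym.Activation(mx.sym.FullyConnected(data, num_hidden=512), act_type='relu')
h2 = mx.sym.Activation(mx.sym.FullyConnected(h1, num_hidden=512), act_type='relu')
out = mx.sym.FullyConnected(h2, num_hidden=570)
net = mx.sym.LinearRegressionOutput(out)                       # placeholder loss

That's roughly 215·512 + 512·512 + 512·570 ≈ 0.67M parameters, i.e. under 3 MB in fp32, which fits the "too small for multi-GPU" point made above.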


#5
  1. If 8 GPUs use a batch size of 2048, then with 16 GPUs you should use 4096 to compare performance fairly (see the arithmetic below).
  2. As @piiswrong said, your model is too small to test multi-GPU performance.
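
To make point 1 concrete (MXNet's data-parallel Module splits each global batch evenly across the devices in its context list), the arithmetic is:

global_batch = 2048

per_gpu_8  = global_batch // 8     # 256 samples per GPU per step
per_gpu_16 = global_batch // 16    # 128 samples per GPU per step

# With 16 GPUs each device does half the compute per step, while the per-step
# gradient synchronization now spans twice as many devices, so overall
# samples/sec can easily drop. Keeping 256 samples per GPU on 16 devices
# means a global batch of:
global_batch_16 = 256 * 16         # 4096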

#6

Just rephrasing this: with such high samples/sec throughput, you're really pushing the limits of the Python frontend. Try the same setup on a problem where the GPUs have some real work to do (rather than just shuffling data to and from the GPU). Also note that the PCI Express bus behaves a bit differently on the p2.8xlarge and p2.16xlarge, since in the latter case all 16 GPUs are sharing one CPU. This might also have an influence, but the main issue is that your problem is 'too simple'.
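
If you want a rough feel for how much of this is the Python frontend and the data path, one check (just a sketch under my own assumptions, with made-up shapes and a stand-in network, not tuned for your setup) is to feed the same pre-staged batch over and over, so no iterator or fresh host-to-device copies are involved, and time pure forward/backward/update:

import time
import mxnet as mx

batch_size, num_gpus = 2048, 16
ctx = [mx.gpu(i) for i in range(num_gpus)]

# Stand-in network; swap in the real symbol here.
data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
fc1 = mx.sym.FullyConnected(data, num_hidden=512)
out = mx.sym.FullyConnected(fc1, num_hidden=1)
net = mx.sym.LinearRegressionOutput(out, label=label)

mod = mx.mod.Module(net, data_names=['data'], label_names=['label'], context=ctx)
mod.bind(data_shapes=[('data', (batch_size, 215))],
         label_shapes=[('label', (batch_size,))])
mod.init_params()
mod.init_optimizer(optimizer='sgd')

# One fixed batch, reused every iteration.
batch = mx.io.DataBatch(data=[mx.nd.ones((batch_size, 215))],
                        label=[mx.nd.ones((batch_size,))])

n_batches = 200
tic = time.time()
for _ in range(n_batches):
    mod.forward(batch, is_train=True)
    mod.backward()
    mod.update()
mx.nd.waitall()          # block until the async engine has finished
print('%.0f samples/sec' % (n_batches * batch_size / (time.time() - tic)))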


#7

Understood. Though I did see better performance with multiple GPUs than with a single one, which caps out at 110-115K samples/second (vs. ~200K for 8 GPUs).

Thanks for the responses and help! This forum is a great idea!