Understanding MXNet multi-GPU performance


#1

So, I'm trying to understand how distributed training in MXNet really works (in order to optimize my code):

  • I understand that Mu Li's paper is a good reference, but I couldn't find an answer to this question:

I'm training two networks that differ by a single hyperparameter (a regularization coefficient). The first one is training on the first 8 GPUs, and the second one on all 16 GPUs. The samples-per-second numbers are in the same ballpark, but the second one is clearly slower:

First network:

INFO:root:Epoch[37] Batch [30256]	Speed: 197771.84 samples/sec	QLSumMetric=729.808672
INFO:root:Epoch[37] Batch [30744]	Speed: 206046.96 samples/sec	QLSumMetric=729.015916
INFO:root:Epoch[37] Batch [31232]	Speed: 178224.72 samples/sec	QLSumMetric=727.661732
INFO:root:Epoch[37] Batch [31720]	Speed: 167553.20 samples/sec	QLSumMetric=724.259217
INFO:root:Epoch[37] Batch [32208]	Speed: 208376.83 samples/sec	QLSumMetric=732.663074
INFO:root:Epoch[37] Batch [32696]	Speed: 206057.11 samples/sec	QLSumMetric=716.422353
INFO:root:Epoch[37] Batch [33184]	Speed: 187960.58 samples/sec	QLSumMetric=733.493551
INFO:root:Epoch[37] Batch [33672]	Speed: 208166.71 samples/sec	QLSumMetric=733.055588
INFO:root:Epoch[37] Batch [34160]	Speed: 192618.61 samples/sec	QLSumMetric=723.640843

Second network:

INFO:root:Epoch[13] Batch [11224]	Speed: 101530.47 samples/sec	QLSumMetric=736.330289
INFO:root:Epoch[13] Batch [11712]	Speed: 105602.55 samples/sec	QLSumMetric=734.239894
INFO:root:Epoch[13] Batch [12200]	Speed: 104586.31 samples/sec	QLSumMetric=744.775742
INFO:root:Epoch[13] Batch [12688]	Speed: 107612.01 samples/sec	QLSumMetric=743.912667
INFO:root:Epoch[13] Batch [13176]	Speed: 106278.79 samples/sec	QLSumMetric=738.141423
INFO:root:Epoch[13] Batch [13664]	Speed: 105420.35 samples/sec	QLSumMetric=736.574530
INFO:root:Epoch[13] Batch [14152]	Speed: 101914.44 samples/sec	QLSumMetric=741.377023
INFO:root:Epoch[13] Batch [14640]	Speed: 106754.43 samples/sec	QLSumMetric=744.095754
INFO:root:Epoch[13] Batch [15128]	Speed: 106558.59 samples/sec	QLSumMetric=744.974015
INFO:root:Epoch[13] Batch [15616]	Speed: 104182.89 samples/sec	QLSumMetric=736.113241

The second network runs at about half the speed (the speeds themselves are fine for my purposes; I just want to understand why). gpustat output is below:

[0] Tesla K80        | 65'C,  52 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[1] Tesla K80        | 56'C,  52 % |   321 / 11439 MB | ubuntu(161M) ubuntu(153M)
[2] Tesla K80        | 76'C,  54 % |   328 / 11439 MB | ubuntu(168M) ubuntu(153M)
[3] Tesla K80        | 63'C,  56 % |   330 / 11439 MB | ubuntu(170M) ubuntu(153M)
[4] Tesla K80        | 67'C,  55 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[5] Tesla K80        | 54'C,  55 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[6] Tesla K80        | 69'C,  49 % |   317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[7] Tesla K80        | 58'C,  51 % |   326 / 11439 MB | ubuntu(157M) ubuntu(162M)
[8] Tesla K80        | 56'C,  33 % |   176 / 11439 MB | ubuntu(172M)
[9] Tesla K80        | 47'C,  37 % |   179 / 11439 MB | ubuntu(175M)
[10] Tesla K80        | 63'C,  15 % |   157 / 11439 MB | ubuntu(153M)
[11] Tesla K80        | 52'C,  16 % |   157 / 11439 MB | ubuntu(153M)
[12] Tesla K80        | 59'C,  16 % |   159 / 11439 MB | ubuntu(153M)
[13] Tesla K80        | 49'C,  16 % |   157 / 11439 MB | ubuntu(153M)
[14] Tesla K80        | 64'C,  15 % |   157 / 11439 MB | ubuntu(153M)
[15] Tesla K80        | 54'C,  16 % |   157 / 11439 MB | ubuntu(153M)
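
For reference, here is a simplified sketch of how each job is launched (the real network, data pipeline and the custom QLSumMetric are omitted; the synthetic data, the toy symbol and the 'wd' optimizer parameter below are just stand-ins). Apart from the regularization coefficient, the only difference between the two jobs is the context list:

import mxnet as mx
import numpy as np

batch_size = 2048

# Synthetic stand-in data; the real iterator is omitted.
X = np.random.uniform(size=(batch_size * 8, 215)).astype('float32')
y = np.random.uniform(size=(batch_size * 8,)).astype('float32')
train_iter = mx.io.NDArrayIter(X, y, batch_size=batch_size, shuffle=True)

# Toy stand-in symbol; the real network is omitted.
data = mx.sym.Variable('data')
label = mx.sym.Variable('softmax_label')        # NDArrayIter's default label name
fc1 = mx.sym.FullyConnected(data, num_hidden=512)
out = mx.sym.FullyConnected(fc1, num_hidden=1)
net = mx.sym.LinearRegressionOutput(out, label=label)

ctx = [mx.gpu(i) for i in range(8)]             # second job: range(16)

mod = mx.mod.Module(symbol=net, context=ctx)
mod.fit(train_iter,
        optimizer='sgd',
        # 'wd' stands in for the regularization coefficient that differs
        # between the two jobs.
        optimizer_params={'learning_rate': 0.01, 'wd': 1e-4},
        eval_metric='mse',
        num_epoch=2,
        # Speedometer prints the "Speed: N samples/sec" lines shown above;
        # the 488-batch interval matches the logs.
        batch_end_callback=mx.callback.Speedometer(batch_size, 488))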

#2

What's the batch size of the first and second networks, and what's your model?


#3

200K samples per second suggests your network is way too small to be trained on multiple GPUs.

You need a network at least as big as AlexNet to benefit from multi-GPU training.


#4

Both networks have a batch size of 2048; it's a 215-input × 512 × 512 × 570 MLP.
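
In MXNet symbol terms the network looks roughly like the sketch below (the ReLU activations and the regression output are placeholders so the symbol is complete; only the 215 → 512 → 512 → 570 layer sizes are the point):

import mxnet as mx

data = mx.sym.Variable('data')                                 # (batch, 215) input
h1 = mx.sym.Activation(mx.sym.FullyConnected(data, num_hidden=512), act_type='relu')
h2 = mx.sym.Activation(mx.sym.FullyConnected(h1, num_hidden=512), act_type='relu')
out = mx.sym.FullyConnected(h2, num_hidden=570)
net = mx.sym.LinearRegressionOutput(out)                       # placeholder loss

That's roughly 215·512 + 512·512 + 512·570 ≈ 0.67M parameters, i.e. under 3 MB in fp32, which fits the "too small for multi-GPU" point made above.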


#5
  1. If 8 GPUs use a batch size of 2048, then with 16 GPUs you should use 4096 to compare performance fairly (see the arithmetic below).
  2. As @piiswrong said, your model is too small to test multi-GPU performance.
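
To make point 1 concrete (MXNet's data-parallel Module splits each global batch evenly across the devices in its context list), the arithmetic is:

global_batch = 2048

per_gpu_8  = global_batch // 8     # 256 samples per GPU per step
per_gpu_16 = global_batch // 16    # 128 samples per GPU per step

# With 16 GPUs each device does half the compute per step, while the per-step
# gradient synchronization now spans twice as many devices, so overall
# samples/sec can easily drop. Keeping 256 samples per GPU on 16 devices
# means a global batch of:
global_batch_16 = 256 * 16         # 4096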

#6

Just rephrasing this: with such high samples/sec throughput, you're really pushing the limits of the Python frontend. Try the same setup on a problem where the GPUs have some real work to do (rather than just shuffling data to and from the GPU). Also note that the PCI Express bus behaves a bit differently on the p2.8xlarge and p2.16xlarge, since in the latter case all 16 GPUs are sharing one CPU. This might also have an influence, but the main issue is that your problem is 'too simple'.
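
If you want a rough feel for how much of this is the Python frontend and the data path, one check (just a sketch under my own assumptions, with made-up shapes and a stand-in network, not tuned for your setup) is to feed the same pre-staged batch over and over, so no iterator or fresh host-to-device copies are involved, and time pure forward/backward/update:

import time
import mxnet as mx

batch_size, num_gpus = 2048, 16
ctx = [mx.gpu(i) for i in range(num_gpus)]

# Stand-in network; swap in the real symbol here.
data = mx.sym.Variable('data')
label = mx.sym.Variable('label')
fc1 = mx.sym.FullyConnected(data, num_hidden=512)
out = mx.sym.FullyConnected(fc1, num_hidden=1)
net = mx.sym.LinearRegressionOutput(out, label=label)

mod = mx.mod.Module(net, data_names=['data'], label_names=['label'], context=ctx)
mod.bind(data_shapes=[('data', (batch_size, 215))],
         label_shapes=[('label', (batch_size,))])
mod.init_params()
mod.init_optimizer(optimizer='sgd')

# One fixed batch, reused every iteration.
batch = mx.io.DataBatch(data=[mx.nd.ones((batch_size, 215))],
                        label=[mx.nd.ones((batch_size,))])

n_batches = 200
tic = time.time()
for _ in range(n_batches):
    mod.forward(batch, is_train=True)
    mod.backward()
    mod.update()
mx.nd.waitall()          # block until the async engine has finished
print('%.0f samples/sec' % (n_batches * batch_size / (time.time() - tic)))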


#7

Understood. Though I did see better performance with multiple GPUs than with a single one, which caps out at 110-115K samples/second (vs. ~200K for 8 GPUs).

Thanks for the responses and help! This forum is a great idea!