So, I’m trying to understand how distributed training in MXNet really works (in order to optimize my code):
- I understand that Mu Li’s paper is a good reference, but I couldn’t find an answer to this question:

I’m training two networks that differ by a single hyperparameter (a regularization coefficient). The first one is training on the first 8 GPUs, and the second one on all 16 GPUs. Within each run the samples-per-second figures are fairly stable from batch to batch, but the second run is slower overall:
First network:
INFO:root:Epoch[37] Batch [30256] Speed: 197771.84 samples/sec QLSumMetric=729.808672
INFO:root:Epoch[37] Batch [30744] Speed: 206046.96 samples/sec QLSumMetric=729.015916
INFO:root:Epoch[37] Batch [31232] Speed: 178224.72 samples/sec QLSumMetric=727.661732
INFO:root:Epoch[37] Batch [31720] Speed: 167553.20 samples/sec QLSumMetric=724.259217
INFO:root:Epoch[37] Batch [32208] Speed: 208376.83 samples/sec QLSumMetric=732.663074
INFO:root:Epoch[37] Batch [32696] Speed: 206057.11 samples/sec QLSumMetric=716.422353
INFO:root:Epoch[37] Batch [33184] Speed: 187960.58 samples/sec QLSumMetric=733.493551
INFO:root:Epoch[37] Batch [33672] Speed: 208166.71 samples/sec QLSumMetric=733.055588
INFO:root:Epoch[37] Batch [34160] Speed: 192618.61 samples/sec QLSumMetric=723.640843
Second network:
INFO:root:Epoch[13] Batch [11224] Speed: 101530.47 samples/sec QLSumMetric=736.330289
INFO:root:Epoch[13] Batch [11712] Speed: 105602.55 samples/sec QLSumMetric=734.239894
INFO:root:Epoch[13] Batch [12200] Speed: 104586.31 samples/sec QLSumMetric=744.775742
INFO:root:Epoch[13] Batch [12688] Speed: 107612.01 samples/sec QLSumMetric=743.912667
INFO:root:Epoch[13] Batch [13176] Speed: 106278.79 samples/sec QLSumMetric=738.141423
INFO:root:Epoch[13] Batch [13664] Speed: 105420.35 samples/sec QLSumMetric=736.574530
INFO:root:Epoch[13] Batch [14152] Speed: 101914.44 samples/sec QLSumMetric=741.377023
INFO:root:Epoch[13] Batch [14640] Speed: 106754.43 samples/sec QLSumMetric=744.095754
INFO:root:Epoch[13] Batch [15128] Speed: 106558.59 samples/sec QLSumMetric=744.974015
INFO:root:Epoch[13] Batch [15616] Speed: 104182.89 samples/sec QLSumMetric=736.113241
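To quantify the gap, I average the logged speeds with a quick helper (the file names `net1.log`/`net2.log` are just placeholders for wherever I redirect each run’s output); it gives roughly 195k vs. 105k samples/sec:

```python
import re

def mean_speed(lines):
    """Average the samples/sec figures from MXNet Speedometer log lines."""
    matches = (re.search(r'Speed:\s*([\d.]+)\s*samples/sec', l) for l in lines)
    speeds = [float(m.group(1)) for m in matches if m]
    return sum(speeds) / len(speeds)

with open('net1.log') as f:
    print(mean_speed(f))  # ~194.8k samples/sec for the first network
with open('net2.log') as f:
    print(mean_speed(f))  # ~105.0k samples/sec for the second
```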
So the second network runs at about half the speed of the first. (The speeds themselves are fine; I just want to understand why.) gpustat output below:
[0] Tesla K80 | 65'C, 52 % | 317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[1] Tesla K80 | 56'C, 52 % | 321 / 11439 MB | ubuntu(161M) ubuntu(153M)
[2] Tesla K80 | 76'C, 54 % | 328 / 11439 MB | ubuntu(168M) ubuntu(153M)
[3] Tesla K80 | 63'C, 56 % | 330 / 11439 MB | ubuntu(170M) ubuntu(153M)
[4] Tesla K80 | 67'C, 55 % | 317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[5] Tesla K80 | 54'C, 55 % | 317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[6] Tesla K80 | 69'C, 49 % | 317 / 11439 MB | ubuntu(157M) ubuntu(153M)
[7] Tesla K80 | 58'C, 51 % | 326 / 11439 MB | ubuntu(157M) ubuntu(162M)
[8] Tesla K80 | 56'C, 33 % | 176 / 11439 MB | ubuntu(172M)
[9] Tesla K80 | 47'C, 37 % | 179 / 11439 MB | ubuntu(175M)
[10] Tesla K80 | 63'C, 15 % | 157 / 11439 MB | ubuntu(153M)
[11] Tesla K80 | 52'C, 16 % | 157 / 11439 MB | ubuntu(153M)
[12] Tesla K80 | 59'C, 16 % | 159 / 11439 MB | ubuntu(153M)
[13] Tesla K80 | 49'C, 16 % | 157 / 11439 MB | ubuntu(153M)
[14] Tesla K80 | 64'C, 15 % | 157 / 11439 MB | ubuntu(153M)
[15] Tesla K80 | 54'C, 16 % | 157 / 11439 MB | ubuntu(153M)
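For reference, here is roughly how each run above is launched. This is a minimal, self-contained sketch, not my actual code: the tiny MLP, the random `NDArrayIter` data, and the batch size of 1024 are placeholders; the `context` list and `kvstore` setting are the parts that matter.

```python
import mxnet as mx
import numpy as np

def make_module(num_gpus):
    """Build a data-parallel Module spanning the first `num_gpus` devices."""
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data=data, num_hidden=64)
    net = mx.sym.SoftmaxOutput(net, name='softmax')
    return mx.mod.Module(symbol=net,
                         context=[mx.gpu(i) for i in range(num_gpus)])

batch_size = 1024  # global batch; Module splits it evenly across the devices
X = np.random.rand(10 * batch_size, 32).astype(np.float32)
y = np.random.randint(0, 10, 10 * batch_size)
train_iter = mx.io.NDArrayIter(X, y, batch_size, shuffle=True)

# First network:  make_module(8)  -> 128 samples per GPU per batch
# Second network: make_module(16) ->  64 samples per GPU per batch
mod = make_module(16)
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.01},
        kvstore='device',  # aggregate gradients on the GPUs
        num_epoch=1)
```

If I understand the data-parallel model correctly, the global batch is split evenly across the context devices, so going from 8 to 16 GPUs halves the per-GPU batch while adding more gradient-aggregation traffic; is that where the slowdown comes from?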