I recently learned how to use MXNet for multi-GPU training, and found that adding more GPUs does not speed training up.
I ran the example code from GitHub, changing its kv_store type to "device", but running on eight GPUs is nowhere near twice as fast as running on four. I wonder why?
4-GPU run times:
Epoch 0: Test_acc 0.528500 time 24.068163
Epoch 1: Test_acc 0.600200 time 18.431711
Epoch 2: Test_acc 0.622900 time 19.140585
Epoch 3: Test_acc 0.637000 time 18.778502
Epoch 4: Test_acc 0.670600 time 18.383272
8-GPU run times:
Epoch 0: Test_acc 0.515800 time 22.551225
Epoch 1: Test_acc 0.574200 time 19.231086
Epoch 2: Test_acc 0.603300 time 16.836740
Epoch 3: Test_acc 0.557800 time 18.368619
Epoch 4: Test_acc 0.629300 time 17.656158
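To quantify the scaling these logs show, we can compare the steady-state epoch times (epoch 0 includes warm-up, so it is excluded here):

```python
# per-epoch times copied from the logs above, epochs 1-4
t4 = [18.431711, 19.140585, 18.778502, 18.383272]   # 4 GPUs
t8 = [19.231086, 16.836740, 18.368619, 17.656158]   # 8 GPUs

mean4 = sum(t4) / len(t4)
mean8 = sum(t8) / len(t8)
speedup = mean4 / mean8
print(f"4-GPU mean: {mean4:.2f}s  8-GPU mean: {mean8:.2f}s  speedup: {speedup:.2f}x")
# doubling the GPU count yields only ~1.04x, far from the ~2x hoped for
```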
I now want to modify the all_reduce part of MXNet to run some experiments.
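As a starting point, the operation all_reduce performs across GPUs in data-parallel training can be sketched framework-agnostically with NumPy (the shapes and worker count below are illustrative assumptions, not MXNet internals):

```python
import numpy as np

def naive_all_reduce_mean(grads):
    """Average a list of per-worker gradient arrays and broadcast the
    result back to every worker: the sum-then-divide pattern that
    all_reduce implements for data-parallel training."""
    total = np.zeros_like(grads[0])
    for g in grads:                # reduce step: sum over workers
        total += g
    mean = total / len(grads)      # divide by the number of workers
    return [mean.copy() for _ in grads]  # broadcast step

# toy example: 4 "GPUs", each holding a different gradient vector
grads = [np.full(3, float(i)) for i in range(4)]
reduced = naive_all_reduce_mean(grads)
print(reduced[0])  # every worker now holds the same mean gradient
```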
Could we distribute the data evenly across the 8 GPUs in advance, and then, at each iteration, have each GPU randomly draw a batch from its own shard?
How should the data sampler for this be implemented?
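A minimal sketch of such a sampler (the class name and interface are my own assumptions, independent of MXNet's DataLoader/Sampler API): each worker statically owns a contiguous shard of indices and shuffles only that shard each epoch.

```python
import numpy as np

class ShardedRandomSampler:
    """Statically partitions dataset indices across workers, then yields
    a fresh random permutation of the local shard on each pass."""
    def __init__(self, dataset_size, num_workers, worker_rank, seed=0):
        # contiguous, even split; a real sharder must handle remainders
        shard = dataset_size // num_workers
        start = worker_rank * shard
        self.indices = np.arange(start, start + shard)
        self.rng = np.random.default_rng(seed + worker_rank)

    def __iter__(self):
        # re-shuffle this worker's shard every epoch
        yield from self.rng.permutation(self.indices)

    def __len__(self):
        return len(self.indices)

# worker 1 of 8 over a 50k-sample dataset only ever sees 6250..12499
sampler = ShardedRandomSampler(50000, num_workers=8, worker_rank=1)
it = iter(sampler)
first_batch = [next(it) for _ in range(4)]
```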
If I want to modify the model parameters directly, will this implementation be slow? Is there a more elegant way to do it in MXNet?
for ctx_param in param.list_data():
    # scale the parameter copy on each device in place
    ctx_param[:] = ctx_param / self.worker_num