mx.mod.Module provides a convenient high-level API for model training in Python. However, for various reasons I need to train my models in a pure C++ environment. I am wondering whether it is also possible to support multiple GPU devices with the interfaces in the C++ package.
Currently, I can only find
cpp-package/include/mxnet-cpp/executor.h (which supports single-GPU training).
I am not a C++ binding expert, but looking through the API I don't see an obvious way of doing that out of the box either. For example, if you wanted to perform data parallelism (training multiple copies of the same model in parallel, one on each GPU, effectively allowing you to increase your overall batch size), you could proceed in the following way:
- Initializing a copy of your model on each GPU, with identical weights
- Splitting your training batch evenly and copying one slice to each GPU
- Running the forward pass on each GPU's slice
- Running the backward pass to compute the gradients on each GPU
- Aggregating the gradients and applying the same weight update on each GPU
This is effectively what the Module API does under the hood.
Thank you for this information. I wanted to ask the same question as @nicklhy.
I tried to do as you said. I copied the network to each GPU with the weights initialized to the same values (so the same grad_arrays values are computed on the same training data batch). I then concatenated the grad_arrays values and fed them back into the parameter updater with
opt->Update(i, exec1->arg_arrays[i], combinedGradArray1[i]);.
But unfortunately, the model does not train on two GPUs, even though the same base model trains fine on a single GPU. What could be going wrong?
I got a better result by summing the grad_arrays values instead of concatenating them. But it raises a question: how should different batch sizes per GPU be handled?
And still, why does the concatenation of the gradients not work?