I have a training job in which the scheduling of the forward-backward passes is starting to become a bottleneck for multi-GPU training.
When I run it on two GPUs everything looks fine, as described below. When I spread the work over more GPUs, for example 4, the time spent scheduling the forward-backward passes starts to exceed the computation itself.
To give more details: I have profiled the training; data processing/loading happens in the background, and the computational work is similar for each GPU. Everything looks fine on the profiling graphs when running on 2 GPUs. With 4 it is clear that the forward-backward scheduling, and the waiting for it to finish, leaves some GPUs idle. The model itself is quite complicated; for example, it involves an autoregressive loop that iterates dynamically over some blocks.
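To make "scheduling" concrete, here is a rough sketch of what my current single-process data-parallel loop looks like (`net`, `loss_fn`, `trainer` and `train_loader` are placeholders for my actual model, loss, gluon Trainer and data loader; the real model contains the autoregressive loop mentioned above):

```python
import mxnet as mx
from mxnet import autograd, gluon

ctx_list = [mx.gpu(i) for i in range(4)]

for data, label in train_loader:                        # loading runs in the background
    data_parts = gluon.utils.split_and_load(data, ctx_list)
    label_parts = gluon.utils.split_and_load(label, ctx_list)

    losses = []
    with autograd.record():
        # this Python loop is the "scheduling" that becomes the bottleneck:
        # all forward passes are issued sequentially from a single thread
        for x, y in zip(data_parts, label_parts):
            losses.append(loss_fn(net(x), y))
    for l in losses:
        l.backward()
    trainer.step(data.shape[0])
```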
I have a few questions:
Could I perform the forward-backward passes for each GPU separately, in different threads of the same process?
I saw a comment on the forum that autograd is not thread-safe.
But maybe it is safe when each thread only does the scheduling for its own dedicated GPU?
Or maybe there has been some more recent update on this topic, i.e. MXNet/autograd thread safety?
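To illustrate what I mean by the multi-threaded variant, here is a rough sketch (it reuses the placeholder names from the loop above; I have no idea whether this is actually safe, which is exactly my question):

```python
import threading
from mxnet import autograd

# One thread per GPU; each thread only schedules the forward-backward
# for the data slice that lives on its own device.
def worker(i, x, y, losses):
    with autograd.record():
        l = loss_fn(net(x), y)   # net / loss_fn are the placeholders from above
    l.backward()
    losses[i] = l

losses = [None] * len(ctx_list)
threads = [threading.Thread(target=worker, args=(i, x, y, losses))
           for i, (x, y) in enumerate(zip(data_parts, label_parts))]
for t in threads:
    t.start()
for t in threads:
    t.join()
trainer.step(data.shape[0])
```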
If I cannot run the forward-backward passes in a multi-threaded way, then I guess I could do it in a multi-process way.
Do I lose some performance because of that?
Do the GPUs still communicate across processes through internal GPU peer-to-peer communication (I am not sure about the exact terminology), or through some more external protocol such as MPI?
If I want to run the training on, for example, 8 GPUs, does it make sense to run 4 processes with 2 GPUs each, or should I simply go with 8 processes covering 1 GPU each?
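For reference, the multi-process variant I have in mind would look roughly like this (my understanding is that a distributed kvstore is used for this, with the processes started via something like tools/launch.py; please correct me if that is not the recommended way):

```python
import mxnet as mx
from mxnet import gluon

# Each process owns a subset of the GPUs and synchronizes gradients through
# a distributed kvstore instead of in-process device-to-device copies.
kv = mx.kv.create('dist_sync')

gpus_per_worker = 2                      # e.g. 4 processes x 2 GPUs on an 8-GPU box
ctx_list = [mx.gpu(i) for i in range(gpus_per_worker)]

# `net` is again a placeholder for my model
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.01}, kvstore=kv)

# The training loop itself stays the same as the single-process one, just
# restricted to this process's ctx_list; the workers would be launched with
# something like:
#   python tools/launch.py -n 4 python train.py
```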
Model parallelization, i.e. putting different parts of the graph onto different GPUs.
In theory I know what I would need to do to achieve that, but are there any examples of "model parallelization"-style training in MXNet that I could look into?
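To make the question concrete, this is roughly what I would try myself (toy layers and shapes made up for illustration; my assumption is that the cross-device copy inside autograd.record is differentiable, but I would like to see how real examples handle this):

```python
import mxnet as mx
from mxnet import autograd, gluon

# Hypothetical two-stage pipeline split across two GPUs.
stage1 = gluon.nn.Dense(512, activation='relu')
stage2 = gluon.nn.Dense(10)
stage1.initialize(ctx=mx.gpu(0))
stage2.initialize(ctx=mx.gpu(1))

loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
x = mx.nd.random.uniform(shape=(32, 128), ctx=mx.gpu(0))
label = mx.nd.zeros((32,), ctx=mx.gpu(1))

with autograd.record():
    h = stage1(x)                       # computed on gpu(0)
    h = h.as_in_context(mx.gpu(1))      # device-to-device copy between stages
    out = stage2(h)                     # computed on gpu(1)
    loss = loss_fn(out, label)
loss.backward()                         # gradients should flow back across the copy
```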