Forward-backward pass becoming a bottleneck in multi-GPU training


I have a training run in which the forward-backward pass is starting to become a bottleneck in multi-GPU training.

When I run it on two GPUs, everything looks fine, as described below. When I scale to more, for example 4 GPUs, the time spent scheduling the forward-backward passes starts to exceed the computation itself.

To provide more details: I have profiled the training; data processing/loading happens in the background, and the computation work is similar for each GPU. Everything looks fine in the profiling graphs when run on 2 GPUs. With 4, it is clear that the forward-backward scheduling, and waiting for it to finish, leaves some GPUs idle. The model itself is quite complicated; for example, it involves an autoregressive loop that iterates dynamically over some blocks.

I have a few questions:

  1. Multi-threading
    Could I perform the forward-backward passes for each GPU separately, in different threads of the same process?
    I saw a comment on the forum that autograd is not thread-safe.
    But maybe it is safe when each thread only does the scheduling for its own dedicated GPU?
    Or is there some newer information on this topic (mxnet/autograd thread-safety)?

  2. Multi-process
    If I cannot run the forward-backward passes in a multi-threaded way, then I guess I could do it in a multi-process way.
    Do I lose some performance because of that?
    Do the GPUs in different processes still communicate through internal GPU peer-to-peer communication (I don’t know exactly what I am talking about), or through some more external protocol such as MPI?
    If I want to run the training on, for example, 8 GPUs, does it make sense to run 4 processes with 2 GPUs each, or should I simply go with 8 processes, one GPU each?

  3. Model parallelization: putting different parts of the graph onto different GPUs.
    Theoretically, I know what I would need to do to achieve that, but are there any examples of “model parallelization”-style training in MXNet that I could look into?


Hi bart314,

There is actually a way to do what you want with multiple threads. gluon-nlp has a Parallelizable class that your model can extend and a Parallel class that you use in conjunction with your model. There is an example of how to do a forward pass in the documentation of the linked code.

You can use this directly if you’re training the model in python with gluon.

Copying @ThomasDelteil’s answer here for points 2 and 3 for greater visibility.

“I wanted to take the time to run some experiments to give you more data points, but what I would recommend is trying Horovod on a single node. Horovod uses NCCL for GPU-to-GPU communication, and each GPU runs its own process. For Horovod, this discussion might be helpful for you: Distributed training questions”
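For reference, a typical single-node Horovod launch looks like this (assuming a training script named `train.py` that calls `hvd.init()` and pins each process to `mx.gpu(hvd.local_rank())`):

```shell
# Single node with 4 GPUs: one process per GPU, NCCL handles the allreduce.
horovodrun -np 4 -H localhost:4 python train.py
```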

Hi @bart314, something like this works for model parallelization. I haven’t done a realistic training run yet, but the toy example below runs and will give you an idea. I think the code is self-explanatory. The most important part is that you need separate trainers for the different contexts (that is, for the different parts of the model living in different contexts), and each of them needs to be updated. Then the rules I follow are these simple two:

  1. If a model part lives in ctx1, then its input must be brought to the same context before consuming it (use nd.as_in_context(...)).
  2. You need a separate trainer for each different context that you use. Each of the trainers will store the part of the network parameters that live in that specific context. Each separate trainer needs to be updated independently.
import mxnet as mx
from mxnet import gluon, nd, autograd

ctx1 = mx.gpu(0)
ctx2 = mx.gpu(1)
myloss = gluon.loss.L2Loss()

# Model part 1. This can be part of a larger model.
net1 = gluon.nn.Conv2D(32, kernel_size=3, padding=1)
net1.initialize(ctx=ctx1)
trainer1 = gluon.Trainer(net1.collect_params(), 'adam')

# Model part 2. Same: it can be part of a larger model.
net2 = gluon.nn.Conv2D(64, kernel_size=3, padding=1)
net2.initialize(ctx=ctx2)
trainer2 = gluon.Trainer(net2.collect_params(), 'adam')

# Here I introduce an input that lives in ctx1, and a ground-truth label
# that lives in ctx2.
batch_size = 1
input_img = nd.random.uniform(shape=[batch_size,4,256,256],ctx=ctx1)
label_gt = nd.random.uniform(shape=[batch_size,64,256,256],ctx=ctx2)

# This is a single training forward-backward pass and update.
# The output of model part 1 is first copied to the context of model part 2,
# then consumed by model part 2.
with autograd.record():
    out1 = net1(input_img)           # this is in ctx1
    out1 = out1.as_in_context(ctx2)  # copy across devices
    out2 = net2(out1)                # this is in ctx2
    loss = myloss(out2, label_gt)
loss.backward()

# Two separate updates, one for each part of the model.
trainer1.step(batch_size)
trainer2.step(batch_size)

@indu has some amazing tutorials, check this out.
