Very slow initialisation of GPU distributed training

Dear all,

I am facing extreme delays in MXNet initialization for distributed GPU training with Horovod. Once I launch my code (for debugging, just 2 GPUs), it populates the GPUs up to some memory level and then does not start training for about 30 minutes (yes, minutes, and I am using 12 CPU cores per rank). Admittedly, the computation graph of these latest models is very complicated, but I find it hard to believe that this alone is the issue. I hybridize the Gluon models (net and loss function) prior to training with net.hybridize(static_alloc=True, static_shape=True). The problem is not resolved by defining the cache as described in issue 3239.
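For completeness, this is roughly what my setup looks like (a minimal sketch; `net` and `loss_fn` are simplified placeholders for the actual Gluon HybridBlocks in my code):

import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)

# net and loss_fn stand in for the actual HybridBlocks
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True, static_shape=True)
loss_fn.hybridize(static_alloc=True, static_shape=True)

# hybridize() itself is cheap; the expensive graph construction
# happens on the first forward/backward pass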

Any pointers/help mostly appreciated.

Can you give more insight into your infrastructure, just to rule out an issue on that side?
Which version of MXNet?
What instance type? How many GPUs? etc.

Is this a model-specific issue?
Have you tried basic MNIST/ResNet/other popular models with Horovod? Did those work well?
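For example, a quick sanity check could be a model-zoo ResNet wired up with Horovod's MXNet bindings (a sketch, with the dataset and training loop omitted; the model choice and hyperparameters are just illustrative). If a standard model starts training quickly under the same launch command, the slowdown is most likely in your model rather than the infra:

import mxnet as mx
from mxnet.gluon.model_zoo import vision
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())

# a standard model as a baseline
net = vision.resnet50_v1(classes=10)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.hybridize(static_alloc=True, static_shape=True)

params = net.collect_params()
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
trainer = hvd.DistributedTrainer(params, opt)
hvd.broadcast_parameters(params, root_rank=0)
# ... normal Gluon training loop goes here ...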

Hi @ChaiBapchya, thank you very much for your reply.

The model is a building change detection model (like a siamese UNet) with semantic segmentation as the output. The building blocks are unfortunately complicated (in the sense of a complicated computation graph, and complication is bad :frowning: - I apologize for that). I am a few weeks away from submitting for publication and doing some last tests; I will be able to share code afterwards.

The problem is model dependent; everything works fine with standard models. It also seems the problem is not Horovod dependent, because even a standard classification model (outside Horovod) takes a few minutes to launch, whereas a network with an identical backbone built from ResNet blocks launches almost immediately. I ran this test yesterday.

I just exported the model into a JSON file (115733 lines). I don't know if it gives more insight, but its last 3 lines say:

  ],                                                                     
  "heads": [[15266, 0, 0], [15231, 0, 0], [15203, 0, 0], [15270, 0, 0]], 
  "attrs": {"mxnet_version": ["int", 10600]}                             
}                                                                        

The environment is a local HPC cluster; I do my debugging tests by requesting 2 GPUs (P100), 12 processors (Xeon) per process, and 128 GB of memory. It seems these models require a lot of CPU memory as well. MXNet version: cu101-1.6.0.dist-info, CUDA 10.1.168.

I can provide full system info, but I think the key question is: can I load from a file / keep something in memory to avoid going through this operation every time? I think it's a GPU issue; when I load the models on CPU they fire up almost instantly.
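In case it helps, this is roughly how I would try to cache the traced graph via the standard Gluon export/import path (a sketch only; I am not sure it avoids the slow GPU step, and the file prefix and the data0/data1 input names are my assumptions for a two-input HybridBlock):

import mxnet as mx
from mxnet import gluon, nd

# after hybridize() and one forward pass, the traced graph can be saved
_ = net(xx, xx)                       # triggers graph construction
net.export('ceecnet_model', epoch=0)  # writes ceecnet_model-symbol.json + ceecnet_model-0000.params

# later: reload as a SymbolBlock instead of rebuilding the Python blocks
loaded = gluon.nn.SymbolBlock.imports(
    'ceecnet_model-symbol.json',
    ['data0', 'data1'],                       # guessed default input names for a 2-input block
    param_file='ceecnet_model-0000.params',
    ctx=mx.gpu(0))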

I also get this warning; I don't know if it is related:

In [9]: outs = config['net'](xx,xx)                                                                                                                                                  
[17:40:48] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.                                                                                            
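The warning makes me suspect the pointwise fusion pass. One test I plan to run is disabling fusion via the environment variable and checking whether start-up time changes (this is my understanding of the variable in MXNet 1.6, not verified):

import os
# MXNET_USE_FUSION should control MXNet 1.6's pointwise operator fusion on GPU;
# setting it to 0 before importing mxnet should disable the pass entirely
os.environ['MXNET_USE_FUSION'] = '0'

import mxnet as mx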

Again, thank you very much for your time.
Regards

Thanks for the details.

As mentioned, the slow GPU initialization and the large CPU memory footprint both seem to be specific to this model…
As a result, it would be tough to say more before getting details about the implementation.
I'd defer to someone else on the discuss forum with more experience in model building / GPU to help out.

Thank you very much for your reply @ChaiBapchya. I will post code once I submit the paper to get help/gain a deeper understanding.

Sure. Looking forward to your paper being published! Good luck.

Dear Feevos, sorry I can't help with this issue, but I am interested in your paper. How can I find it when it's published?

Dear @ChaiBapchya, thank you very much for your answers in our previous communication. The model that takes a long time to start training when hybridized can be found in this repository: https://github.com/feevos/ceecnet

from ceecnet.models.changedetection.mantis.mantis_dn import *

nfilters_init = 32
depth = 6
NClasses = 2
norm_type = 'GroupNorm'
norm_groups = 4
nheads_start = 4
model = 'CEECNetV2'
upFuse = True

net = mantis_dn_cmtsk(nfilters_init=nfilters_init, NClasses=NClasses, depth=depth,
                      norm_type=norm_type, norm_groups=norm_groups,
                      nheads_start=nheads_start, model=model, upFuse=upFuse)
net.initialize()
                                                   
depth:= 0, nfilters: 32, nheads::4, widths::1
depth:= 1, nfilters: 64, nheads::8, widths::1
depth:= 2, nfilters: 128, nheads::16, widths::1
depth:= 3, nfilters: 256, nheads::32, widths::1
depth:= 4, nfilters: 512, nheads::64, widths::1
depth:= 5, nfilters: 1024, nheads::128, widths::1
depth:= 6, nfilters: 512, nheads::128, widths::1
depth:= 7, nfilters: 256, nheads::64, widths::1
depth:= 8, nfilters: 128, nheads::32, widths::1
depth:= 9, nfilters: 64, nheads::16, widths::1
depth:= 10, nfilters: 32, nheads::8, widths::1

In [7]: from mxnet import nd                                                                                                                    

In [8]: xx = nd.random.uniform(shape=[3,3,256,256])                                                                                             

In [9]: outs = net(xx,xx)                                                                                                                       

In [10]: for out in outs: 
    ...:     print (out.shape) 
    ...:                                                                                                                                        
(3, 2, 256, 256)
(3, 2, 256, 256)
(3, 2, 256, 256)

If I hybridize this network with CUDA optimization, it takes ~1 h to start training, and I think its size does not justify the delay:
[image: screenshot of the model's size]
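To quantify where the delay lands, something like the following (a sketch reusing the net and input from the snippet above, on a single GPU) should isolate the cost of the first hybridized forward pass:

import time
import mxnet as mx
from mxnet import nd

ctx = mx.gpu(0)
net.collect_params().reset_ctx(ctx)
xx = nd.random.uniform(shape=[3, 3, 256, 256], ctx=ctx)

net.hybridize(static_alloc=True, static_shape=True)

start = time.time()
outs = net(xx, xx)   # first call builds the cached graph (and any fused kernels)
mx.nd.waitall()      # wait for all asynchronous GPU work to finish
print('first forward pass: %.1f s' % (time.time() - start))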

Thank you very much for your time,
Foivos

PS @Bumblebee269 I’ve replied in pm.