Very slow initialisation of GPU distributed training

Dear all,

I am facing extreme delays in MXNet initialization for distributed GPU training with Horovod. Once I launch my code (for debugging, just 2 GPUs), it populates the GPUs up to some memory level and then does not start training for about 30 minutes (yes, minutes, and I am using 12 CPU cores per rank). Admittedly, the computation graph of these latest models is very complicated, but I find it hard to believe that this alone is the issue. I hybridize the Gluon models (net and loss function) prior to training with net.hybridize(static_alloc=True, static_shape=True). The problem is not resolved by defining the cache as described in issue 3239.
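For completeness, this is roughly what my setup looks like (a minimal sketch; `net` and `loss_fn` are simplified placeholders for the actual Gluon HybridBlocks in my code):

import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)

# net and loss_fn stand in for the actual HybridBlocks
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True, static_shape=True)
loss_fn.hybridize(static_alloc=True, static_shape=True)

# hybridize() itself is cheap; the expensive graph construction
# happens on the first forward/backward pass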

Any pointers/help mostly appreciated.

Can you give more insight into your infrastructure, just to rule out an issue on that side?
Which version of MXNet?
What instance type? How many GPUs? etc.

Is this a model-specific issue?
Have you tried basic MNIST/ResNet/other popular models with Horovod? Did those work well?
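For example, a quick sanity check could be a model-zoo ResNet wired up with Horovod's MXNet bindings (a sketch, with the dataset and training loop omitted; the model choice and hyperparameters are just illustrative). If a standard model starts training quickly under the same launch command, the slowdown is most likely in your model rather than the infra:

import mxnet as mx
from mxnet.gluon.model_zoo import vision
import horovod.mxnet as hvd

hvd.init()
ctx = mx.gpu(hvd.local_rank())

# a standard model as a baseline
net = vision.resnet50_v1(classes=10)
net.initialize(mx.init.Xavier(), ctx=ctx)
net.hybridize(static_alloc=True, static_shape=True)

params = net.collect_params()
opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
trainer = hvd.DistributedTrainer(params, opt)
hvd.broadcast_parameters(params, root_rank=0)
# ... normal Gluon training loop goes here ...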

Hi @ChaiBapchya, thank you very much for your reply.

The model is a building change detection model (like a siamese UNet) with semantic segmentation as the output. The building blocks are unfortunately complicated (in the sense of a complicated computation graph, and complication is bad :frowning: - I apologize for that). I am a few weeks away from submitting for publication and doing some last tests; I will be able to share code afterwards.

The problem is model dependent; everything works fine with standard models. It also seems the problem is not Horovod dependent, because even a standard classification model (outside Horovod) takes a few minutes to launch, whereas a network with an identical backbone built from ResNet blocks launches almost immediately. I ran this test yesterday.

I just exported the model into a JSON file (115733 lines). I don't know if it gives more insight, but its last 3 lines say:

  ],                                                                     
  "heads": [[15266, 0, 0], [15231, 0, 0], [15203, 0, 0], [15270, 0, 0]], 
  "attrs": {"mxnet_version": ["int", 10600]}                             
}                                                                        

The environment is a local HPC cluster; I do my debugging tests by requesting 2 GPUs (P100), 12 processors (Xeon) per process, and 128 GB of memory. It seems these models require a lot of CPU memory as well. MXNet version: cu101-1.6.0.dist-info, CUDA 10.1.168.

I can provide full system info, but I think the key question is: can I load from a file / keep something in memory to avoid going through this operation every time? I think it's a GPU issue; when I load the models on CPU they fire up almost instantly.
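In case it helps, this is roughly how I would try to cache the traced graph via the standard Gluon export/import path (a sketch only; I am not sure it avoids the slow GPU step, and the file prefix and the data0/data1 input names are my assumptions for a two-input HybridBlock):

import mxnet as mx
from mxnet import gluon, nd

# after hybridize() and one forward pass, the traced graph can be saved
_ = net(xx, xx)                       # triggers graph construction
net.export('ceecnet_model', epoch=0)  # writes ceecnet_model-symbol.json + ceecnet_model-0000.params

# later: reload as a SymbolBlock instead of rebuilding the Python blocks
loaded = gluon.nn.SymbolBlock.imports(
    'ceecnet_model-symbol.json',
    ['data0', 'data1'],                       # guessed default input names for a 2-input block
    param_file='ceecnet_model-0000.params',
    ctx=mx.gpu(0))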

I also get this warning; I don't know if it is related:

In [9]: outs = config['net'](xx,xx)                                                                                                                                                  
[17:40:48] src/imperative/cached_op.cc:192: Disabling fusion due to altered topological order of inputs.                                                                                            
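The warning makes me suspect the pointwise fusion pass. One test I plan to run is disabling fusion via the environment variable and checking whether start-up time changes (this is my understanding of the variable in MXNet 1.6, not verified):

import os
# MXNET_USE_FUSION should control MXNet 1.6's pointwise operator fusion on GPU;
# setting it to 0 before importing mxnet should disable the pass entirely
os.environ['MXNET_USE_FUSION'] = '0'

import mxnet as mx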

Again, thank you very much for your time.
Regards

Thanks for the details.

As mentioned, the slow GPU initialization and the large CPU memory footprint both seem to be specific to this model…
As a result, it would be tough to say more before getting details about the implementation.
I'd defer to someone else on the discuss forum with more experience in model building / GPU to help out.

Thank you very much for your reply @ChaiBapchya. I will post code once I submit the paper to get help/gain a deeper understanding.

Sure. Looking forward to your paper being published! Good luck.

Dear Feevos, sorry I can't help with this issue, but I am interested in your paper. How can I find it when it's published?

Dear @ChaiBapchya, thank you very much for your answers in our previous communication. The model that takes a long time to start training when hybridized can be found in this repository: https://github.com/feevos/ceecnet

from ceecnet.models.changedetection.mantis.mantis_dn import *

nfilters_init = 32
depth = 6
NClasses = 2
norm_type = 'GroupNorm'
norm_groups = 4
nheads_start = 4
model = 'CEECNetV2'
upFuse = True

net = mantis_dn_cmtsk(nfilters_init=nfilters_init, NClasses=NClasses, depth=depth,
                      norm_type=norm_type, norm_groups=norm_groups,
                      nheads_start=nheads_start, model=model, upFuse=upFuse)
net.initialize()
                                                   
depth:= 0, nfilters: 32, nheads::4, widths::1
depth:= 1, nfilters: 64, nheads::8, widths::1
depth:= 2, nfilters: 128, nheads::16, widths::1
depth:= 3, nfilters: 256, nheads::32, widths::1
depth:= 4, nfilters: 512, nheads::64, widths::1
depth:= 5, nfilters: 1024, nheads::128, widths::1
depth:= 6, nfilters: 512, nheads::128, widths::1
depth:= 7, nfilters: 256, nheads::64, widths::1
depth:= 8, nfilters: 128, nheads::32, widths::1
depth:= 9, nfilters: 64, nheads::16, widths::1
depth:= 10, nfilters: 32, nheads::8, widths::1

In [7]: from mxnet import nd                                                                                                                    

In [8]: xx = nd.random.uniform(shape=[3,3,256,256])                                                                                             

In [9]: outs = net(xx,xx)                                                                                                                       

In [10]: for out in outs: 
    ...:     print (out.shape) 
    ...:                                                                                                                                        
(3, 2, 256, 256)
(3, 2, 256, 256)
(3, 2, 256, 256)

If I hybridize this network with CUDA optimization, it takes ~1 h to start training, and I think its size does not justify the delay:
[image: screenshot of the model's size]
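To quantify where the delay lands, something like the following (a sketch reusing the net and input from the snippet above, on a single GPU) should isolate the cost of the first hybridized forward pass:

import time
import mxnet as mx
from mxnet import nd

ctx = mx.gpu(0)
net.collect_params().reset_ctx(ctx)
xx = nd.random.uniform(shape=[3, 3, 256, 256], ctx=ctx)

net.hybridize(static_alloc=True, static_shape=True)

start = time.time()
outs = net(xx, xx)   # first call builds the cached graph (and any fused kernels)
mx.nd.waitall()      # wait for all asynchronous GPU work to finish
print('first forward pass: %.1f s' % (time.time() - start))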

Thank you very much for your time,
Foivos

PS @Bumblebee269 I’ve replied in pm.