I am getting a weird error (I am not sure if it is a CUDA or an MXNet bug?); perhaps someone can help. I have code for a semantic segmentation problem, and it runs fine on an HPC cluster, single node, when I use 4 GPUs (all of the node's GPUs). However, if I request fewer than 4 GPUs, I get the following error:
```
terminate called after throwing an instance of 'dmlc::Error'
  what():  [14:46:44] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: invalid device ordinal

Stack trace returned 9 entries:
[bt] (0) /home/dia021/Software/mxnet/libmxnet.so(+0x2d7c72) [0x2aaab329fc72]
[bt] (1) /home/dia021/Software/mxnet/libmxnet.so(+0x2d8248) [0x2aaab32a0248]
[bt] (2) /home/dia021/Software/mxnet/libmxnet.so(+0x2695bd0) [0x2aaab565dbd0]
[bt] (3) /home/dia021/Software/mxnet/libmxnet.so(+0x269d1d8) [0x2aaab56651d8]
[bt] (4) /home/dia021/Software/mxnet/libmxnet.so(+0x269d45e) [0x2aaab566545e]
[bt] (5) /home/dia021/Software/mxnet/libmxnet.so(+0x269767b) [0x2aaab565f67b]
[bt] (6) /home/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x2aaae5ff2c5c]
[bt] (7) /lib64/libpthread.so.0(+0x8744) [0x2aaaaacd6744]
[bt] (8) /lib64/libc.so.6(clone+0x6d) [0x2aaaaafd4aad]

Aborted (core dumped)
```
Inside my code (I mainly follow the Gluon example here), I define ctx in the following way:
```python
import os
import mxnet as mx

gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
ctx = [mx.gpu(i) for i in gpus]
```
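To make concrete what that parsing produces, here is a minimal standalone sketch (the CUDA_VISIBLE_DEVICES value below is a hypothetical example of what SLURM might export for a 1-GPU job):

```python
import os
import mxnet as mx

# Hypothetical 1-GPU SLURM allocation; SLURM exports the physical device ID:
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
ctx = [mx.gpu(i) for i in gpus]
print(ctx)  # [gpu(2)] -- the raw SLURM device ID is reused as the MXNet device ordinal
```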
Here is the definition of my NN (ResUNet_d5 is something like a UNet with skip connections and inception-like modules; it is well tested):
```python
mynet = ResUNet_d5(_nfilters_init=nfilters_init, _NClasses=NClasses)
mynet.collect_params().initialize(mx.initializer.Xavier(), ctx=ctx)

# Change the default grad_req behavior to increase the effective batch size
for param in mynet.collect_params().values():
    param.grad_req = 'add'
```
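In case it helps to see the accumulation pattern in isolation, here is a minimal self-contained sketch of the grad_req = 'add' trick, on CPU, with a hypothetical toy Dense layer standing in for ResUNet_d5:

```python
import mxnet as mx
from mxnet import autograd, gluon, nd

net = gluon.nn.Dense(1)              # hypothetical toy net standing in for ResUNet_d5
net.initialize(mx.initializer.Xavier(), ctx=mx.cpu())
for param in net.collect_params().values():
    param.grad_req = 'add'           # gradients now accumulate instead of being overwritten

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
x, y = nd.ones((4, 2)), nd.ones((4, 1))

for _ in range(8):                   # 8 "micro-batches" accumulated into one update
    with autograd.record():
        loss = ((net(x) - y) ** 2).mean()
    loss.backward()

trainer.step(4 * 8)                  # normalize by the effective batch size
for param in net.collect_params().values():
    param.zero_grad()                # reset the accumulated gradients for the next cycle
```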
This is my forward_backward function (in accordance with the Gluon example, and special thanks to @safrooze for the trick that increases the effective batch size):
```python
from mxnet import autograd

delay_rate = 8  # batch_size = 256

def forward_backward_step(_iteration, _nbatch, _data, _label):
    with autograd.record():
        # First argument is PREDICTIONS, second is LABELS.
        # jacc_idx is a dice-coefficient-like loss, derived from gluon.loss.Loss.
        losses = [(1.0 - jacc_idx(mynet(inputs), labels))
                  for inputs, labels in zip(_data, _label)]
    # Evaluate gradients on each ctx
    for l in losses:
        l.backward()
    # This updates the parameters across ALL devices, by first aggregating
    # the gradients. <3 Gluon!
    if _iteration % delay_rate == 0:
        trainer.step(_nbatch * delay_rate)
        for param in mynet.collect_params().values():
            param.zero_grad()
    return losses
```
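For completeness, the function above is driven from a loop roughly like the following sketch (train_data, epochs, and the batch shapes are placeholders; the per-device lists _data and _label come from gluon.utils.split_and_load):

```python
from mxnet import gluon, nd

for epoch in range(epochs):
    for iteration, (batch_data, batch_label) in enumerate(train_data):
        # Split each batch evenly across all devices in ctx
        data = gluon.utils.split_and_load(batch_data, ctx_list=ctx)
        label = gluon.utils.split_and_load(batch_label, ctx_list=ctx)
        losses = forward_backward_step(iteration, batch_data.shape[0], data, label)
        nd.waitall()  # force the asynchronous computation to complete
```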
The GPU that the (SLURM) manager gives me for this particular crashing example is GPU 2, i.e. ctx = [mx.gpu(2)].
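If it is useful, I assume the failure could be reproduced in isolation like this (my own minimal sketch, not taken from the original job script):

```python
import os
import mxnet as mx

# Simulate the 1-GPU SLURM allocation; must be set before any CUDA context exists
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

a = mx.nd.zeros((1,), ctx=mx.gpu(0))
a.wait_to_read()  # fine: the single visible GPU is addressed as ordinal 0

b = mx.nd.zeros((1,), ctx=mx.gpu(2))
b.wait_to_read()  # aborts with "CUDA: invalid device ordinal", as in the trace above
```

(CUDA renumbers whatever CUDA_VISIBLE_DEVICES exposes starting from 0 inside the process, which makes me wonder whether reusing the raw SLURM IDs as device ordinals is the issue.)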
MXNet version: 1.2.0
CUDA version: 8.0.61
Python version: 3.6.4
Any idea what is going wrong?