Check failed: e == cudaSuccess CUDA: invalid device ordinal

Dear all,

I am getting a weird error (I am not sure whether it is a CUDA or an MXNet bug); perhaps someone can help. I have code for a semantic segmentation problem, and it runs fine on an HPC cluster (single node) when I use 4 GPUs (all of the node's GPUs). However, if I request fewer than 4 GPUs, I get the following error:

terminate called after throwing an instance of 'dmlc::Error'
  what():  [14:46:44] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./tensor_gpu-inl.h:35: Check failed: e == cudaSuccess CUDA: invalid device ordinal

Stack trace returned 9 entries:
[bt] (0) /home/dia021/Software/mxnet/libmxnet.so(+0x2d7c72) [0x2aaab329fc72]
[bt] (1) /home/dia021/Software/mxnet/libmxnet.so(+0x2d8248) [0x2aaab32a0248]
[bt] (2) /home/dia021/Software/mxnet/libmxnet.so(+0x2695bd0) [0x2aaab565dbd0]
[bt] (3) /home/dia021/Software/mxnet/libmxnet.so(+0x269d1d8) [0x2aaab56651d8]
[bt] (4) /home/dia021/Software/mxnet/libmxnet.so(+0x269d45e) [0x2aaab566545e]
[bt] (5) /home/dia021/Software/mxnet/libmxnet.so(+0x269767b) [0x2aaab565f67b]
[bt] (6) /home/dia021/Software/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x2aaae5ff2c5c]
[bt] (7) /lib64/libpthread.so.0(+0x8744) [0x2aaaaacd6744]
[bt] (8) /lib64/libc.so.6(clone+0x6d) [0x2aaaaafd4aad]


Aborted (core dumped)

Inside my code (I mainly follow the gluon example here), I define ctx in the following way:

gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
ctx = [mx.gpu(i)  for i in gpus]

Definition of my NN (ResUNet_d5 is something like a UNet with skip connections and inception-like modules; it is well tested):

mynet = ResUNet_d5(_nfilters_init=nfilters_init, _NClasses=NClasses)
mynet.collect_params().initialize(mx.initializer.Xavier(), ctx=ctx)

# Change the default grad_req to 'add' so gradients accumulate across backward passes (larger effective batch size)
for param in mynet.collect_params().values():
    param.grad_req = 'add'
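
To illustrate what grad_req = 'add' does, here is a tiny, self-contained sketch (the Dense layer is just a toy stand-in, not part of my model): every backward() call adds to the stored gradient buffers, and they are only reset by zero_grad().

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(1)            # toy layer, only to illustrate gradient accumulation
net.initialize()
for p in net.collect_params().values():
    p.grad_req = 'add'             # accumulate gradients instead of overwriting them

x = mx.nd.ones((2, 3))
for _ in range(2):
    with autograd.record():
        y = net(x).sum()
    y.backward()                   # each call adds to the existing gradient buffers

# After two passes the stored gradient is twice the single-pass gradient;
# zero_grad() resets the buffers once the accumulated gradient has been applied.
for p in net.collect_params().values():
    p.zero_grad()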

This is my forward_backward function (again following the gluon example; special thanks to @safrooze for the trick to increase the effective batch size):

delay_rate = 8  # accumulate gradients over 8 iterations -> effective batch size = 256
def forward_backward_step(_iteration, _nbatch,  _data, _label):
    with autograd.record():
        # First argument is PREDICTIONS, Second LABELS 
        # here jacc_idx is a dice coefficient-like loss, derived from gluon.loss.Loss 
        losses = [(1.0 - jacc_idx(mynet(inputs), labels))
                  for inputs, labels in zip(_data, _label)]

    # Evaluate gradients in each ctx
    for l in losses: 
        l.backward()

    # trainer.step() aggregates the gradients across ALL devices and updates the parameters. <3 Gluon!
    if (_iteration % delay_rate == 0):
        trainer.step(_nbatch * delay_rate)   
        for param in mynet.collect_params().values():
            param.zero_grad()

    return losses
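
For context, here is a minimal sketch of how forward_backward_step might be driven from a training loop (the data iterator, the optimizer settings and the number of epochs are placeholders for illustration, not my actual setup):

from mxnet import gluon

trainer = gluon.Trainer(mynet.collect_params(), 'adam', {'learning_rate': 1e-3})

for epoch in range(epochs):
    for iteration, (data, label) in enumerate(train_data):
        # Split the batch evenly across the available contexts (one slice per GPU)
        data_list = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
        label_list = gluon.utils.split_and_load(label, ctx_list=ctx, batch_axis=0)

        losses = forward_backward_step(iteration, data.shape[0], data_list, label_list)

        # asscalar() blocks until the asynchronous computation has finished
        batch_loss = sum(l.mean().asscalar() for l in losses) / len(losses)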

For this particular example that crashes, the GPU that the (SLURM) manager gives me results in ctx = [gpu(2)].

MXNet version: 1.2.0
CUDA version: 8.0.61
Python version: 3.6.4

Any idea what is going wrong?

Thanks

It seems that by changing

gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
ctx = [mx.gpu(i)  for i in gpus]

to

gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
ctx = [mx.gpu(i)  for i in range(len(gpus))]

everything works. It seems that the devices listed in CUDA_VISIBLE_DEVICES are renumbered inside the process starting from 0, so MXNet only ever sees gpu(0) .. gpu(N-1); this is why it fails when it is asked for gpu(2) in the previous example.
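
In other words, here is a small sketch of the difference (assuming, for illustration, that SLURM exported CUDA_VISIBLE_DEVICES=2 for the job):

import os
import mxnet as mx

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(',')    # e.g. ['2'] for this job

# Wrong: this uses the physical device ID, but CUDA renumbers the visible devices
# starting from 0 inside the process, so gpu(2) does not exist within the job
# and MXNet raises "invalid device ordinal".
ctx_wrong = [mx.gpu(int(x)) for x in visible]              # [gpu(2)]

# Right: enumerate the visible devices starting from 0.
ctx_right = [mx.gpu(i) for i in range(len(visible))]       # [gpu(0)]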