Newbie Question: DMLC Error: EC2 CUDA devices not available

I am seeing the following error, but there are no other jobs running on the EC2 instances. Has anyone seen this problem before?

terminate called after throwing an instance of 'dmlc::Error'
what(): [23:45:47] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:182: Check failed: e == cudaSuccess CUDA: all CUDA-capable devices are busy or unavailable

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x276938) [0x7f6862f52938]
[bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x276d48) [0x7f6862f52d48]
[bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23e2213) [0x7f68650be213]
[bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23e4265) [0x7f68650c0265]
[bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23f78ce) [0x7f68650d38ce]
[bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23f7b26) [0x7f68650d3b26]
[bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x23f1ddb) [0x7f68650cdddb]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p27/bin/…/lib/libstdc++.so.6(+0xafc5c) [0x7f692a713c5c]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f692b7556ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f692ad7b41d]

Bumping this post. Can anyone help me with this issue? It's been four days.

Hi

Would you be able to provide more info to help reproduce and/or debug? A minimal code example, the EC2 instance type (and number of GPUs) you're running, and the output of the watch query-compute-apps command below while your program is running would all help.
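For example, a minimal repro can be as small as allocating one array on each GPU context. The sketch below is only an illustration (it assumes MXNet imports as mx inside your mxnet_p27 environment and that the instance has up to 4 GPUs; adjust the range to your instance type). Forcing a small computation on each device will usually surface an unavailable device as an MXNetError rather than only as the worker-thread crash in your trace:

import mxnet as mx

# Touch each GPU: creating an array and calling asnumpy() forces the CUDA
# stream to be set up and the computation to actually run on that device.
for gpu_id in range(4):  # adjust to the number of GPUs on your instance
    try:
        a = mx.nd.ones((2, 2), ctx=mx.gpu(gpu_id))
        a.asnumpy()  # blocks until the GPU work completes
        print("gpu(%d) OK" % gpu_id)
    except mx.base.MXNetError as err:
        print("gpu(%d) unavailable: %s" % (gpu_id, err))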

You mentioned no other processes are running, but you may want to double-check with nvidia-smi.

In other words, if you run:

watch nvidia-smi --query-compute-apps=pid,gpu_name,gpu_uuid,process_name,used_memory

You'll see a printout like:

pid, gpu_name, process_name, used_gpu_memory [MiB]
6001, Tesla V100-SXM2-16GB, /usr/bin/python2, 916 MiB
6001, Tesla V100-SXM2-16GB, /usr/bin/python2, 910 MiB

You may want to kill those processes with kill $pid and then trigger a GPU reset (see nvidia-smi --help for a description of the -r option); a small Python sketch for automating the kill step follows the reset command below.

nvidia-smi -r
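If you prefer to script the cleanup, here is a rough sketch (untested on your setup; it only uses nvidia-smi's --query-compute-apps and --format=csv,noheader options plus os.kill) that lists the compute processes currently holding the GPUs and sends them SIGTERM. Only run it when nothing of yours is supposed to be training, and note that you can only kill processes owned by your user:

import os
import signal
import subprocess

# Ask nvidia-smi for every compute process currently using a GPU.
out = subprocess.check_output(
    ["nvidia-smi",
     "--query-compute-apps=pid,process_name,used_memory",
     "--format=csv,noheader"]).decode()

for line in out.splitlines():
    if not line.strip():
        continue  # skip blank lines in the output
    pid = int(line.split(",")[0])
    print("terminating GPU process %d (%s)" % (pid, line.strip()))
    os.kill(pid, signal.SIGTERM)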

Vishaal