GPU count is -1

Hello! One of the users on my server has been running into an issue with MXNet not detecting the GPU. The output of nvidia-smi is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.46                 Driver Version: 390.46                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   66C    P0   181W / 250W |  15105MiB / 16280MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   30C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     33904      C   python3                                    15095MiB |
+-----------------------------------------------------------------------------+

So, I can confirm the system has GPUs. However, a simple MXNet calculation fails. The call is essentially just an allocation on a GPU context, something like this:
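
import mxnet as mx

# Any allocation on a GPU context fails the same way.
a = mx.nd.ones((2, 2), ctx=mx.gpu(0))

and it produces the following error: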

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2421, in ones
    return _internal._ones(shape=shape, ctx=ctx, dtype=dtype, **kwargs)
  File "<string>", line 34, in _ones
  File "/export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [12:48:33] src/engine/threaded_engine.cc:328: Check failed: device_count_ > 0 (-1 vs. 0) GPU usage requires at least 1 GPU

Stack trace returned 10 entries:
[bt] (0) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f23c2) [0x7f4683aac3c2]
[bt] (1) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f2988) [0x7f4683aac988]
[bt] (2) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a575e) [0x7f468675f75e]
[bt] (3) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30aabaf) [0x7f4686764baf]
[bt] (4) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x2cd) [0x7f4686806a4d]
[bt] (5) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0x2b3) [0x7f468680b313]
[bt] (6) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x368) [0x7f468680c0b8]
[bt] (7) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x305ad8b) [0x7f4686714d8b]
[bt] (8) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x6f) [0x7f468671534f]
[bt] (9) /export/tools/anaconda/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f46ec83aec0]

I am quite confused by the Check failed: device_count_ > 0 (-1 vs. 0) line. I assume a device count of -1 means some detection call returned an error, but I haven't been able to find any documentation on how that value gets set. If anyone else has come across this issue, I would appreciate the insight.

Hi @zacharied,

Are you using a CUDA build of MXNet? Give the following a try…

pip install mxnet-cu90 --upgrade 

Change cu90 to match the version of CUDA installed on the server (nvcc --version will tell you which one that is).
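
Once that's installed, a quick way to check that the GPU-enabled build is the one being loaded (num_gpus() only exists in more recent MXNet releases, hence the fallback probe):

import mxnet as mx

print(mx.__version__)

try:
    # Available in newer MXNet releases.
    print("GPUs visible to MXNet:", mx.context.num_gpus())
except AttributeError:
    # Older releases: probe the GPU directly instead.
    mx.nd.ones((1,), ctx=mx.gpu(0)).wait_to_read()
    print("GPU 0 is usable")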