Hello! One of the users on my server has run into an issue where MXNet does not detect the GPU. The output of nvidia-smi is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.46                 Driver Version: 390.46                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   66C    P0   181W / 250W |  15105MiB / 16280MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   30C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     33904      C   python3                                    15095MiB |
+-----------------------------------------------------------------------------+
So I can confirm that the system has GPUs: two Tesla P100s, one of which is idle. However, running a simple MXNet calculation produces the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2421, in ones
    return _internal._ones(shape=shape, ctx=ctx, dtype=dtype, **kwargs)
  File "<string>", line 34, in _ones
  File "/export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [12:48:33] src/engine/threaded_engine.cc:328: Check failed: device_count_ > 0 (-1 vs. 0) GPU usage requires at least 1 GPU
Stack trace returned 10 entries:
[bt] (0) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f23c2) [0x7f4683aac3c2]
[bt] (1) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3f2988) [0x7f4683aac988]
[bt] (2) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a575e) [0x7f468675f75e]
[bt] (3) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30aabaf) [0x7f4686764baf]
[bt] (4) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)+0x2cd) [0x7f4686806a4d]
[bt] (5) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::InvokeOp(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode, mxnet::OpStatePtr)+0x2b3) [0x7f468680b313]
[bt] (6) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::Imperative::Invoke(mxnet::Context const&, nnvm::NodeAttrs const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x368) [0x7f468680c0b8]
[bt] (7) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x305ad8b) [0x7f4686714d8b]
[bt] (8) /export/tools/anaconda/anaconda3/lib/python3.6/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x6f) [0x7f468671534f]
[bt] (9) /export/tools/anaconda/anaconda3/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f46ec83aec0]
I am quite confused by the "Check failed: device_count_ > 0 (-1 vs. 0)" line. I assume the device count of -1 is the result of some detection function returning an error, but I haven’t been able to find any documentation on how that detection works. If anyone else has come across this issue, I would appreciate any insight.
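To test that assumption outside of MXNet, a sketch like the following could query the CUDA runtime directly. It assumes that MXNet gets its device count from the CUDA runtime call cudaGetDeviceCount (I have not confirmed this in the MXNet source) and that a libcudart shared library can be found by the system loader. The helper name cuda_device_count is purely illustrative.

```python
import ctypes
import ctypes.util


def cuda_device_count(lib_name=None):
    """Ask the CUDA runtime how many devices it sees.

    Returns (error_code, device_count), or None if no CUDA runtime
    library could be loaded. lib_name is an optional explicit library
    name or path; by default the loader is asked to find "cudart".
    """
    name = lib_name or ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        cudart = ctypes.CDLL(name)
        get_count = cudart.cudaGetDeviceCount
    except (OSError, AttributeError):
        # Library missing, not loadable, or not actually a CUDA runtime.
        return None
    # Start at -1 so an untouched value is distinguishable from "0 devices".
    count = ctypes.c_int(-1)
    # cudaGetDeviceCount(int*) returns a cudaError_t; 0 means cudaSuccess.
    err = get_count(ctypes.byref(count))
    return err, count.value


if __name__ == "__main__":
    result = cuda_device_count()
    if result is None:
        print("No CUDA runtime library could be loaded")
    else:
        print("error code: %d, device count: %d" % result)
```

A non-zero error code here would suggest that the failure lies in the CUDA runtime or driver setup of the environment rather than in MXNet itself. Running it from the same shell and Python environment as the failing MXNet call matters, since settings such as CUDA_VISIBLE_DEVICES affect the result.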