MXNet Forum

Cannot predict with context = gpu()


#1

I’ve trained a simple net with batch normalization before the activations, but when I try to predict with context = gpu() I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/ndarray/ndarray.py", line 1972, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [15:34:32] src/operator/nn/./cudnn/cudnn_batch_norm-inl.h:157: Check failed: e == CUDNN_STATUS_SUCCESS (9 vs. 0) cuDNN: CUDNN_STATUS_NOT_SUPPORTED

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x36161a) [0x7fcc1f92461a]
[bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x361c31) [0x7fcc1f924c31]
[bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x335345f) [0x7fcc2291645f]
[bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x33571b7) [0x7fcc2291a1b7]
[bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a6fe6f) [0x7fcc22032e6f]
[bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a76aec) [0x7fcc22039aec]
[bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a55354) [0x7fcc22018354]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a59623) [0x7fcc2201c623]
[bt] (8) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a59876) [0x7fcc2201c876]
[bt] (9) /home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x2a55a64) [0x7fcc22018a64]

I’m calling the model with:
dataiter = mx.io.NDArrayIter(data=data, label=label, batch_size=data.shape[0])
model = mx.module.Module.load(prefix=modelPrefix, epoch=epoch, context=mx.gpu())
model.bind(data_shapes=dataiter.provide_data, label_shapes=dataiter.provide_label)
preds = model.predict(dataiter)


#2

This error points to an unsupported use of cuDNN in the operator implementation. Would you be able to simplify your network to figure out which operator is causing this error?


#3

Do you mean work backwards, pruning layers? I assumed it was one of the batch_norms because of the “src/operator/nn/./cudnn/cudnn_batch_norm-inl.h:157” in the error output.
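For what it’s worth, the source file named in the error line can also be pulled out mechanically rather than by eye. This is only a minimal sketch, assuming the message follows the "[time] file:line: message" layout seen in the traceback above; the locate_failure helper and its regex are hypothetical and not part of MXNet:

```python
import re

# Last line of the traceback quoted in the first post.
error_line = (
    "mxnet.base.MXNetError: [15:34:32] "
    "src/operator/nn/./cudnn/cudnn_batch_norm-inl.h:157: "
    "Check failed: e == CUDNN_STATUS_SUCCESS (9 vs. 0) "
    "cuDNN: CUDNN_STATUS_NOT_SUPPORTED"
)

# Assumed layout: "[HH:MM:SS] <source file>:<line number>: <message>".
pattern = re.compile(
    r"\[\d{2}:\d{2}:\d{2}\] (?P<file>\S+):(?P<line>\d+): (?P<message>.*)"
)


def locate_failure(text):
    """Return (source file, line number, message), or None if no match."""
    match = pattern.search(text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line")), match.group("message")


source_file, line_no, message = locate_failure(error_line)
print(source_file, line_no)
```

The file name ends in cudnn_batch_norm-inl.h, which is why I suspect the cuDNN batch-norm operator rather than any other layer.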

Doesn’t it seem odd that I trained this on an Amazon Deep Learning AMI using a GPU with no problems, but with the same setup predict fails?


#4

Sorry, I didn’t read the error very carefully. You’re right, the error does appear to come from batch_norm. Are the data sizes or batch sizes any different between training and prediction?


#5

Yes, they are… The model is a deep matrix factorization, so at prediction time I’m just iterating over the (row, column) coordinates.


#6

Without access to your code, I can only hypothesize that the change in batch size is somehow causing this issue. Did you try keeping the batch size the same as in training, or at least not equal to 1? (Not as a solution, but as a debugging step.)
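To make that debugging step concrete, here is a rough sketch of what re-batching the prediction data would involve. It is not a confirmed fix. The batch_plan helper and the train_batch_size value are hypothetical, and the MXNet calls in the trailing comments are the ones from the first post with only batch_size changed:

```python
def batch_plan(num_rows, batch_size):
    """Return (number of batches, padded rows in the final batch)."""
    if num_rows <= 0 or batch_size <= 0:
        raise ValueError("num_rows and batch_size must be positive")
    # Ceiling division: a partial final batch still counts as one batch.
    num_batches = (num_rows + batch_size - 1) // batch_size
    padding = num_batches * batch_size - num_rows
    return num_batches, padding


# One batch holding every row, as in the original predict call.
print(batch_plan(1000, 1000))

# The same rows split into training-sized batches (256 is an assumed value).
print(batch_plan(1000, 256))

# Hypothetical use with the code from the first post:
#   train_batch_size = 256  # assumption: whatever was used for training
#   dataiter = mx.io.NDArrayIter(data=data, label=label,
#                                batch_size=train_batch_size)
#   model.bind(data_shapes=dataiter.provide_data,
#              label_shapes=dataiter.provide_label)
#   preds = model.predict(dataiter)
```

As far as I know, NDArrayIter pads the final batch by default and predict drops the padded rows when it merges the outputs, so the number of predictions should still match the number of input rows.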