CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED

#1

Hi, I’m trying to use the following code to predict with a ResNet-50:

ctxs = [mx.gpu(0), mx.gpu(1)]

def test(net, test_data_loader, threshold, ctxs):
    predicted = []

    for i, data in enumerate(test_data_loader):
        data = gluon.utils.split_and_load(data, ctx_list=ctxs, batch_axis=0, even_split=False)
        outputs = [net(X) for X in data]
        for output in outputs:
            output = output.as_in_context(ctxs[0])
        outputs = nd.concat(*outputs, dim=0)
        score_predicts = nd.sigmoid(outputs)

        label_predicts = [np.arange(28)[np.argwhere(score_predict.asnumpy() > threshold)]
                          for score_predict in score_predicts]
        str_predict_labels = [' '.join(str(np.asscalar(l)) for l in label_predict)
                              for label_predict in label_predicts]
        predicted.extend(str_predict_labels)

But this code only works occasionally. Most of the time it comes back with the following error message:

---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-31-18b005eb63dd> in <module>()
      5 test_dataset = ProteinDataset(root=TEST,size=size,transform=joint_transform,istrain=False)
      6 test_data_loader = mx.gluon.data.DataLoader(test_dataset,batch_size=batch_size,shuffle=False,num_workers=num_workers)
----> 7 test(finetune_net, test_dataset,test_data_loader, threshold, ctxs,file_name)

<ipython-input-30-9ddd755ca136> in test(net, test_dataset, test_data_loader, threshold, ctxs, file_name)
     14         score_predicts = nd.sigmoid(outputs)
     15 
---> 16         label_predicts = [np.arange(28) [ np.argwhere( (score_predict).asnumpy() > threshold ) ]                       for score_predict in score_predicts]
     17         str_predict_labels = [' '.join(str(np.asscalar(l)) for l in label_predict) for label_predict in label_predicts]
     18         predicted.extend(str_predict_labels)

<ipython-input-30-9ddd755ca136> in <listcomp>(.0)
     14         score_predicts = nd.sigmoid(outputs)
     15 
---> 16         label_predicts = [np.arange(28) [ np.argwhere( (score_predict).asnumpy() > threshold ) ]                       for score_predict in score_predicts]
     17         str_predict_labels = [' '.join(str(np.asscalar(l)) for l in label_predict) for label_predict in label_predicts]
     18         predicted.extend(str_predict_labels)

~/anaconda3/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   1970             self.handle,
   1971             data.ctypes.data_as(ctypes.c_void_p),
-> 1972             ctypes.c_size_t(data.size)))
   1973         return data
   1974 

~/anaconda3/lib/python3.7/site-packages/mxnet/base.py in check_call(ret)
    250     """
    251     if ret != 0:
--> 252         raise MXNetError(py_str(_LIB.MXGetLastError()))
    253 
    254 

MXNetError: [19:33:11] src/operator/nn/./cudnn/cudnn_activation-inl.h:129: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED

Stack trace returned 10 entries:
[bt] (0) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3617ba) [0x7f125e1d97ba]
[bt] (1) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x361dd1) [0x7f125e1d9dd1]
[bt] (2) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x336b6c5) [0x7f12611e36c5]
[bt] (3) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x336de58) [0x7f12611e5e58]
[bt] (4) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2aead6a) [0x7f1260962d6a]
[bt] (5) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2a4ba67) [0x7f12608c3a67]
[bt] (6) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2a4ba67) [0x7f12608c3a67]
[bt] (7) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2a4ba67) [0x7f12608c3a67]
[bt] (8) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2a4ba67) [0x7f12608c3a67]
[bt] (9) /home/jumpywizard/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2a4ba67) [0x7f12608c3a67]

Where is the problem? Thanks!
By the way, I’m using Ubuntu 18.04, CUDA 9.2, cuDNN 7.1.4 on a GTX 1080 Ti, with MXNet 1.3.0.

#2

CUDNN_STATUS_EXECUTION_FAILED is just a generic error message meaning that execution using the GPU has failed for this specific operation. Could you explain your logic behind this line:

label_predicts = [np.arange(28) [ np.argwhere( (score_predict).asnumpy() > threshold ) ] for score_predict in score_predicts]

We’re getting the error when this line of Python is executed.

#3

Would be great if you could add three backticks before and after your code (i.e. ```) to get better formatting in your post. Otherwise you get strange squares appearing in your code!

#4

Thanks!
I’ve worked it out.

data = gluon.utils.split_and_load(data, ctx_list=ctxs, batch_axis=0, even_split=False)
outputs = [net(X) for X in data]

The two lines above split the data, and therefore the outputs, across different devices, so when I concatenate the outputs and call asnumpy(), I get an error.
The simplest way to fix it is to place all the outputs on the same device before concatenating.
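
For reference, here is a minimal sketch of the corrected inner loop (imports added; net, data, and ctxs are the same objects as in my first post). Note that rebinding output inside a plain for loop does not change the list, so the per-GPU outputs have to be collected into a new list:

import mxnet as mx
from mxnet import gluon, nd

data = gluon.utils.split_and_load(data, ctx_list=ctxs, batch_axis=0, even_split=False)
outputs = [net(X) for X in data]
# copy every per-GPU output onto a single context before concatenating
outputs = [output.as_in_context(ctxs[0]) for output in outputs]
outputs = nd.concat(*outputs, dim=0)
score_predicts = nd.sigmoid(outputs)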

#5

Glad you managed to find the solution!

#6

I’ve met the same issue… and it has troubled me for a long time. I want to know how to address it. My error is:

mxnet.base.MXNetError: [18:12:09] src/operator/nn/./cudnn/cudnn_convolution-inl.h:156: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED

#7

Hi @bgao,

Any luck training the model on a CPU context? You might find that your issue also affects training on CPU, and there could be a more informative error for you there. Otherwise, make sure your data and parameters are all in the same context (i.e. on the same GPU).
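
For example, here is a minimal sketch of a CPU-context training loop (net and train_data_loader stand in for your own model and data loader, a softmax cross-entropy loss is assumed, and the Trainer/optimizer step is omitted for brevity):

import mxnet as mx
from mxnet import autograd, gluon

ctx = mx.cpu()                       # swap back to mx.gpu(0) once the CPU run looks healthy
net.collect_params().reset_ctx(ctx)  # keep all parameters on the chosen context
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

for data, label in train_data_loader:
    # data and labels must live on the same context as the parameters
    data, label = data.as_in_context(ctx), label.as_in_context(ctx)
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()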

#8

Yes, it happens when I use a single GPU to train the model. Sometimes I can get through the first epoch, but then the error happens in the validation stage.

#9

Can you confirm what happens when you train using the CPU, please?

#10

Hi there, I got the same error as @bgao: when using the GPU, the model could be trained for an epoch or so, and then I got Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED. This error appears repeatedly when training the model on GPU, but at different times (epoch/batch), and even from different lines. The same model has been trained successfully on CPU.

I tried both conda and a Docker container as the environment. Here is the configuration:

FROM nvidia/cuda:9.2-cudnn7-devel-ubuntu16.04
RUN pip install mxnet-cu92mkl

I also tried different batch sizes and learning rates, suspecting that there might be numerical issues when the learning rate is large. However, no luck even with a very moderate learning rate.

Here are the error messages. Any clues @thomelane ? Thanks!

Traceback (most recent call last):
  File "/workdir/code/src/project_main.py", line 144, in <module>
    main(args)
  File "/workdir/code/src/project_main.py", line 128, in main
    do_offline_evaluation=args.do_offline_evaluation)
  File "/workdir/code/src/project/estimator/train_pred_eval.py", line 123, in train
    step_loss = loss.asscalar()
  File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", line 1998, in asscalar
    return self.asnumpy()[0]
  File "/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py", line 1980, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:09:37] src/operator/./cudnn_rnn-inl.h:710: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3d9c92) [0x7f15a0ac2c92]
[bt] (1) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3da268) [0x7f15a0ac3268]
[bt] (2) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x5c50cb4) [0x7f15a6339cb4]
[bt] (3) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x5c52af6) [0x7f15a633baf6]
[bt] (4) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x33f2924) [0x7f15a3adb924]
[bt] (5) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x361) [0x7f15a38b8791]
[bt] (6) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushOperator(mxnet::OpStatePtr const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, mxnet::DispatchMode)::{lambda(mxnet::RunContext)#4}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x26) [0x7f15a38b8de6]
[bt] (7) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x312b223) [0x7f15a3814223]
[bt] (8) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x3132f64) [0x7f15a381bf64]
[bt] (9) /usr/local/lib/python3.6/dist-packages/mxnet/libmxnet.so(+0x313702b) [0x7f15a382002b]

#11

I can’t see any clues in there, but try setting the following environment variable before running your code.

MXNET_ENGINE_TYPE=NaiveEngine

You see the error from loss.asscalar() right now, but that’s only because it’s a blocking operation; the issue could be anywhere in your network. When you set this environment variable, operations are performed serially rather than in parallel, so there’s a better chance you’ll see where the true error is. It’s also very slow because of this, so only use it for debugging.
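
For example, a minimal sketch of setting it from Python (put it at the very top of your script, before mxnet is imported, so the engine type takes effect):

import os

# set before importing mxnet so the NaiveEngine is picked up
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx

Alternatively, export MXNET_ENGINE_TYPE=NaiveEngine in your shell before launching the script, and remember to unset it again once you’re done debugging.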