CUDNN_STATUS_BAD_PARAM using Conv2D

I wrote a program using MXNet, which ran fine on CPU but started throwing the following errors on GPU:

File "/home/ubuntu/.local/lib/python3.5/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/ubuntu/.local/lib/python3.5/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:42:11] src/operator/./cudnn_convolution-inl.h:392: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x272c4c) [0x7ff356fbac4c]
[bt] (1) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x360d2d2) [0x7ff35a3552d2]
[bt] (2) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x36019cd) [0x7ff35a3499cd]
[bt] (3) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2374fe4) [0x7ff3590bcfe4]
[bt] (4) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2300218) [0x7ff359048218]
[bt] (5) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x341955) [0x7ff357089955]
[bt] (6) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x21395c8) [0x7ff358e815c8]
[bt] (7) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x213dfd8) [0x7ff358e85fd8]
[bt] (8) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2094541) [0x7ff358ddc541]
[bt] (9) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x63) [0x7ff358ddc8e3]

Here is my output from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    74W / 149W |      0MiB / 11439MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I found a similar issue on GitHub: https://github.com/NVIDIA/DIGITS/issues/258

I’m using Python 3.5, CUDA 8.0, and MXNet 0.12.0 on Ubuntu 16.04 LTS. Could anyone point out how to get around this error? Thanks.

With no processes running on the GPU, the GPU utilization is 98%, which seems weird. Could you reset the GPU and try again?

I have reset the GPU and rerun the program, but I get the same error:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   57C    P0    73W / 149W |    147MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2774      C   python3                                      136MiB |
+-----------------------------------------------------------------------------+

I’m running this on an AWS EC2 P2 instance and have also tried rebooting, but that doesn’t seem to solve the issue. Does this have to do with a specific version of CUDA (I’m using 8.0)?

Can you share any minimal code that reproduces the error?

I was able to replicate the error using the following code:

from mxnet import gluon


class CustomBlock(gluon.Block):
    def __init__(self):
        super(CustomBlock, self).__init__()

        with self.name_scope():
            # base convolution
            self.base_conv = gluon.nn.Conv2D(channels=64, kernel_size=(1, 297), strides=(1, -1), activation='relu')

            # output layer
            self.out = gluon.nn.Dense(num_class)

    def forward(self, x):
        # base convolution
        x = self.base_conv(x)

        # output layer
        x = self.out(x)
        return x

The input x has the shape (batch_size, 1, 250, 297). The code runs fine on both GPU and CPU if the base convolution, x = self.base_conv(x), is skipped (i.e., only the output layer is used), but it starts throwing the error as soon as the convolution layer is included. Any feedback will be appreciated. Thanks!
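
For reference, this is roughly how I invoke the block; batch_size and num_class below are placeholder values for illustration only, the real ones come from my dataset:

import mxnet as mx

batch_size, num_class = 32, 10   # placeholder values for illustration only

ctx = mx.gpu(0)
net = CustomBlock()
net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)

# dummy input with the same shape as my data: (batch_size, 1, 250, 297)
x = mx.nd.ones((batch_size, 1, 250, 297), ctx=ctx)
out = net(x)   # the CUDNN_STATUS_BAD_PARAM error is raised here on GPU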

Stride -1 is not allowed for convolution; every stride value must be a positive integer. You can use Conv1D in this case.
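
For example, something like this rough sketch (assuming the 297-wide axis is exactly what your (1, 297) kernel consumes in one step, so it can be folded into the channel dimension of an NCW input):

from mxnet import gluon

# Rough sketch: fold the 297-wide axis into channels and convolve along the
# remaining 250-long axis (Conv1D expects NCW layout).
conv1d = gluon.nn.Conv1D(channels=64, kernel_size=1, activation='relu')

def apply_conv1d(x):
    # x: (batch, 1, 250, 297) -> (batch, 297, 250)
    x = x.reshape((-1, 250, 297)).swapaxes(1, 2)
    return conv1d(x)   # -> (batch, 64, 250)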

That did the trick; after I replaced -1 with the actual dimension, the Conv2D layer ran just fine. Thank you very much.
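
For anyone hitting the same error, the working layer now looks roughly like this (assuming the stride along the width should equal the 297-wide kernel; since the kernel already spans the full 297-wide input, strides=(1, 1) gives the same output shape):

from mxnet import gluon

# Every stride value must be a positive integer; cuDNN rejects -1 with
# CUDNN_STATUS_BAD_PARAM. "The actual dimension" here is the kernel width, 297.
base_conv = gluon.nn.Conv2D(channels=64, kernel_size=(1, 297),
                            strides=(1, 297), activation='relu')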