CUDNN_STATUS_BAD_PARAM using Conv2D


#1

I wrote a program using MXNet, which ran fine on CPU but started throwing the following error on GPU:

File "/home/ubuntu/.local/lib/python3.5/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/ubuntu/.local/lib/python3.5/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:42:11] src/operator/./cudnn_convolution-inl.h:392: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x272c4c) [0x7ff356fbac4c]
[bt] (1) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x360d2d2) [0x7ff35a3552d2]
[bt] (2) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x36019cd) [0x7ff35a3499cd]
[bt] (3) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2374fe4) [0x7ff3590bcfe4]
[bt] (4) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2300218) [0x7ff359048218]
[bt] (5) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x341955) [0x7ff357089955]
[bt] (6) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x21395c8) [0x7ff358e815c8]
[bt] (7) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x213dfd8) [0x7ff358e85fd8]
[bt] (8) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2094541) [0x7ff358ddc541]
[bt] (9) /home/ubuntu/.local/lib/python3.5/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x63) [0x7ff358ddc8e3]

Here is my output from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   45C    P0    74W / 149W |      0MiB / 11439MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I found a similar issue on GitHub: https://github.com/NVIDIA/DIGITS/issues/258

I’m using Python 3.5, CUDA 8.0, and MXNet 0.12.0 on Ubuntu 16.04 LTS. Could anyone point out how to get around this error? Thanks.


#2

With no processes running on the GPU, the GPU utilization is 98%, which seems weird. Can you reset the GPU and try again?


#3

I have reset the GPU and rerun, but I get the same error:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   57C    P0    73W / 149W |    147MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2774      C   python3                                      136MiB |
+-----------------------------------------------------------------------------+

I’m running this on an AWS EC2 P2 instance and have also tried rebooting, but that doesn’t seem to solve the issue. Could this have to do with a certain version of CUDA (I’m using 8.0)?


#4

Can you share any minimal reproducible code?


#5

I was able to replicate the error using the following code:

from mxnet import gluon


class CustomBlock(gluon.Block):
    def __init__(self, num_class):
        super(CustomBlock, self).__init__()

        with self.name_scope():
            # base convolution
            self.base_conv = gluon.nn.Conv2D(channels=64, kernel_size=(1, 297), strides=(1, -1), activation='relu')

            # output layer
            self.out = gluon.nn.Dense(num_class)

    def forward(self, x):
        # base convolution
        x = self.base_conv(x)

        # output layer
        x = self.out(x)
        return x

The input x has dimensions (batch_size, 1, 250, 297). The code works fine on both GPU and CPU if the base convolution, x = self.base_conv(x), is skipped (i.e., only the output layer is used), but it starts throwing the error as soon as both the convolution and output layers are used. Any feedback would be appreciated. Thanks!
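
For reference, a minimal driver along these lines triggers the error on my end (num_class=10 and the batch size of 8 are just placeholder values I picked for the test):

import mxnet as mx

num_class = 10                               # placeholder number of classes
ctx = mx.gpu()

net = CustomBlock(num_class)
net.initialize(ctx=ctx)

x = mx.nd.ones((8, 1, 250, 297), ctx=ctx)    # dummy input of shape (batch_size, 1, 250, 297)
out = net(x)                                 # fails here with CUDNN_STATUS_BAD_PARAM on GPU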


#6

A stride of -1 is not allowed for convolution, so you need to pass the stride explicitly. Since your kernel already spans the full input width, you could also use Conv1D in this case.
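
For example, something along these lines should be accepted by cuDNN (strides=(1, 1) is just one possible choice here; because the kernel covers the whole 297-wide axis, the output width is 1 regardless):

import mxnet as mx
from mxnet import gluon

# same layer, but with an explicit stride instead of -1
conv = gluon.nn.Conv2D(channels=64, kernel_size=(1, 297), strides=(1, 1), activation='relu')
conv.initialize(ctx=mx.gpu())

x = mx.nd.ones((8, 1, 250, 297), ctx=mx.gpu())
print(conv(x).shape)                         # (8, 64, 250, 1)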


#7

That did the trick; after I replaced -1 with the actual dimension, the Conv2D ran just fine. Thank you very much.