CUDA: Unspecified launch failure


#1

During training, my program suddenly failed, raising the following error:

INFO:root:Epoch[18] Batch [17568]	Speed: 50177.37 samples/sec	SumMetric=706.347520
INFO:root:Epoch[18] Batch [18056]	Speed: 52638.94 samples/sec	SumMetric=704.075347
INFO:root:Epoch[18] Batch [18544]	Speed: 52356.88 samples/sec	SumMetric=709.324801
[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x29789c8) [0x7f2ba366b9c8]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x295abe6) [0x7f2ba364dbe6]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x170be60) [0x7f2ba23fee60]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x330c23) [0x7f2ba1023c23]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d763) [0x7f2ba2270763]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d966) [0x7f2ba2270966]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]

[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x173135e) [0x7f2ba242435e]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x17146ec) [0x7f2ba24076ec]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x330c23) [0x7f2ba1023c23]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d763) [0x7f2ba2270763]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d966) [0x7f2ba2270966]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]

[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1595218) [0x7f2ba2288218]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] src/engine/./threaded_engine.h:347: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1595218) [0x7f2ba2288218]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1577854) [0x7f2ba226a854]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [09:51:00] src/engine/./threaded_engine.h:347: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1595218) [0x7f2ba2288218]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1577854) [0x7f2ba226a854]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]
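
Following the engine's own suggestion in the message above, the next step would be to rerun with the synchronous NaiveEngine so the backtrace points at the real failing call. A minimal sketch of how to do that from Python (assuming the variable is picked up before the engine is first created):

import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'  # force all operations to be synchronous

import mxnet as mx  # must come after setting the variable
# ... run the failing training loop from here, e.g. under gdb: gdb --args python train.py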

Related: Strange issue when using HybridBlock in a symbolic program
#2

Have you figured out what happened here?


#3

The current conjecture is that it was the transpose operator not being registered. But regardless, shouldn't there be a clearer error message? I haven't seen this error since moving to mxnet 0.12.0.


#4

@madjam This occurred again with mxnet 0.11.1; I think that somewhat confirms what the issue was.


#5

Actually, I managed to narrow down the issue.

If I run:

pip3 list
mxnet (0.12.0, /home/ubuntu/dev-repos/mxnet/python)
mxnet-cu80 (0.11.0)

I think I'm getting this error because when I install from source (using pip), the existing mxnet-cu80 install isn't upgraded. Maybe that's it? I still get this error periodically. @madjam @leopd @reminisce
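
For what it's worth, a quick way to confirm which build Python actually loads (just standard module attributes, nothing specific to this setup):

import mxnet as mx
print(mx.__version__)  # which version wins: the source build or the old wheel
print(mx.__file__)     # the path shows whether it loads from the source tree or site-packages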


#6

Here’s the entire error:

[01:24:15] /home/ubuntu/dev-repos/mxnet/dmlc-core/include/dmlc/./logging.h:308: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow4CopyINS_3cpuENS_3gpuELi1EfEEvNS_6TensorIT_XT1_ET2_EENS3_IT0_XT1_ES5_EE14cudaMemcpyKindPNS_6StreamIS2_EE+0x1f8) [0x7f5d796877e8]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3gpuENS2_3cpuEEEvRKNS_5TBlobEPS5_NS_7ContextES9_NS_10RunContextE+0x2f4e) [0x7f5d79666f2e]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27f5051) [0x7f5d78335051]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x4b) [0x7f5d78184c2b]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (6) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (7) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]

[01:24:15] /home/ubuntu/dev-repos/mxnet/dmlc-core/include/dmlc/./logging.h:308: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 9 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow6StreamINS_3gpuEE4WaitEv+0xd8) [0x7f5d782cb418]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2c06bb0) [0x7f5d78746bb0]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786cacfb]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5d970673dd]

[01:24:15] /home/ubuntu/dev-repos/mxnet/dmlc-core/include/dmlc/./logging.h:308: [01:24:15] src/engine/./threaded_engine.h:370: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow4CopyINS_3cpuENS_3gpuELi1EfEEvNS_6TensorIT_XT1_ET2_EENS3_IT0_XT1_ES5_EE14cudaMemcpyKindPNS_6StreamIS2_EE+0x1f8) [0x7f5d796877e8]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3gpuENS2_3cpuEEEvRKNS_5TBlobEPS5_NS_7ContextES9_NS_10RunContextE+0x2f4e) [0x7f5d79666f2e]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27f5051) [0x7f5d78335051]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x4b) [0x7f5d78184c2b]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (6) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (7) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 7 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x332) [0x7f5d786c2ba2]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5d970673dd]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [01:24:15] src/engine/./threaded_engine.h:370: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow4CopyINS_3cpuENS_3gpuELi1EfEEvNS_6TensorIT_XT1_ET2_EENS3_IT0_XT1_ES5_EE14cudaMemcpyKindPNS_6StreamIS2_EE+0x1f8) [0x7f5d796877e8]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3gpuENS2_3cpuEEEvRKNS_5TBlobEPS5_NS_7ContextES9_NS_10RunContextE+0x2f4e) [0x7f5d79666f2e]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27f5051) [0x7f5d78335051]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x4b) [0x7f5d78184c2b]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (6) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (7) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 7 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x332) [0x7f5d786c2ba2]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5d970673dd]

#7

This is very likely due to a driver problem. You can test it with a minimal script and see if the GPU works well.
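
For example, a minimal sanity check along these lines (a sketch; any small GPU computation that forces a kernel launch and a device-to-host copy would do):

import mxnet as mx

a = mx.nd.ones((1000, 1000), ctx=mx.gpu(0))  # allocate on GPU 0
b = mx.nd.dot(a, a)                          # launch a kernel
b.wait_to_read()                             # block until the async engine has run it
print(b.asnumpy()[0, 0])                     # should print 1000.0 if the GPU path is healthy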


#8

I doubt it. Other experiments seem to run fine?


#9

Can you post a minimal script to reproduce the error?


#10

Not externally, I'm afraid. The error is "random", in the sense that I cannot deterministically reproduce it.


#11

@zhreshold Do you have a way to check this? It's still happening with Python 3 and MXNet 0.12.0.


#12

Same problem with CUDA 8.0, MXNet 1.0, and Windows Server 2016. I have tested train_cifar10.py and it works well, but when I use a complex model that includes mx.sym.{arange, slice, tile, take, concat, …}, it throws an unspecified launch failure. The strange thing is that it appears after a different number of iterations each run, so it looks random. It would be helpful if someone could suggest how to debug such errors. @zhreshold
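
Roughly, the failing graph mixes those operators like this (a made-up minimal sketch with hypothetical shapes, not the actual model):

import mxnet as mx

data = mx.sym.Variable('data')                         # assume shape (32, 100)
idx = mx.sym.arange(start=0, stop=50)                  # index vector
rows = mx.sym.take(data, idx)                          # gather 50 rows -> (50, 100)
part = mx.sym.slice(data, begin=(0, 0), end=(32, 50))  # (32, 50)
tiled = mx.sym.tile(part, reps=(1, 2))                 # back to (32, 100)
out = mx.sym.concat(rows, tiled, dim=0)                # (82, 100)

exe = out.simple_bind(ctx=mx.gpu(0), data=(32, 100))
for _ in range(1000):  # the failure only shows up after many iterations
    exe.forward(data=mx.nd.random.uniform(shape=(32, 100), ctx=mx.gpu(0)))
    exe.outputs[0].wait_to_read()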


#13

I have the same problem with mxnet 1.1.
train_imagenet.py from the examples works with mxnet 1.0 for resnet50 and resnext50, and with mxnet 1.1 for resnet50, but not for resnext50.
(Windows 10 x64 1709, GTX 1070, driver 390.65, CUDA 8.0; later CUDA 9.1 with the same result)

d:\work\cpp\mxnet\example\image-classification>python train_imagenet.py --model-prefix=netC/resnext50 --network=resnext --num-classes=12 --num-examples=15048 --gpus=0 --batch-size=32 --num-epochs=100 --data-train=d:/work/cpp/mnist/new/cpeople_train.rec --data-train-idx=d:/work/cpp/mnist/new/cpeople_train.idx --data-val=d:/work/cpp/mnist/new/cpeople_val.rec --data-val-idx=d:/work/cpp/mnist/new/cpeople_val.idx
INFO:root:start with arguments Namespace(batch_size=32, benchmark=0, data_nthreads=4, data_train='d:/work/cpp/mnist/new/cpeople_train.rec', data_train_idx='d:/work/cpp/mnist/new/cpeople_train.idx', data_val='d:/work/cpp/mnist/new/cpeople_val.rec', data_val_idx='d:/work/cpp/mnist/new/cpeople_val.idx', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,224,224', initializer='default', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', macrobatch_size=0, max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix='netC/resnext50', mom=0.9, monitor=0, network='resnext', num_classes=12, num_epochs=100, num_examples=15048, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
[13:05:46] D:\work\cpp\mxnet\src\io\iter_image_recordio_2.cc:170: ImageRecordIOParser2: d:/work/cpp/mnist/new/cpeople_train.rec, use 1 threads for decoding..
[13:05:49] D:\work\cpp\mxnet\src\io\iter_image_recordio_2.cc:170: ImageRecordIOParser2: d:/work/cpp/mnist/new/cpeople_val.rec, use 1 threads for decoding..
[13:05:52] d:\work\cpp\mxnet\src\operator\nn\cudnn\./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20]   Speed: 66.71 samples/sec        accuracy=0.180060
...
INFO:root:Epoch[1] Batch [320]  Speed: 66.28 samples/sec        accuracy=0.401562
Traceback (most recent call last):
  File "train_imagenet.py", line 58, in <module>
    fit.fit(args, sym, data.get_rec_iter)
  File "d:\work\cpp\mxnet\example\image-classification\common\fit.py", line 285, in fit
    monitor=monitor)
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\module\base_module.py", line 496, in fit
    self.update_metric(eval_metric, data_batch.label)
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\module\module.py", line 749, in update_metric
    self._exec_group.update_metric(eval_metric, labels)
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\module\executor_group.py", line 616, in update_metric
    eval_metric.update_dict(labels_, preds)
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\metric.py", line 280, in update_dict
    metric.update_dict(labels, preds)
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\metric.py", line 108, in update_dict
    self.update(label, pred)
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\metric.py", line 394, in update
    pred_label = pred_label.asnumpy().astype('int32')
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\ndarray\ndarray.py", line 1801, in asnumpy
    ctypes.c_size_t(data.size)))
  File "C:\Anaconda3\lib\site-packages\mxnet-1.1.0-py3.6.egg\mxnet\base.py", line 148, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [13:12:40] d:\work\cpp\mxnet\mshadow\mshadow\./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unknown error

#14

@zhreshold This seems to be a prevalent issue. Any clues?


#15

I have this problem with versions 1.0.1 and 1.1.0, but 1.0.0 is OK, no matter whether I use CUDA 8 or 9.1.
Maybe mshadow is to blame? I have downgraded my mxnet to 1.0.0 for now.


#16

Maybe there is more than one process occupying CUDA resources? For any unspecified or unknown CUDA error, there's not much we can help diagnose without logs and reproducible code.


#17

How exactly do we get you a log? If you ping me, I can give you the code that triggers the error.

Dhruv