CUDA: Unspecified launch failure


#1

During training, my program suddenly failed with the following error:

INFO:root:Epoch[18] Batch [17568]	Speed: 50177.37 samples/sec	SumMetric=706.347520
INFO:root:Epoch[18] Batch [18056]	Speed: 52638.94 samples/sec	SumMetric=704.075347
INFO:root:Epoch[18] Batch [18544]	Speed: 52356.88 samples/sec	SumMetric=709.324801
[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x29789c8) [0x7f2ba366b9c8]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x295abe6) [0x7f2ba364dbe6]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x170be60) [0x7f2ba23fee60]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x330c23) [0x7f2ba1023c23]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d763) [0x7f2ba2270763]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d966) [0x7f2ba2270966]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]

[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x173135e) [0x7f2ba242435e]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x17146ec) [0x7f2ba24076ec]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x330c23) [0x7f2ba1023c23]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d763) [0x7f2ba2270763]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157d966) [0x7f2ba2270966]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]

[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1595218) [0x7f2ba2288218]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

[09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [09:51:00] src/engine/./threaded_engine.h:347: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1595218) [0x7f2ba2288218]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1577854) [0x7f2ba226a854]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [09:51:00] src/engine/./threaded_engine.h:347: [09:51:00] /home/travis/build/dmlc/mxnet-distro/mxnet-build/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15aa688) [0x7f2ba229d688]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1595218) [0x7f2ba2288218]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x15775ad) [0x7f2ba226a5ad]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (7) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (9) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23e38c) [0x7f2ba0f3138c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1577854) [0x7f2ba226a854]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b6e3) [0x7f2ba226e6e3]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x157b8e6) [0x7f2ba226e8e6]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1578b4b) [0x7f2ba226bb4b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f2c53c5ec80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f2c58c506ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2c589863dd]
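
The hint printed in the log above (set MXNET_ENGINE_TYPE to NaiveEngine, then clear it again after debugging) could be applied from Python roughly as follows. This is only a sketch: the helper names `enable_naive_engine`, `restore_engine` and `run_with_naive_engine` are made up for illustration, and it assumes the variable has to be set before `import mxnet` runs so that the engine picks it up when the library loads.

```python
import os

ENGINE_VAR = "MXNET_ENGINE_TYPE"  # variable name taken from the log message


def enable_naive_engine(env=None):
    """Set the engine type to NaiveEngine and return the previous value (or None)."""
    env = os.environ if env is None else env
    previous = env.get(ENGINE_VAR)
    env[ENGINE_VAR] = "NaiveEngine"
    return previous


def restore_engine(previous, env=None):
    """Put the variable back the way it was before debugging."""
    env = os.environ if env is None else env
    if previous is None:
        env.pop(ENGINE_VAR, None)
    else:
        env[ENGINE_VAR] = previous


def run_with_naive_engine(train_fn, env=None):
    """Call train_fn with NaiveEngine selected, then restore the old setting.

    Assumption: train_fn imports mxnet itself, so the import happens only
    after the variable has been set.
    """
    previous = enable_naive_engine(env)
    try:
        return train_fn()
    finally:
        restore_engine(previous, env)
```

With the synchronous engine the run will be much slower, but the backtrace should point at the operator that actually failed instead of at a later copy or stream wait.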

#2

Have you figured out what happened here?


#3

The current conjecture is that the transpose operator was not registered. Regardless, shouldn’t there be a clearer debug message? I haven’t seen this error since moving to MXNet 0.12.0.


#4

@madjam This occurred again with MXNet 0.11.1, which I think somewhat confirms what the issue was.


#5

Actually, I managed to narrow down the issue.

If I run

pip3 list
mxnet (0.12.0, /home/ubuntu/dev-repos/mxnet/python)
mxnet-cu80 (0.11.0)

I think I’m getting this error because when I install from source (using pip), the existing mxnet-cu80 install isn’t upgraded. Maybe that’s it? I still get this error periodically. @madjam @leopd @reminisce
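
For anyone who wants to check for the same situation without reading through `pip3 list`, here is a rough sketch that lists every installed distribution whose name starts with `mxnet`. It uses the standard-library `importlib.metadata` module (Python 3.8+). The function names are hypothetical, and treating "more than one mxnet-* package" as a conflict is my assumption, based on both packages providing the same `mxnet` import.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_mxnet_packages():
    """Return sorted (name, version) pairs for installed mxnet* distributions."""
    found = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name and name.lower().startswith("mxnet"):
            found.add((name, dist.version))
    return sorted(found)


def has_conflicting_installs(packages):
    """True when more than one distinct mxnet* distribution is present."""
    return len({name.lower() for name, _ in packages}) > 1


def report():
    """Print what is installed and flag a possible clash."""
    packages = installed_mxnet_packages()
    for name, version in packages:
        print("%s (%s)" % (name, version))
    if has_conflicting_installs(packages):
        print("More than one mxnet package is installed; they may shadow each other.")
```

Calling `report()` on the setup above would be expected to show both `mxnet` and `mxnet-cu80`. After `import mxnet`, printing `mxnet.__file__` shows which of the two copies Python actually loads.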


#6

Here’s the entire error:

[01:24:15] /home/ubuntu/dev-repos/mxnet/dmlc-core/include/dmlc/./logging.h:308: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow4CopyINS_3cpuENS_3gpuELi1EfEEvNS_6TensorIT_XT1_ET2_EENS3_IT0_XT1_ES5_EE14cudaMemcpyKindPNS_6StreamIS2_EE+0x1f8) [0x7f5d796877e8]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3gpuENS2_3cpuEEEvRKNS_5TBlobEPS5_NS_7ContextES9_NS_10RunContextE+0x2f4e) [0x7f5d79666f2e]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27f5051) [0x7f5d78335051]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x4b) [0x7f5d78184c2b]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (6) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (7) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]

[01:24:15] /home/ubuntu/dev-repos/mxnet/dmlc-core/include/dmlc/./logging.h:308: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 9 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow6StreamINS_3gpuEE4WaitEv+0xd8) [0x7f5d782cb418]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x2c06bb0) [0x7f5d78746bb0]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE1_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786cacfb]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5d970673dd]

[01:24:15] /home/ubuntu/dev-repos/mxnet/dmlc-core/include/dmlc/./logging.h:308: [01:24:15] src/engine/./threaded_engine.h:370: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow4CopyINS_3cpuENS_3gpuELi1EfEEvNS_6TensorIT_XT1_ET2_EENS3_IT0_XT1_ES5_EE14cudaMemcpyKindPNS_6StreamIS2_EE+0x1f8) [0x7f5d796877e8]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3gpuENS2_3cpuEEEvRKNS_5TBlobEPS5_NS_7ContextES9_NS_10RunContextE+0x2f4e) [0x7f5d79666f2e]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27f5051) [0x7f5d78335051]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x4b) [0x7f5d78184c2b]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (6) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (7) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 7 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x332) [0x7f5d786c2ba2]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5d970673dd]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [01:24:15] src/engine/./threaded_engine.h:370: [01:24:15] /home/ubuntu/dev-repos/mxnet/mshadow/mshadow/./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN7mshadow4CopyINS_3cpuENS_3gpuELi1EfEEvNS_6TensorIT_XT1_ET2_EENS3_IT0_XT1_ES5_EE14cudaMemcpyKindPNS_6StreamIS2_EE+0x1f8) [0x7f5d796877e8]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3gpuENS2_3cpuEEEvRKNS_5TBlobEPS5_NS_7ContextES9_NS_10RunContextE+0x2f4e) [0x7f5d79666f2e]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27f5051) [0x7f5d78335051]
[bt] (4) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataOS1_OS3_+0x4b) [0x7f5d78184c2b]
[bt] (5) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x93) [0x7f5d786c2903]
[bt] (6) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (7) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 7 entries:
[bt] (0) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5d7626fb5c]
[bt] (1) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine15ExecuteOprBlockENS_10RunContextEPNS0_8OprBlockE+0x332) [0x7f5d786c2ba2]
[bt] (2) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEZZNS2_23ThreadedEnginePerDevice13PushToExecuteEPNS2_8OprBlockEbENKUlvE0_clEvEUlS5_E_E9_M_invokeERKSt9_Any_dataOS5_+0x13b) [0x7f5d786ca89b]
[bt] (3) /home/ubuntu/dev-repos/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt8functionIFvSt10shared_ptrIN5mxnet6engine10ThreadPool11SimpleEventEEEES8_EEE6_M_runEv+0x4a) [0x7f5d786c4d0a]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f5d91a3cc80]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f5d973316ba]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5d970673dd]

#7

This is very likely due to a driver problem. You can test it with a minimal script and check whether the GPU works correctly.
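
Something along these lines could serve as that minimal script. It is only a sketch: `gpu_smoke_test` is a made-up name, the matrix size is arbitrary, and passing it does not rule out an intermittent fault. It allocates a matrix of ones on the GPU, multiplies it by itself and copies the result back to the host.

```python
def gpu_smoke_test(device_id=0, size=256):
    """Run one small matrix product on the GPU and copy the result back.

    Returns (ok, message). Nothing here is specific to the failing model.
    """
    try:
        import mxnet as mx
    except Exception as err:  # missing package or missing CUDA libraries
        return False, "mxnet is not importable: %s" % err

    try:
        ctx = mx.gpu(device_id)
        a = mx.nd.ones((size, size), ctx=ctx)
        b = mx.nd.dot(a, a)
        # asnumpy() blocks until the GPU work has finished, so errors from
        # the asynchronous engine should surface on this line.
        result = b.asnumpy()
    except Exception as err:
        return False, "GPU test failed: %s" % err

    # ones x ones gives `size` in every entry.
    if not (result == float(size)).all():
        return False, "GPU returned unexpected values"
    return True, "GPU %d computed a %dx%d product correctly" % (device_id, size, size)
```

Calling it in a loop gives a slightly better chance of catching an intermittent fault. Note that a real launch failure may abort the whole process instead of raising a Python exception, as the `terminate called` lines in the logs above show.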


#8

I doubt it. Other experiments seem to run fine.


#9

Can you post a minimal script to reproduce the error?


#10

Not externally, I’m afraid. The error is “random”, in the sense that I cannot reproduce it deterministically.


#11

@zhreshold Do you have a way to check this? It’s still happening with Python 3 and MXNet 0.12.0.