Fatal error occurred in asynchronous engine operation?

cpp
code-error
#1

Every time I run my code, this error occurs:

But when I set MXNET_ENGINE_TYPE as instructed and debug the code in PyCharm, it runs without any error. Has anyone run into this problem before? How can I figure out which op is causing this error message?
Thank you for reading, and any suggestions are welcome!
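
For reference, the engine type is selected through an environment variable, which should be set before MXNet is imported. This is a minimal sketch, assuming the value suggested by the error message is `NaiveEngine`; the post does not state which value was actually used.

```python
import os

# Assumption: "NaiveEngine" is the value suggested by the error message.
# It makes MXNet run operators synchronously, so a failure should surface
# at the call that caused it rather than at a later synchronization point.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

# Import MXNet only after the variable is set, for example:
# import mxnet as mx

print(os.environ["MXNET_ENGINE_TYPE"])
```

Because the synchronous engine removes concurrency inside the engine, a bug that depends on timing may no longer trigger, which would be consistent with the code running cleanly in this mode.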

#2

Can you give some more information? For example: do you have a minimal reproducible example? How do you run MXNet? Which MXNet version are you using?
Such an error can occur, for instance, when an MXNet model is called by multiple threads at the same time.
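
If concurrent calls turn out to be the cause, one possible workaround is to serialize access to the model with a lock. The sketch below is only an illustration: `LockedModel` and `toy_model` are hypothetical names, and `toy_model` stands in for a real network.

```python
import threading


class LockedModel:
    """Hypothetical wrapper that lets only one thread call the model at a time."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Only the thread holding the lock may run the forward pass.
        with self._lock:
            return self._model(*args, **kwargs)


def toy_model(x):
    # Stand-in for a real network's forward pass.
    return x * 2


locked = LockedModel(toy_model)
results = [None] * 4


def worker(i):
    results[i] = locked(i)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Serializing the calls gives up the parallel speed-up, so this is more useful as a diagnostic than as a permanent fix.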

#3

I added an im2col operation to the latest version (1.3.1) and compiled it from source. The code I am running is basically the gluon-cv training script, which internally calls parallel_apply. I am trying to train a model that uses the im2col operation, and that is when this error occurs. I tried running a model without the customized im2col op and everything went well. Then I tested the im2col op using the symbol API's module.fit, and it looked fine as well. So I am a bit confused now… Does that mean a customized op does not support being called in parallel?
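
One way to check this in isolation, without the full gluon-cv training loop, is to call the op from several threads and collect any exceptions. This is a minimal sketch: `run_in_threads` and `fake_op` are hypothetical names, and `fake_op` stands in for a forward pass through the custom im2col op.

```python
import threading


def run_in_threads(fn, inputs):
    """Call fn(x) for each input in its own thread; return outputs and errors."""
    outputs = [None] * len(inputs)
    errors = []
    errors_lock = threading.Lock()

    def worker(i, x):
        try:
            outputs[i] = fn(x)
        except Exception as exc:
            # Record which input failed instead of losing the exception.
            with errors_lock:
                errors.append((i, exc))

    threads = [
        threading.Thread(target=worker, args=(i, x))
        for i, x in enumerate(inputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs, errors


def fake_op(x):
    # Placeholder: replace with a call to the real operator. With MXNet, the
    # result would also need to be waited on inside this function (for
    # example by converting it to NumPy) so that asynchronous errors
    # surface in the thread that caused them.
    return x + 1


outputs, errors = run_in_threads(fake_op, [1, 2, 3])
print(outputs, errors)
```

If the real op fails here but passes when called from a single thread, that would point to a thread-safety problem in the op or in how it was registered.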

#4

Did you write your own operator, or did you use this one: https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/im2col.h? Can you share a small example so that I can reproduce the problem?

#5

Yeah, I use the implementation in im2col.h and wrote the registration so that I can call it from Python. The modified MXNet is im2col_op. I may not be able to provide you with a ‘small’ example: the error only occurs when I use the aforementioned parallel_apply function, so you would still need to clone gluon-cv, add an im2col op anywhere in an existing model structure, and then train it. The error will show up then.