Multi-Threaded Inference Question

Hi,

My understanding is that MXNet supports running multiple concurrent inference calls, as long as you use the NaiveEngine and give each thread its own executor object, bound/initialized on the thread that will use it.
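
For concreteness, this is how I select the engine (a minimal sketch; select_naive_engine is just a hypothetical helper, and I’m assuming that calling setenv before the first MXNet call is early enough, since the engine appears to be created lazily on first use):

#include <cstdlib>

// Called at the very top of main(), before any MXNet API call; equivalent
// to launching the process with MXNET_ENGINE_TYPE=NaiveEngine set in the
// environment.
static void select_naive_engine() {
  setenv("MXNET_ENGINE_TYPE", "NaiveEngine", /*overwrite=*/1);
}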

I’m basing that on the MXNet 1.4 release notes: https://github.com/apache/incubator-mxnet/releases/tag/1.4.0

And on the conversation around the PR that added support for it: https://github.com/apache/incubator-mxnet/pull/12456

However, I am having trouble getting it to work on the GPU: everything works fine on the CPU, but I get segfaults and cuDNN/CUDA errors when I run the same code on the GPU.

So my basic question is: has anyone been able to use MXPredCreateMultiThread on a GPU? And if so, a follow-up: any ideas why I’m having problems when using the regular (non-Predict) C API?
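
For reference, the Predict-API path I’m asking about looks roughly like this (a sketch only: error checking is omitted, the input name and shape are placeholders for my actual model, and create_predictor is a hypothetical helper):

#include <mxnet/c_predict_api.h>

// symbol_json / param_bytes / param_size are assumed to have been read
// from the model's .json and .params files beforehand.
PredictorHandle create_predictor(const char* symbol_json,
                                 const void* param_bytes, int param_size) {
  const char* input_keys[] = {"data"};            // placeholder input name
  const mx_uint shape_indptr[] = {0, 4};
  const mx_uint shape_data[] = {1, 3, 224, 224};  // placeholder NCHW shape
  PredictorHandle pred = nullptr;
  MXPredCreateMultiThread(symbol_json, param_bytes, param_size,
                          /*dev_type=*/2, /*dev_id=*/0,  // dev_type 2 = GPU
                          /*num_input_nodes=*/1, input_keys,
                          shape_indptr, shape_data,
                          /*num_threads=*/4, &pred);
  return pred;
}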

The high-level view of my code with the regular C API is this:

void thread_worker() {
  // Call MXSymbolCreateFromFile + MXNDArrayLoad + MXExecutorBindEX
  // to initialize an executor owned by this thread

  for (a bunch of times) {
    // Call MXExecutorForward() on some input
  }
}

int main() {
  // start a few threads executing thread_worker()
  // wait for them to finish
  return 0;
}
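
Fleshed out a bit, the worker looks roughly like the following (a heavily abbreviated sketch: all error checking, shape handling, and the matching of loaded parameters to argument names are elided, and the file names are placeholders):

#include <mxnet/c_api.h>
#include <vector>

void thread_worker() {
  SymbolHandle sym = nullptr;
  MXSymbolCreateFromFile("model.json", &sym);          // placeholder path

  mx_uint num_params = 0, num_names = 0;
  NDArrayHandle* params = nullptr;
  const char** param_names = nullptr;
  MXNDArrayLoad("model.params", &num_params, &params,  // placeholder path
                &num_names, &param_names);

  // in_args = input NDArray(s) + weights in the symbol's argument order;
  // building it from `params` plus a freshly created input array is elided.
  std::vector<NDArrayHandle> in_args;
  std::vector<NDArrayHandle> arg_grads(in_args.size(), nullptr);
  std::vector<mx_uint> grad_req(in_args.size(), 0);    // 0 = no gradients

  ExecutorHandle exec = nullptr;
  MXExecutorBindEX(sym, /*dev_type=*/2, /*dev_id=*/0,  // dev_type 2 = GPU
                   /*num_map_keys=*/0, nullptr, nullptr, nullptr,
                   in_args.size(), in_args.data(), arg_grads.data(),
                   grad_req.data(), /*aux_states_len=*/0, nullptr,
                   /*shared_exec=*/nullptr, &exec);

  for (int i = 0; i < 1000; ++i) {
    // copy the next input batch into the input NDArray (elided)
    MXExecutorForward(exec, /*is_train=*/0);
    // read results via MXExecutorOutputs + MXNDArraySyncCopyToCPU (elided)
  }
}

The point is that the symbol, the parameters, and the executor are all created on the same thread that later calls MXExecutorForward().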

As mentioned, everything works fine on a CPU context, but when I use a GPU context I get errors like these:
1: a plain segfault with no description
2: a complaint from some operator, e.g.:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Error in executor forward() function: [18:13:20] .../src/src/operator/./cudnn_rnn-inl.h:345: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED

Stack trace returned 10 entries:
[bt] (0) .../test_parallel_inference(dmlc::StackTrace()+0x3d) [0x41381d]
[bt] (1) .../test_parallel_inference(dmlc::LogMessageFatal::~LogMessageFatal()+0x1a) [0x413b1a]
[bt] (2) ...libmxnet.so(mxnet::op::CuDNNRNNOp<float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0xe63) [0x7f03010b8103]
[bt] (3) ...libmxnet.so(mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x33b) [0x7f03022b15eb]
[bt] (4) ...libmxnet.so(mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool)+0x6b) [0x7f03020aa37b]
[bt] (5) ...libmxnet.so(+0x2520b98) [0x7f03020b1b98]
[bt] (6) ...libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::NaiveEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext, mxnet::engine::CallbackOnComplete)+0x220) [0x7f030208a4d0]
[bt] (7) ...libmxnet.so(mxnet::engine::NaiveEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x146) [0x7f0302088746]
[bt] (8) ...libmxnet.so(mxnet::engine::NaiveEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0xc4) [0x7f0302089c94]
[bt] (9) ...libmxnet.so(mxnet::exec::GraphExecutor::RunOps(bool, unsigned long, unsigned long)+0x343) [0x7f03020b3693]

Judging from the stack trace, the failing check is inside CuDNNRNNOp<float>::Forward, so it looks related to the cuDNN RNN operator. I’m happy to put together a minimal reproducible case and post it to GitHub Issues, but first I wanted to confirm: should this work, or is multi-threaded GPU inference known to be unsupported?

Thanks,
Stephen

@skm @nswamy @zheng-da can you help with this question? Thanks!