Hello,
I’m relatively new to MXNet, so apologies off the bat if I have missed something in the documentation or on the forums that I should know.
I am using the MXNet C API for inference, trying to use it for NLP-style tasks where the input shapes vary. The first approach I tried was: (1) bind an executor to a Symbol graph using the maximum input shapes I allow; (2) for each new input, call MXExecutorBindEX() or MXExecutorReshape() with the original executor as shared memory to create a local executor for the current input shape, use that local executor for the current input, and discard it when done.
The problem I faced with the above approach is too much latency overhead: I was seeing an extra 6.5 - 8.5 ms per call (depending on whether I used MXExecutorReshape() or MXExecutorBindEX()), compared to what I measure when I simply use a single fixed input size.
So the next approach I thought to try was to only allow a fixed number of different input sizes, and create an executor for each input size I support. I just need to pad all input I receive to the nearest allowable size it fits in, and then choose the appropriate executor for that shape. This way I only pay the cost of MXExecutorReshape() during initialization, as opposed to each time I want to process new data.
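The bucketing/padding step itself is straightforward. Here is a minimal sketch (plain C++, no MXNet calls; the allowed lengths and pad value are made up for illustration) of choosing the smallest allowed size an input fits in and padding up to it:

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical set of allowed sequence lengths, sorted ascending.
const std::vector<size_t> kAllowedLengths = {16, 32, 64, 128};

// Smallest allowed length that fits the input; throw if the input is too long.
size_t PickBucket(size_t input_len) {
  auto it = std::lower_bound(kAllowedLengths.begin(), kAllowedLengths.end(),
                             input_len);
  if (it == kAllowedLengths.end())
    throw std::length_error("input exceeds the maximum supported shape");
  return *it;
}

// Pad the input with a pad value up to the chosen bucket size.
std::vector<float> PadToBucket(std::vector<float> input, float pad_value) {
  input.resize(PickBucket(input.size()), pad_value);
  return input;
}
```

With one pre-reshaped executor per entry in kAllowedLengths, inference would then just index the executor table by PickBucket(input.size()).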
However, when I tried this approach it became clear that successive calls to MXExecutorReshape() invalidate the executors returned by previous calls, so only the reshaped executor from the very last MXExecutorReshape() call is valid to use.
After looking at the code, as far as I can tell (again, I’m new to MXNet and so might be missing something), this doesn’t seem to be a fundamental limitation of the GraphExecutor class. Rather, it happens because the MXNet C API uses a single thread-local vector to store the NDArray objects associated with an executor, so each new call to MXExecutorReshape() clears out the thread-local vector of NDArray objects that were shaped for the previous reshape call. See this code for what I am referring to:
(From c_api_executor.cc; many lines omitted for clarity)

int MXExecutorReshape(...) {
  MXAPIThreadLocalEntry *ret = MXAPIThreadLocalStore::Get();
  ret->ret_handles.clear();  // wipes out the handles from any previous reshape
  for (const auto& nd : in_arg_vec) {
    ret->ret_handles.push_back(new NDArray(nd));
  }
}
I think this behavior could be changed so that the thread-local storage is really a map from an executor handle to a vector of NDArray objects. Then MXExecutorReshape() wouldn’t have to wipe out anything; it would just set the thread-local “ret_handles” entry for the newly created executor object. All the other C API functions that use the thread-local “ret_handles” vector would then have to be modified to first look up the right vector based on the ExecutorHandle they are passed. For example:
MXAPIThreadLocalEntry *ret = MXAPIThreadLocalStore::Get();
ret->ret_handles.get(executor_handle)  <--- the vector<NDArray> to use for executor_handle
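To make the idea a bit more concrete, here is a rough sketch of what a per-executor thread-local store could look like. The types here are stand-ins, not the actual MXNet structs, and RecordHandles is a hypothetical helper just for illustration:

```cpp
#include <unordered_map>
#include <vector>

// Stand-ins for the real MXNet types, just for illustration.
using ExecutorHandle = void*;
struct NDArray { int dummy; };

struct MXAPIThreadLocalEntry {
  // Instead of one shared vector, keep one vector per executor handle.
  std::unordered_map<ExecutorHandle, std::vector<NDArray*>> ret_handles;
};

MXAPIThreadLocalEntry* ThreadLocalStoreGet() {
  thread_local MXAPIThreadLocalEntry entry;
  return &entry;
}

// A reshape call now only touches its own executor's slot.
void RecordHandles(ExecutorHandle exec, const std::vector<NDArray*>& nds) {
  auto* ret = ThreadLocalStoreGet();
  ret->ret_handles[exec] = nds;  // other executors' vectors stay valid
}
```

Each C API call that consumes “ret_handles” would then look up ret->ret_handles[executor_handle] instead of assuming the vector belongs to the most recent reshape.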
If anyone could comment on the above, namely: does this sound like a correct understanding of the current MXNet landscape with respect to handling varying input shapes? If not, please help me understand better and suggest alternatives. If so, please let me know whether my proposed solution sounds feasible; if it does, I can try to put in a patch.
Best,
Stephen