MxNet C Api Executor Reshape Question


#1

Hello,

I’m relatively new to MxNet, so apologies off the bat if I have missed something in the documentation or the forums that I should know.

I am using the MxNet C API for inference, and trying to use it for NLP-style tasks where I have varying shape inputs. The first approach I tried was: (1) Bind an executor to a Symbol graph using the maximum size input shapes I allow; (2) Use the MXExecutorBindEX() or the MXExecutorReshape() calls for each new input I have, using the original executor as shared memory, to create a local executor specifically for the current input shape. Then use this local executor for the current input and discard it when done.

The problem I faced with the above approach is too much latency overhead. I was seeing an extra 6.5 - 8.5 ms per call when using these options (depending on if I used MXExecutorReshape or MXExecutorBindEx), compared to measuring the speed when I simply have a single fixed input size.

So the next approach I thought to try was to only allow a fixed number of different input sizes, and create an executor for each input size I support. I just need to pad all input I receive to the nearest allowable size it fits in, and then choose the appropriate executor for that shape. This way I only pay the cost of MXExecutorReshape() during initialization, as opposed to each time I want to process new data.

However when I tried this approach, it became clear that successive calls to MXExecutorReshape() invalidate the executor returned by the previous calls, so that it is only valid to use the reshaped executor associated with the very last call to MXExecutorReshape().

After looking at the code, as far as I can tell (again, I’m new to MxNet and so might be missing something), this doesn’t seem to be a fundamental limitation to the GraphExecutor class, however it is because the MxNet C API is using a single thread local vector to store NDArray objects associated with an executor. So each new call to MXExecutorReshape() clears out the thread local vector of NDArray objects that were shaped appropriately for the previous reshape call. See this code for what I am referring to:

(From. c_api_executor.cc. Many lines omitted for clarity)
int MXExecutorReshape(...) {
    MXAPIThreadLocalEntry *ret = MXAPIThreadLocalStore::Get();
    ret->ret_handles.clear();
    for (const auto& nd : in_arg_vec) {
        ret->ret_handles.push_back(new NDArray(nd))
    }
}

I think this behavior could be changed, so that the thread local storage is really a map from an executor handle to a vector of NDArray objects, then the MXExecutorReshape() call wouldn’t have to wipe out anything, it would just set the thread local “ret_handles” array appropriate for the newly created executor object. Then all the other c api functions that use the thread local “ret_handles” vector would have to be modified to first get the right vector based on the ExecutorHandle they are passed. For example:

MXAPIThreadLocalEntry *ret = MXAPIThreadLocalStore::Get();
ret->ret_handles.get(executor_handle) <--- this is the Vector<NDArray> to use for executor_handle

If anyone could comment on the above, namely: Does this sound like a correct understanding of the current MxNet landscape with respect to handling varying shape inputs? If not, please help me understand better and suggest alternatives. If so, please let me know if my proposed solution sounds feasible, and if so I can try to put in a patch.

Best,
Stephen


#2

@reminisce @eric-haibin-lin


#3

Maybe I missed something here, but in your proposed solution, if you already have created multiple executors of different input shapes, what’s the point of using a map storing executor -> vector<NDArray>?


#4

Hi Reminisce,

First, I should mention that I now have a work around that is giving me a working solution. So the below isn’t time critical for me, but I still think there is a bug in the current api that would be useful to resolve.

The problem with the current API is that calls to MXExecutorOutputs() return the following:

*out = dmlc::BeginPtr(ret->ret_handles);

That is, they return a pointer to the data being held by ret->ret_handles, i.e. &(ret->ret_handles[0]). But when you create a new executor, either by reshaping or calling BindEX or even just creating a brand new executor, the ret->ret_handles vector will change and invalidate the old pointers returned by MXExecutorOutputs(). This is being treated as a global variable in the c api code.

So basically what it means is the following:

  1. Even though MXExecutorOutputs() takes an ExecutorHandle object, future calls to Bind invalidate the use of MXExecutorOutputs() for previously created ExecutorHandle objects. It is only valid to call this function for the most recently created ExecutorHandle object.
  2. One work around to this is to call MXExecutorOutputs() immediately after creating an executor, and just hold onto the results it gives you, so you don’t have to call it again later (which as mentioned would be invalid). However the NDArrayHandle* that you get out of MXExecutorOutputs() is itself invalidated during future calls to Bind, etc. (Because, as mentioned, MXExecutorOutputs() gives you a direct pointer to a global vector’s internal data, which will change out from under clients without them being aware).
  3. So the current work around is that users have to make a copy of the NDArrayHandle* array that MXExecutorOutputs() gives you. All the individual NDArrayHandle objects inside the array remain valid, so after manually copying them all is fine.

Here is a short code sample to show my point:

// assume executor is a previously bound executor handle
mx_uint out_size;
NDArrayHandle *output_handles;
MXExecutorOutputs(executor, &out_size, &output_handles);
vector<NDArrayHandle> outputs(output_handles, output_handles + out_size);

In particular, the output_handles array you get out of MXExecutorOutputs is unsafe to use after future calls to Bind, but the vector of NDArrayHandle’s created above from an immediate copy of output_handles remains safe to use.

To me this seems like a bug in the API, and my solution of having the c api internally store a map between executor -> vector would seem to make the API less brittle to use by removing this bug. (The NDArrayHandle* that MXExecutorOutputs() gives you would no longer be invalidated when new executors get created, because each executor has its own vector. The only other required change is that now MXExecutorFree() would be expected to remove this vector from the internal map so that the map doesn’t grow unbounded).

Note that for cases where someone makes two executors with different size shapes then the current API gives you undefined behavior in the following situation:

// small_executor is created via Bind and has 10 output variables
mx_uint small_out_size;
NDArrayHandle *small_out_vars;
MXExecutorOutputs(small_executor, &small_out_size, &small_out_vars);
// large_executor is created via Bind and has 20 output variables
mx_uint large_out_size;
NDArrayHandle *large_out_vars;
MXExecutorOutputs(large_executor, &large_out_size, &large_out_vars);

The problem with the above code is that small_out_vars is a pointer to the ret->ret_handle vector object’s internal data pointer. BUT when we subsequently create a larger executor we have to reserve more space in this vector, and it will allocate more memory to handle that, and so that small_out_vars ends up being a dangling pointer that causes undefined behavior if used.

Stephen


#5

Hi @stephenrawls,

Are you using mxnet c predict API https://github.com/apache/incubator-mxnet/tree/master/example/image-classification/predict-cpp or general c APIs? For variable length inputs, the python package has BucketingModule which creates multiple executors with shared parameters https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/module/bucketing_module.py. The most important part of it is that it makes use of the simple_bind c api with shared executor/data arrays. Do you only consider using c for inference? I think the most “proper” way is to port similar logic to c (admittedly, it may take lots of time to implement that…).

Is your goal to achieve memory sharing, or simply make it work? What about starting multiple threads, each one of them creates an independent executors (without memory sharing) to avoid the thread local issue?


#6

@eric-haibin-lin,

I am using the MxNet C API, aka the general C API, to create a production inference code base.

I looked into using the C++ API, but it is not as full featured as the C API (e.g. I cannot create int32 NDArray’s using the C++ API). I also looked at the C Predict API, but it has a few downsides for my use case:

  • It doesn’t support int32 inputs, only float32
  • The MXPredSetInput() and MXPredGetOutput() functions do synchnous copies which is inefficient for my NLP-style workload where I am copying multiple data arguments for each call, and copying multiple output arrays in each call. I would prefer to have asynchronous copy commands and then perform my own synchronization when needed.
  • Our use case involves doing some post-processing on the model output, and we like the flexibility of being able to do that post-processing on the GPU. Because the Predict API gives you float* output pointers instead of an NDArrayHandle, this is not possible in the Predict API.

That said, I have already re-created the logic of the bucketing executor in C using the general C API and it appears to be working fine for me.

My only remaining issue is the bug in the API. I have a work around for it so it’s not a blocking issue for me. But I would say that either the current behavior is a bug, or if it is not considered a bug, then it should be advertised very clearly that the output NDArrayHandle* array given by MXExecutorGetOutputs is valid only until the next bind call, at which time it becomes potentially undefined behavior to access that pointer.