Async predictions without blocking in C++ - a hack

I’m just learning mxnet and got to the part where the API is mostly not thread safe and the there are only blocking calls (eg WaitAll()).

@kellen brought up Running Async Predictions how a “can_read” would help alleviate being able to queue forwards up, and reading results only when ready.

I tossed together only what I can describe as a hack, but seems to work with limited testing.
The idea is to make an cpu memory NDArray with a “token”, then after the forward, initiate an async copy on the net output to your cpu memory NDArray. Checking if its ready to read, involves checking to see if the “token” has been overwritten.

Like I mentioned, I’m new to mxnet, so this may be a bad idea, but I dont believe there to be any threading issues…And it doesn’t get away from any ndArray.SyncCopyFromCPU blocks.

The questionable parts do involve a const cast on the ndArray.GetData() pointer, but the C API isn’t marked const so I figured it’s safe from a CPU memory NDArray.

I wish there was a better way than polling. A callback would be preferable.

Anyways here’s the code to mark the buffer with a token:

    /* Arbitrary number */
    const float NOT_READY_TOKEN = 1234321.1234f;

    std::vector<NDArray> outputs;


    for (const auto& output : executor->outputs)
        auto output_shape = output.GetShape();

        const int shape_size = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<>());

        /* Create output NDArray on */
        NDArray out(output_shape, Context(kCPU, 0), false, output.GetDType());

        const auto data = out.GetData();

        /* Get last pointer in NDArray */
        float& fdata = const_cast<float&>(*(data + shape_size - 1));

        /* Set our not ready token */
        fdata = NOT_READY_TOKEN;

        /* queue async copy to our output NDArray*/


And here is how to check if its ready to read:

bool check_output_ready(const std::vector<NDArray>& outputs)
    for(const auto& out : outputs)
        const auto data = out.GetData();

        auto output_shape = out.GetShape();

        const int shape_size = std::accumulate(output_shape.begin(), output_shape.end(), 1, std::multiplies<>());

        const auto fdata = data[shape_size - 1];

        if(fdata == NOT_READY_TOKEN)
            return false;

    return true;

Just an update.

With the above, I have a libuv based loop (single threaded) that fires a callback whenever a forward’s result is ready and it is running really well.

Another interesting thing, is now that I read the results only when they are available, results sometimes come back out of order from the order the fowards were issued. This is a benefit so if there is a heavy running GPU model, and a lighter weight one, we do not have to wait for the heavy one to finish before processing the lighter one. (Or make guesses on which one will finish first and sync read the result).

So now I am able to queue up as many forwards, within reason, as I want and process when they are ready, giving much better latency.

Again, I’m new to mxnet, so don’t take anything here as authoritative. Just wanted to share.

1 Like

Hello @jmorrill,
interesting idea! I’m not sure if I understood your concept correctly, but can’t you use .WaitToRead() instead of check_output_ready()?

I also experienced problems, when running predictions asynchronously in C++.
Therefore, I created an individual executor for each thread.
Unfortunately, however this means that you have call .bind() on each executor (#16173) and can result in a long start-up time.

Here is a the corresponding code segment for this:

1 Like

It was my understanding that the “prescribed” method of using mxnet, was to create a single thread that will call mxnet APIs, and for the life of the application to use that thread to dispatch calls to mxnet. Checking the mxnet source, with the “static” engine, that lead me to believe things weren’t threadsafe (beyond some Predictor API).

Assuming I may do a forward on a CPU net and maybe multiple GPUs at the same time, a WaitToRead call would block the only thread that I believed can talk mxnet.

Maybe a more obvious explanation: So say I did two forwards, back to back, one to each GPU. What if GPU-B finished before GPU-A, but I’m doing a WaitToRead on GPU-B’s output. Now GPU-B isn’t doing anything, while we wait on GPU-A to finish.

With an “is_output_ready()”, we can poll an output (I hate polling, but its better than blocking) on an event loop, without blocking the mxnet thread, and read the results in the order they complete.

I am using a proprietary async library I made (based on Microsoft’s open source ppl task library and libuv) to compose the async like this (code all runs on same libuv thread):

template<typename T> task<T> wait_for_predictor_result(const T& predictor_result, 
                                                       milliseconds poll_duration = milliseconds(3), 
                                                       const cancellation_token& ct = cancellation_token::none())
    return uvxx::pplx::create_iterative_task([=]
        const bool is_ready = predictor_result->is_output_ready();

        return is_ready ? task_from_result(false) : create_timer_task(poll_duration).then([predictor_result]
            return true;
        }, task_options(ct));
    }, task_continuation_context::use_current(), ct).then([=]
        return predictor_result;
    }, ct);

You say you are doing a thread for each executor? This does tweak my understanding of mxnet as I thought it wasn’t supported, but that certainly is a welcome ability that I will probably implement also! Thanks for sharing that!

1 Like