Trainer.step doesn't block, is this safe, is this faster?

Consider the following code:

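import numpy as np
import mxnet as mx
from mxnet import autograd, nd

# net, loss, trainer and dataloader are assumed to be defined elsewhere
ctx = mx.gpu()
l_total = nd.zeros(1, ctx, dtype=np.float32)

for x, y in dataloader:
    x = x.as_in_context(ctx)
    y = y.as_in_context(ctx)
    with autograd.record():
        l = loss(net(x), y)
    l.backward()
    trainer.step(x.shape[0])
    l_total += l.mean()  # stays on the GPU, so nothing here blocks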

Until recently, I always stored l_total as a plain Python float and accumulated it via l_total += l.mean().asnumpy(), which also acted as a barrier (like mx.nd.waitall()). So after I switched l_total to an NDArray on the GPU, I was very surprised to find that trainer.step doesn’t block execution on its own.

With this code, the GPU is first filled with batches up to its memory capacity, and only then does the computation start (assuming the net itself is slower than the speed at which batches load). This seems really weird, since each batch should logically depend on the results of the previous batch (the weights of the network must be updated by the trainer first).

So my questions are:

  • Without barriers, does trainer.step preserve the ordering of batches and apply the weight updates correctly?
  • If not, I think trainer.step should block by default, since most users won’t expect this behavior.
  • If it does update the weights correctly, what are the performance implications of not blocking? On one hand, avoiding barriers and not pulling values from the GPU on every batch is generally considered good. On the other hand, I have doubts: performance might degrade because the GPU memory fills up, because of cache misses, or for any number of other reasons.


This is the default execution mode of MXNet. All operations are added to an execution stream and are therefore scheduled to be executed asynchronously (and the graph dependencies are conserved).

Accessing the resulting NDArrays from Python (with mx.nd.waitall(), wait_to_read() or asnumpy()) just makes the frontend wait for the results; it does not affect the execution itself, other than the fact that you can’t schedule further operations while Python is waiting.
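To illustrate the idea of "schedule now, run later, keep the order", here is a minimal pure-Python sketch. It is a toy model and not MXNet's actual engine: ToyEngine, push and wait_all are hypothetical names, and a single worker thread stands in for one execution stream. The real engine tracks dependencies per array and can run independent operations in parallel; this sketch only models ordering within one stream.

```python
import queue
import threading


class ToyEngine:
    """Toy stand-in for an asynchronous backend: one worker runs ops in push order."""

    def __init__(self):
        self._ops = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            fn = self._ops.get()
            fn()
            self._ops.task_done()

    def push(self, fn):
        """Schedule fn and return immediately, without waiting for it to run."""
        self._ops.put(fn)

    def wait_all(self):
        """Block the caller until every scheduled op has finished."""
        self._ops.join()


engine = ToyEngine()
state = {"w": 1.0}  # stands in for the network weights
seen = []           # weight value observed by each "forward pass"

gate = threading.Event()
engine.push(gate.wait)  # hold the worker so that ops pile up in the queue
for _ in range(3):
    engine.push(lambda: state.update(w=state["w"] - 0.1))   # "trainer.step"
    engine.push(lambda: seen.append(round(state["w"], 1)))  # "forward pass"

ran_before_sync = len(seen)  # the frontend got here although nothing has run yet
gate.set()                   # release the worker
engine.wait_all()            # explicit sync point
print(ran_before_sync, seen)
```

Every "forward pass" in the sketch observes the weight left by the update scheduled just before it, even though the frontend never waited between the two. In real MXNet code, the analogous sync points are mx.nd.waitall(), wait_to_read() and asnumpy().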

You can also find info about it in this thread:

Hi, thank you for your answer. I am aware of the asynchronous execution nature of MXNet. However, this doesn’t quite answer my question.

Firstly, it is not obvious to me that “graph dependencies are conserved” implies “weight updates are done sequentially with all the other operations”. I am guessing that they are, but I wasn’t sure, since I don’t know whether Trainers are considered part of the computational graph.

Just to be clear, consider the following example:

with autograd.record():
    l1 = loss(net(x1), y)
l1.backward()
trainer.step(x1.shape[0]) # 1
result = net(x2).asnumpy() # 2
l1.asnumpy() # 3

Is it guaranteed that the forward pass at line #2 will use the weights updated by trainer.step(...) and not the original ones? The weight update was scheduled at line #1; however, since trainer.step(...) itself doesn’t have a barrier, it seems plausible that the weights are updated only after #3, when the loss itself is actually requested by the frontend.

The second issue, performance, is also a little tricky. The view that barriers like mx.nd.waitall() “don’t impact execution” is a simplification and, in my experience, doesn’t always hold.

For example, you can accidentally cause an OOM error if you are not careful about waiting for the execution results. I suspect that not waiting for a batch to finish allows the frontend to start loading the next batch into the GPU’s memory while the GPU is still working on the previous one, but once again I am not sure, hence my question.
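To make the run-ahead concern concrete, here is a toy tick-based simulation. It is not MXNet code and makes no claim about real memory usage. It assumes (hypothetically) that the frontend enqueues one batch per tick, that the device needs device_ticks ticks per batch, and that a sync point drains everything queued so far. The function name peak_pending and its parameters are made up for this illustration.

```python
def peak_pending(num_batches, device_ticks=5, sync_every=None):
    """Return the peak number of batches that are enqueued but not yet finished.

    Toy model: the frontend enqueues one batch per tick, the device needs
    device_ticks ticks per batch, and a sync point drains the whole queue.
    """
    pending = 0   # batches enqueued but not finished
    progress = 0  # ticks already spent on the batch being processed
    peak = 0
    for i in range(1, num_batches + 1):
        pending += 1
        peak = max(peak, pending)
        if sync_every is not None and i % sync_every == 0:
            # The frontend blocks here (think asnumpy() or waitall()),
            # so the device finishes everything queued so far.
            pending = 0
            progress = 0
            continue
        progress += 1
        if progress == device_ticks:
            pending -= 1
            progress = 0
    return peak


no_sync = peak_pending(100)
sync_10 = peak_pending(100, sync_every=10)
print(no_sync, sync_10)
```

In this model the backlog without a sync point keeps growing with the number of batches, while a sync every few iterations keeps it bounded. Whether a real training loop behaves the same way depends on how fast the data loader is relative to the network.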