Consider the following code:
```python
import numpy as np
import mxnet as mx
from mxnet import autograd, nd

ctx = mx.gpu()
l_total = nd.zeros(1, ctx, dtype=np.float32)
for x, y in dataloader:
    x = x.as_in_context(ctx)
    y = y.as_in_context(ctx)
    with autograd.record():
        l = loss(net(x), y)
    l.backward()
    l_total += l.mean()
    trainer.step(x.shape[0])  # step with the batch size
```
Until recently, I always stored `l_total` as a plain Python float and accumulated it via `l_total += l.mean().asnumpy()`, which also acted as a synchronization barrier (like `mx.nd.waitall()`). So after switching `l_total` to an NDArray on the GPU, I was very surprised to find that `trainer.step` doesn't block execution on its own.
With this code, the GPU is first filled with batches up to its memory capacity, and only then does the computation start (assuming the net itself is slower than the speed at which batches load). This seems really weird, since each batch should logically depend on the results of the previous one: the trainer must update the network's weights between batches.
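To check that I understand the deferred-execution model, here is a minimal pure-Python sketch (no MXNet, just a worker thread and a FIFO queue) of how I picture the engine: enqueuing an operation returns immediately, so the pending queue grows as fast as the producer loop runs, and only an explicit barrier drains it. The `AsyncEngine` class and its method names are my own invention for illustration, not MXNet's API.

```python
import queue
import threading
import time

class AsyncEngine:
    """Toy model of a deferred-execution engine: ops are enqueued
    and executed later, in order, by a single worker thread."""
    def __init__(self):
        self.ops = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn = self.ops.get()
            fn()                  # actually execute the queued operation
            self.ops.task_done()

    def push(self, fn):
        self.ops.put(fn)          # returns immediately, like an nd op call

    def wait_all(self):
        self.ops.join()           # blocks until the queue drains, like mx.nd.waitall()

engine = AsyncEngine()
results = []

# "Loading batches": pushing work is near-instant, so the queue fills up
for i in range(5):
    engine.push(lambda i=i: (time.sleep(0.01), results.append(i)))

queued = engine.ops.qsize()       # most ops are still pending right after the loop
engine.wait_all()                 # synchronization barrier
print(queued, results)
```

Because the worker consumes a single FIFO queue, ordering is preserved even though submission never blocks, which is what I assume (and am asking about) for the real engine.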
So my questions are:
- Without barriers, does `trainer.step` preserve the ordering of batches and produce correct weight updates?
- If not, I think `trainer.step` should block by default, since most users won't expect this behavior.
- If it does update the weights correctly, what are the performance implications of not blocking? On one hand, avoiding barriers and not pulling values from the GPU on every batch is generally considered good. However, I have doubts: there might be performance degradation because GPU memory fills up, because of cache misses, or for any of a hundred other reasons.
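In the meantime, I've worked around the memory issue by synchronizing every N batches, which bounds the number of in-flight batches while still amortizing the barrier cost. Here is a runnable toy sketch of the pattern using a stdlib worker queue as a stand-in for the engine; in real MXNet code the barrier would be a blocking read such as `l_total.asscalar()` or `mx.nd.waitall()`, and `SYNC_EVERY` is a knob I picked, not a library setting:

```python
import queue
import threading
import time

ops = queue.Queue()

def worker():
    while True:
        ops.get()()               # execute one queued "batch"
        ops.task_done()

threading.Thread(target=worker, daemon=True).start()

SYNC_EVERY = 4                    # hypothetical knob: how often to place a barrier
max_pending = 0
for i in range(20):
    ops.put(lambda: time.sleep(0.005))   # stands in for one batch of GPU work
    max_pending = max(max_pending, ops.qsize())
    if (i + 1) % SYNC_EVERY == 0:
        ops.join()                # periodic barrier, like a blocking read every N batches

print(max_pending)                # stays bounded by SYNC_EVERY
```

The queue never holds more than `SYNC_EVERY` pending items, which is the behavior I'd want from the training loop: bounded memory without a full synchronization on every batch.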