Consider the following code:
```python
import numpy as np
import mxnet as mx
from mxnet import autograd, nd

ctx = mx.gpu()
l_total = nd.zeros(1, ctx=ctx, dtype=np.float32)
for x, y in dataloader:
    x = x.as_in_context(ctx)
    y = y.as_in_context(ctx)
    with autograd.record():
        l = loss(net(x), y)
    l.backward()
    l_total += l.mean()      # accumulated on the GPU, no sync point
    trainer.step(x.shape[0])
```
Until recently, I had always stored `l_total` as a plain Python float and accumulated it via `l_total += l.mean().asnumpy()`, which also acted as a synchronization barrier (like `mx.nd.waitall()`). So after switching `l_total` to an NDArray on the GPU, I was surprised to find that `trainer.step` doesn't block execution on its own.
With this code, the GPU is first filled with batches up to its memory capacity, and only then does the computation start (assuming the net itself is slower than the rate at which batches load). This seems really strange, since each batch should logically depend on the results of the previous one (the trainer must update the network's weights between batches).
So my questions are:

- Without barriers, does `trainer.step` preserve the ordering of batches and update the weights correctly?
- If not, I think `trainer.step` should block by default, since most users won't expect this behavior.
- If it does update the weights correctly, what are the performance implications of not blocking? On one hand, avoiding barriers and not pulling values from the GPU on every batch is generally considered good. Still, I have doubts: there might be performance degradation because GPU memory fills up, because of cache misses, or for any of a hundred other reasons.