Both https://nnabla.org/paper/imagenet_in_224sec.pdf and http://aclweb.org/anthology/W18-6301 mention overlapping gradient sync with backward computation as a trick to increase training speed in distributed training settings. Can this be done in MXNet? How would it happen in Gluon?
The short answer is: MXNet engine already does that.
The long answer: the MXNet engine is an asynchronous dependency engine that performs operations as soon as their dependencies are resolved (and resources are available). When you follow a `backward()` call with a `trainer.step()` call, all the gradient synchronization ops are queued alongside the backward operations. This means that as soon as `dL/dW` is calculated for some weight, the synchronization and optimization for that weight can proceed immediately, without waiting for the rest of the backward pass to complete. The same goes for other operations, such as copying data between CPU and GPU.
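To make the idea concrete, here is a minimal, illustrative Python sketch of the scheduling pattern described above. This is not MXNet internals or Gluon API code; it just simulates a backward pass that produces per-layer gradients one at a time, and launches each gradient's "sync" on a worker thread the moment that gradient is ready, so sync overlaps with the remaining backward computation. The function and parameter names are invented for illustration; the sleeps stand in for compute and communication time.

```python
import threading
import time

def overlapped_backward_and_sync(num_layers=4, backward_time=0.01, sync_time=0.01):
    """Simulate overlapping gradient sync with backward computation.

    Backward runs layer by layer (last layer first). As soon as a layer's
    gradient is 'computed', its sync is launched on a worker thread instead
    of waiting for the whole backward pass, mimicking what a dependency
    engine does when sync ops are queued alongside backward ops.
    """
    sync_threads = []
    synced = []
    lock = threading.Lock()

    def sync_gradient(layer):
        time.sleep(sync_time)  # simulated all-reduce / parameter-server push-pull
        with lock:
            synced.append(layer)

    for layer in reversed(range(num_layers)):  # backward visits the last layer first
        time.sleep(backward_time)              # simulated dL/dW computation for this layer
        t = threading.Thread(target=sync_gradient, args=(layer,))
        t.start()                              # sync begins immediately; no global barrier
        sync_threads.append(t)

    # Analogue of the point where the optimizer needs all gradients:
    # wait only here, after sync has already been overlapping with backward.
    for t in sync_threads:
        t.join()
    return synced

print(sorted(overlapped_backward_and_sync()))
```

In Gluon, you never write this scheduling yourself: calling `backward()` and then `trainer.step()` enqueues both sets of operations, and the engine interleaves them exactly as the simulation above does by hand.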
Excellent, thanks Sina, that's a great illustration of the power of the asynchronous dependency engine.