Overlap gradient communication with backward pass

olivcruche · January 16, 2019, 5:38pm

Both https://nnabla.org/paper/imagenet_in_224sec.pdf and http://aclweb.org/anthology/W18-6301 mention overlapping gradient synchronization with the backward computation as a trick to increase training speed in distributed training settings. Can this be done in MXNet? How would it work in Gluon?

safrooze · January 16, 2019, 6:53pm

The short answer is: MXNet engine already does that.

The long answer: the MXNet engine is an asynchronous dependency engine that executes each operation as soon as its dependencies are resolved (and resources are available). When you follow a backward() call with a trainer.step() call, the gradient synchronization ops are queued alongside the backward ops. This means that as soon as dL/dW has been computed for some weight, the synchronization and optimization for that weight can start immediately, without waiting for the rest of backward to complete. The same goes for other operations, such as copying data between CPU and GPU.
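To make the scheduling argument concrete, here is a minimal timeline sketch in plain Python. It is not MXNet code and does not call the engine: the layer names, the durations, and the `finish_time` helper are all made up for illustration, and it assumes one compute stream plus one communication channel that handles syncs in the order the gradients become ready.

```python
# Toy timeline model of overlapping gradient sync with the backward pass.
# Hypothetical numbers and names; this only illustrates dependency-driven
# scheduling, it is not how the MXNet engine is implemented.

def finish_time(layers, overlap):
    """Return the time at which the last gradient has been synchronized.

    layers:  list of (name, backward_time, sync_time) tuples, in the order
             the backward pass visits them (last layer first).
    overlap: if True, a layer's sync may start as soon as its own gradient
             is ready; if False, every sync waits for the whole backward pass.
    """
    compute_clock = 0.0  # time at which the compute stream is next free
    grad_ready = []      # (time the gradient is ready, sync duration)
    for _name, backward_time, sync_time in layers:
        compute_clock += backward_time
        grad_ready.append((compute_clock, sync_time))

    backward_done = compute_clock
    comm_clock = 0.0     # time at which the communication channel is next free
    for ready_at, sync_time in grad_ready:
        # The only dependency of a sync is its own gradient (overlap=True),
        # or the end of the entire backward pass (overlap=False).
        earliest_start = ready_at if overlap else backward_done
        comm_clock = max(comm_clock, earliest_start) + sync_time
    return comm_clock


# Three made-up layers: 2 time units of backward and 1 of sync each.
layers = [("fc2", 2.0, 1.0), ("fc1", 2.0, 1.0), ("conv1", 2.0, 1.0)]
sequential = finish_time(layers, overlap=False)
overlapped = finish_time(layers, overlap=True)
print("no overlap:", sequential, "with overlap:", overlapped)
```

With these numbers the non-overlapped schedule finishes at 9.0 and the overlapped one at 7.0: only the sync of the last gradient to be computed stays exposed, because the earlier syncs run while backward is still working on the remaining layers.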

olivcruche · January 16, 2019, 8:09pm

Excellent, thanks Sina. That’s a great illustration of the power of the asynchronous dependency engine.
