Async updates with Gluon Trainer with multiple devices on one node

bejjani · January 27, 2018, 6:39am

Is it possible to perform gradient updates asynchronously, à la the dist_async mode of kvstore, but on a single machine with multiple GPUs?
My understanding is that a call to gluon.Trainer.step with kvstore='device' will wait for the gradients from all devices to be available before performing the update.
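
To make the question concrete, here is a toy model of the synchronous behaviour described above. It is plain Python, not MXNet's actual implementation: `sync_step` and its arguments are hypothetical names, and summing is assumed as the aggregation rule.

```python
def sync_step(weights, per_device_grads, lr):
    """One synchronous update: aggregate gradients from every device, then update once."""
    n = len(weights)
    # Aggregation acts as a barrier: every device's gradient must be present.
    total = [sum(g[i] for g in per_device_grads) for i in range(n)]
    # A single update is applied to the shared weights.
    return [w - lr * t for w, t in zip(weights, total)]


weights = [1.0, 2.0, 3.0]
# Two devices with sparse, non-overlapping gradients.
grads = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
new_weights = sync_step(weights, grads, lr=0.1)
```

No device can move ahead until the aggregated gradient has been computed, which is the waiting behaviour the question is about.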

rahul003 · February 10, 2018, 1:13am

There’s no way to do this currently. Are your devices of different types? If not, all gradients should be available at almost the same time.

bejjani · February 10, 2018, 7:51am

All my devices are the same, so, like you said, gradients are available at almost the same time.
This is less about wall-clock performance and more about using async updates as a means of converging to a potentially better solution in some cases.
In a sparse setting, for example a recommender system, the chance of the devices updating the same weights is so small that it is wasteful to aggregate and sync the gradients instead of just performing the updates locally on each device as soon as they are available. A sync would then only be needed periodically, for example to combine the devices' gradients by averaging.
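
A minimal sketch of that idea in plain Python (not Gluon code, since Trainer does not support this): each device keeps its own weight replica, applies its sparse gradients right away, and the replicas are reconciled every few steps. The names `local_sgd`, `grad_streams` and `sync_every` are hypothetical, and the periodic sync here averages the weight replicas rather than the gradients, which is only one possible reading of the scheme.

```python
def local_sgd(replicas, grad_streams, lr, sync_every):
    """Apply each device's sparse gradients locally; average replicas every `sync_every` steps."""
    num_steps = len(grad_streams[0])
    for step in range(1, num_steps + 1):
        for weights, stream in zip(replicas, grad_streams):
            # Sparse gradient as {row_index: value}; only touched rows change.
            for idx, g in stream[step - 1].items():
                weights[idx] -= lr * g
        if step % sync_every == 0:
            # Periodic sync: element-wise average across replicas.
            n = len(replicas)
            mean = [sum(r[i] for r in replicas) / n for i in range(len(replicas[0]))]
            for r in replicas:
                r[:] = mean
    return replicas


replicas = [[0.0] * 4 for _ in range(2)]
grad_streams = [
    [{0: 1.0}, {1: 1.0}],  # device 0 touches rows 0 and 1
    [{2: 1.0}, {3: 1.0}],  # device 1 touches rows 2 and 3
]
result = local_sgd(replicas, grad_streams, lr=0.5, sync_every=2)
```

One caveat with this choice: when the devices touch disjoint rows, plain averaging scales each local update by 1/num_devices, so summing each replica's change since the last sync may be a better fit.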
