Dist training with geographically very distant servers

olivcruche · August 23, 2019, 8:48pm

Hi,

I have a use case where I want to train a model with distributed SGD, using worker nodes that are in different locations (cities or countries) and that each hold their own share of the data. I want the nodes to share only gradients and parameters with other nodes, not raw data.
Is there any reason why this would not be possible to implement with MXNet’s default parameter server?

ThomasDelteil · September 1, 2019, 5:36pm

It should be fine. The parameter server does not share the data, only gradient updates.
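
To make that concrete, here is a toy sketch in plain Python of synchronous parameter-server SGD. It is not MXNet code: the `ParameterServer` and `Worker` classes, the `y = w * x` model, and the data values are all made up for illustration. The point is only that each worker's shard stays local and the server sees nothing but gradients.

```python
# Toy simulation of synchronous parameter-server SGD (illustrative only,
# not MXNet's implementation). Each worker keeps its data shard private
# and exchanges only gradients and parameters with the server.


class ParameterServer:
    def __init__(self, w, lr):
        self.w = w    # the single model parameter
        self.lr = lr  # learning rate

    def pull(self):
        # Workers fetch the current parameter value.
        return self.w

    def push(self, grads):
        # Synchronous update: average the workers' gradients, then step.
        self.w -= self.lr * sum(grads) / len(grads)


class Worker:
    def __init__(self, xs, ys):
        # Private shard: used only inside gradient(), never sent anywhere.
        self._xs, self._ys = xs, ys

    def gradient(self, w):
        # Gradient of the mean squared error for the model y = w * x.
        n = len(self._xs)
        return sum(2 * (w * x - y) * x for x, y in zip(self._xs, self._ys)) / n


def train(workers, server, steps):
    for _ in range(steps):
        w = server.pull()
        # Only these gradient values travel from the workers to the server.
        server.push([worker.gradient(w) for worker in workers])
    return server.pull()


# Two "sites", each holding its own data generated from y = 3 * x.
workers = [
    Worker([1.0, 2.0], [3.0, 6.0]),
    Worker([3.0, 4.0], [9.0, 12.0]),
]
server = ParameterServer(w=0.0, lr=0.01)
w_final = train(workers, server, steps=200)
print(w_final)
```

With geographically distant workers, each `pull`/`push` round trip in a loop like this would pay the wide-area network latency, so that is the part worth measuring first.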

olivcruche · September 5, 2019, 5:27pm

Yes, my main question is about how to connect heterogeneous machines (e.g. multi-cloud), but I guess I'll give it a try and see if any specific problems arise!
