Distributed training with geographically very distant servers


I have a use case where I want to train a model with distributed SGD, using worker nodes that are in different locations (cities or countries) and that each hold their own share of the data. I want the nodes to share only gradients and parameters with other nodes, not raw data.
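For intuition on what such a setup exchanges, here is a toy pure-Python sketch of one synchronous distributed SGD loop (no MXNet; all names are made up for illustration): each worker computes a gradient on its private shard, and only that gradient crosses the machine boundary.

```python
# Toy synchronous distributed SGD: workers exchange gradients only,
# never their raw data. Names are illustrative, not MXNet APIs.

def local_gradient(w, shard):
    # Gradient of mean((w*x - y)^2) over this worker's private shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def sync_sgd_step(w, shards, lr):
    # Each worker sends only its gradient; the "server" averages
    # the gradients and applies one update to the shared parameter.
    grads = [local_gradient(w, s) for s in shards]
    return w - lr * sum(grads) / len(grads)

# Two "workers", each holding private data drawn from y = 3*x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, shards, lr=0.02)
print(round(w, 3))  # converges toward 3.0
```

The raw (x, y) pairs never leave their shard; only the scalar gradients do, which is exactly the property you want from the parameter server.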
Is there any reason why this would not be possible to implement with MXNet’s default parameter server?

It should be fine: the parameter server exchanges only gradient and parameter updates, never the raw data.
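For reference, MXNet's parameter server is wired up through ps-lite environment variables, so nothing in the launch step ships data around. A rough sketch of a manual multi-machine launch (the address, port, counts, and `train.py` are placeholders for your own setup):

```shell
# One scheduler address that every machine can reach
# (in a multi-cloud setup this means a routable IP and an open port).
export DMLC_PS_ROOT_URI=203.0.113.10   # placeholder scheduler address
export DMLC_PS_ROOT_PORT=9091
export DMLC_NUM_SERVER=1
export DMLC_NUM_WORKER=2

# On the scheduler machine:
DMLC_ROLE=scheduler python train.py

# On the server machine:
DMLC_ROLE=server python train.py

# On each worker machine:
DMLC_ROLE=worker python train.py
```

Here `train.py` stands in for your training script, which would create its kvstore with `mx.kvstore.create('dist_sync')` (or `'dist_async'`); the workers then push gradients to and pull parameters from the server processes over these connections.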


Yes, my main question is about how to connect heterogeneous machines (e.g. multi-cloud), but I guess I’ll give it a try and see whether any specific problems arise!