How to run distributed training on my own cluster NOT AWS?

FayW · November 7, 2017, 3:04am

As far as I’ve seen, all distributed training is performed on AWS. However, we have a cluster of Xeon E5 machines with 100G OPA, and we would like to try MXNet on it. I found this document: https://mxnet.incubator.apache.org/tutorials/python/kvstore.html

which says:

"Run on Multiple Machines

Based on parameter server, the updater runs on the server nodes. When the distributed version is ready, we will update this section."

Does that mean MXNet does not currently support distributed training on a custom cluster?

If so, is there a plan for when this will be implemented, or is it not going to be done at all?

madjam · November 7, 2017, 3:30pm

There isn’t anything special about running distributed training on EC2. In your setup, you just need to make sure the nodes are on the same network and can communicate with each other. The scheduler or driver node (the node where you run the launch.py script) should be able to log in to all other nodes via passwordless ssh (if you are using ssh to launch training on individual nodes). In AWS, all nodes mount an EFS volume. This is convenient, but all you need is for the code and data to be accessible at a preset location on each node.
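
As a rough sketch of the setup described above, the Python helper below builds a passwordless-ssh check for each node and assembles a launch.py command line. The host names, the `hosts` file, the `tools/launch.py` path, and `train.py` with its `--kv-store dist_sync` argument are all hypothetical placeholders. The `-n`, `-H`, and `--launcher` flags are written from memory of the launch.py script in the MXNet repository, so confirm them with `python tools/launch.py --help` on your version before relying on them.

```python
import shlex
import subprocess
from typing import List


def ssh_check_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never fall back to a password prompt
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable nodes
        host,
        "true",                               # harmless remote command
    ]


def unreachable_hosts(hosts: List[str]) -> List[str]:
    """Return the hosts the driver node cannot log in to without a password."""
    failed = []
    for host in hosts:
        result = subprocess.run(ssh_check_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed


def launch_command(num_workers: int, hostfile: str, train_cmd: List[str],
                   launch_script: str = "tools/launch.py") -> List[str]:
    """Assemble the launch.py invocation (flag names are an assumption)."""
    return [
        "python", launch_script,
        "-n", str(num_workers),   # number of worker nodes
        "-H", hostfile,           # file listing one host per line
        "--launcher", "ssh",      # start remote processes over ssh
    ] + list(train_cmd)


if __name__ == "__main__":
    # Hypothetical training script; it must exist at the same path on every node.
    cmd = launch_command(2, "hosts", ["python", "train.py", "--kv-store", "dist_sync"])
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running `unreachable_hosts(["node1", "node2"])` from the driver node before launching should return an empty list if passwordless ssh is set up correctly. The script only prints the launch command, so you can inspect it before running it yourself.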
