How to run distributed training on my own cluster, not AWS?


#1

From what I've seen so far, all distributed training examples are run on AWS. However, we have a cluster of Xeon E5 machines with 100G OPA, and we would like to try MXNet on it. I found this document: https://mxnet.incubator.apache.org/tutorials/python/kvstore.html

which says that
"Run on Multiple Machines

Based on parameter server, the updater runs on the server nodes. When the distributed version is ready, we will update this section."
Does that mean MXNet does not currently support distributed training on a custom cluster?

If so, is there a plan for when this will be implemented? Or is it not going to be done at all?


#2

There isn't anything special about running distributed training in EC2. In your setup you just need to make sure the nodes are on the same network and can communicate with each other. The scheduler or driver node (the node where you run the launch.py script) should be able to log in to all other nodes via passwordless ssh (if you are using ssh to launch training on the individual nodes). In AWS all nodes mount an EFS volume. This is convenient, but all you really need is for the code and data to be accessible at the same preset location on each node.
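As a rough sketch of the checks described above, the Python helper below reads a hostfile, tests whether each node accepts passwordless ssh, and assembles a launch.py command line. Several things here are assumptions rather than anything confirmed in this thread: the hostfile format (one hostname per line), the script path `tools/launch.py`, the flags `-n`, `-s`, `-H` and `--launcher ssh` (as I recall them from MXNet's launcher, so please verify with `python tools/launch.py --help`), and the training command, where `train.py` is a hypothetical script of your own.

```python
import shlex
import subprocess
from pathlib import Path


def read_hosts(hostfile):
    """Return the non-empty lines of a hostfile (assumed: one hostname per line)."""
    lines = Path(hostfile).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def can_ssh_without_password(host, timeout=10):
    """Return True if `ssh host true` succeeds without any interactive prompt.

    BatchMode=yes makes ssh fail instead of asking for a password.
    """
    cmd = ["ssh", "-o", "BatchMode=yes",
           "-o", f"ConnectTimeout={timeout}", host, "true"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout + 5)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def build_launch_command(hostfile, num_workers, num_servers, train_cmd,
                         launch_script="tools/launch.py"):
    """Assemble the launcher command as an argument list.

    The flag names are assumptions based on MXNet's tools/launch.py;
    train_cmd is whatever you would run on a single node.
    """
    return ["python", launch_script,
            "-n", str(num_workers),
            "-s", str(num_servers),
            "--launcher", "ssh",
            "-H", str(hostfile)] + shlex.split(train_cmd)
```

On the driver node you could call `read_hosts("hosts")`, report every host for which `can_ssh_without_password` returns False, and only then pass the result of `build_launch_command("hosts", 2, 2, "python train.py --kv-store dist_sync")` to `subprocess.run`. The script and data paths in the training command must exist at the same location on every node.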