Is it still possible to add worker nodes during training?


#1

Hi, newbie to MXNet here.
After reading your paper about the parameter server, I started to think about how to make full use of that design.
In short, I’m trying to build a demo that dynamically adds and shuts down workers during training.
That would let us use Spot Instances to make training more affordable.
I checked a few tutorials, and it seems you need to pre-configure everything before training.
The issue at https://github.com/apache/incubator-mxnet/issues/7320 is fairly negative about this,
but the issue at https://github.com/apache/incubator-mxnet/issues/4867 gives me some hope.
So my question is: is it still possible to do this?


#2

I don’t think this is possible today, at least not without some code changes. Once the cluster is fully set up, i.e. the required number of servers and workers have joined, the scheduler notifies every node that joined of the roles and locations of all other nodes in the cluster. After this the scheduler does not do much: it does not actively monitor cluster health or accept any new nodes. So replacing nodes in a running cluster cannot happen. That said, this isn’t an impossible task; it might need someone to dig into the ps-lite code.
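
To make the behaviour above concrete, here is a rough toy model of a fixed-membership rendezvous. It is not ps-lite code: `ToyScheduler`, `register`, and the addresses are all hypothetical names made up for illustration, and the real scheduler may handle late joiners differently from the simple rejection shown here.

```python
# Toy model of a fixed-membership rendezvous. NOT ps-lite code;
# every name here (ToyScheduler, register, addresses) is hypothetical.

class ToyScheduler:
    def __init__(self, num_servers, num_workers):
        # Cluster size is fixed up front, before any node joins.
        self.expected = {"server": num_servers, "worker": num_workers}
        self.joined = {"server": [], "worker": []}
        self.started = False

    def register(self, role, address):
        """Record a node. Return the full node table once the cluster
        is complete, otherwise None."""
        if self.started:
            # Assumption of this sketch: late joiners are simply refused.
            raise RuntimeError("cluster already started; no new nodes accepted")
        if len(self.joined[role]) >= self.expected[role]:
            raise RuntimeError(f"all {role} slots are already taken")
        self.joined[role].append(address)
        complete = all(
            len(self.joined[r]) == self.expected[r] for r in self.expected
        )
        if complete:
            self.started = True
            # Every node would receive this table of roles and locations.
            return {r: list(addrs) for r, addrs in self.joined.items()}
        return None


if __name__ == "__main__":
    sched = ToyScheduler(num_servers=1, num_workers=2)
    sched.register("server", "10.0.0.1:9000")
    sched.register("worker", "10.0.0.2:9000")
    print(sched.register("worker", "10.0.0.3:9000"))  # cluster is now complete
```

Supporting spot instances would mean changing the equivalent of the `started` check so that membership can grow or shrink after startup, and then telling the existing nodes about the change.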


#3

With ps-lite, you have to specify the number of servers and the number of workers when the cluster starts.
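
As a rough illustration, these counts are normally passed to every process through the `DMLC_*` environment variables before it starts. The helper `make_ps_env` below is a hypothetical name, and the address and port are placeholder values, not anything taken from this thread.

```python
# Sketch: build the environment for one node of a ps-lite based cluster.
# make_ps_env is a hypothetical helper; the address and port are placeholders.
import os


def make_ps_env(role, num_servers, num_workers, root_uri, root_port):
    """Return a copy of the current environment with the DMLC_* variables set."""
    if role not in ("scheduler", "server", "worker"):
        raise ValueError(f"unknown role: {role}")
    env = dict(os.environ)
    env.update({
        "DMLC_ROLE": role,
        "DMLC_PS_ROOT_URI": root_uri,         # scheduler address
        "DMLC_PS_ROOT_PORT": str(root_port),  # scheduler port
        "DMLC_NUM_SERVER": str(num_servers),  # fixed for the whole run
        "DMLC_NUM_WORKER": str(num_workers),  # fixed for the whole run
    })
    return env


if __name__ == "__main__":
    env = make_ps_env("worker", num_servers=1, num_workers=2,
                      root_uri="10.0.0.1", root_port=9000)
    for key in sorted(k for k in env if k.startswith("DMLC_")):
        print(key, "=", env[key])
```

The resulting dictionary can be passed as `env=` to `subprocess.Popen` when launching the training script. Since every node reads the same two counts at startup, adding a worker later means the running nodes would be working from a stale value.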