Restarting training on multiple machines if one of the machines dies


I am running ResNet training on multiple machines using the MXNet kvstore. If one of the machines dies (and gets restarted), is there a way for me to restart training on that machine without having to shut down training on the other machines and restart it on all machines from stored weights?



Are you running kvstore in dist_async or dist_sync mode?
When running in dist_async mode, a node (worker) going down should not impact training on the other workers, since workers don’t tightly synchronize with each other. However, if you are running in dist_sync mode, gradients for each mini-batch are aggregated from all workers before the weights are updated (on the server). So if even one worker is down, training halts on all the other workers.
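To make the dist_sync behavior concrete, here is a toy Python illustration (not actual MXNet code; `sync_update` is a made-up function) of why one dead worker stalls everyone: the server only applies an update once it has gradients from all workers for the current mini-batch.

```python
# Toy illustration of dist_sync semantics: the server aggregates
# gradients from *all* workers before updating the weights, so a
# missing worker blocks the update for everyone.

def sync_update(weights, gradients, num_workers, lr=0.1):
    """Average gradients from all workers and apply one SGD step.

    `gradients` maps worker id -> gradient for this mini-batch.
    Returns the new weights, or None if any worker's gradient is
    missing (i.e. training halts, as in dist_sync).
    """
    if len(gradients) < num_workers:
        return None  # a dead worker stalls the whole update
    avg = sum(gradients.values()) / num_workers
    return weights - lr * avg

# All three workers report their gradient: the update goes through.
print(sync_update(1.0, {0: 0.5, 1: 0.7, 2: 0.6}, num_workers=3))  # -> 0.94
# Worker 2 is down: no update is possible.
print(sync_update(1.0, {0: 0.5, 1: 0.7}, num_workers=3))  # -> None
```

In dist_async mode there is no such barrier: each worker pushes its gradient and pulls updated weights independently, so a missing worker simply contributes nothing.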


Thanks for your response. I am currently running in dist_sync mode. What happens if I restart the machine and start the MXNet program? Will the worker then reconnect with the scheduler and resume processing from where it stopped, or do I have to restart the scheduler and the other workers?


I don’t think restarting a worker will make it resume from where it left off. This hasn’t been tested, so it’s hard to tell whether it works or ever did. The best course of action is to restart everything. However, by saving checkpoints periodically you can avoid restarting training from scratch: when you (re)start training, simply check for the most recent checkpoint, load it if one exists, and continue training from there.
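A sketch of that resume logic, assuming checkpoints were written with MXNet's `mx.callback.do_checkpoint(prefix)`, which saves files named `<prefix>-NNNN.params` per epoch. The helper `latest_checkpoint_epoch` below is hypothetical (not part of MXNet) and only scans the filesystem; actually restoring the weights would then use `mx.model.load_checkpoint(prefix, epoch)` and `begin_epoch` in `fit`.

```python
import os
import re
import tempfile

def latest_checkpoint_epoch(prefix):
    """Return the highest checkpointed epoch for `prefix`, or None.

    Looks for MXNet-style checkpoint files <prefix>-NNNN.params
    (as written by mx.callback.do_checkpoint) in the prefix's
    directory. Hypothetical helper, not an MXNet API.
    """
    directory = os.path.dirname(prefix) or "."
    base = os.path.basename(prefix)
    pattern = re.compile(re.escape(base) + r"-(\d+)\.params$")
    epochs = [int(m.group(1)) for f in os.listdir(directory)
              if (m := pattern.match(f))]
    return max(epochs) if epochs else None

# Demo: pretend epochs 3 and 7 were checkpointed earlier.
tmp = tempfile.mkdtemp()
prefix = os.path.join(tmp, "resnet")
for epoch in (3, 7):
    open(f"{prefix}-{epoch:04d}.params", "w").close()
print(latest_checkpoint_epoch(prefix))  # -> 7
```

On (re)start, every worker would run this check with the same shared prefix, load the returned epoch if it is not None, and otherwise start from epoch 0.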