How to save a model in distributed training?


#1

Hi, following https://mxnet.incubator.apache.org/faq/multi_devices.html, I have successfully run the demo. But how can I save the model?
All the workers run the same Python code, so how do I know which one is the master node that should save the model?


#2

If you are using data parallelism, every node holds the same copy of the model. It doesn't matter which node you save the model from: because the gradient updates are shared, they all end up with the same weights. You can export the model at the end of training, or save checkpoints periodically during training.
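If the workers share a filesystem, having every worker write to the same path could overwrite files, so one common approach is to pick the file name from the worker's rank. The following is only a rough sketch of that idea, not the one official way to do it: `checkpoint_prefix` and `make_checkpoint_callback` are hypothetical helper names, and it assumes the script is started through the distributed launcher so that `mx.kv.create('dist_sync')` can connect.

```python
def checkpoint_prefix(base_prefix, rank):
    """Hypothetical helper: choose a checkpoint prefix for this worker.

    Rank 0 keeps the plain prefix; every other worker gets a "-<rank>"
    suffix so workers on a shared filesystem do not clobber each other.
    """
    return base_prefix if rank == 0 else "%s-%d" % (base_prefix, rank)


def make_checkpoint_callback(base_prefix, kvstore_type="dist_sync"):
    """Hypothetical helper: build the kvstore and an epoch-end checkpoint callback."""
    # Imported here so checkpoint_prefix can be used without MXNet installed.
    import mxnet as mx

    # Every worker runs this same line; kv.rank tells them apart.
    kv = mx.kv.create(kvstore_type)
    prefix = checkpoint_prefix(base_prefix, kv.rank)
    # do_checkpoint returns a callback that writes <prefix>-symbol.json
    # and <prefix>-<epoch>.params at the end of each epoch.
    return kv, mx.callback.do_checkpoint(prefix)
```

With the Module API this would be used roughly as `kv, cb = make_checkpoint_callback("my_model")` followed by `mod.fit(train_iter, kvstore=kv, epoch_end_callback=cb, num_epoch=10)`, where `mod`, `train_iter`, and the epoch count are placeholders. If you only want one copy on disk, you could instead pass the callback only when `kv.rank == 0`.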