I have been trying distributed training with MXNet. I have successfully created a cluster with two workers, one server, and one scheduler using ssh.
Now I am reading the source code of the distributed training path. From tools/launch.py, the arguments are submitted to ssh.py in dmlc-core/tracker/dmlc_tracker. Then the scheduler is started in dmlc-core/tracker/dmlc_tracker/tracker.py.
After that, a thread is started for each server and worker machine. I have tried to print out the command run on each machine, as in the following image:
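From what I can tell, the launcher distinguishes the roles purely through environment variables on the same command. The sketch below is my own illustration of that idea (the function names are hypothetical, not the actual dmlc-core code), using the DMLC_* variable names that ps-lite/dmlc-core conventionally set:

```python
# Hedged sketch: how a launcher like tools/launch.py could build the command
# it runs (e.g. over ssh) on each node. The same user command is reused on
# every machine; only the DMLC_* environment variables differ per role.
# build_remote_cmd itself is hypothetical, written for illustration only.

def build_remote_cmd(role, scheduler_host, scheduler_port, user_cmd):
    """Build the shell command to run on one node for the given role
    ("scheduler", "server", or "worker")."""
    env = {
        "DMLC_ROLE": role,                       # role of this process
        "DMLC_PS_ROOT_URI": scheduler_host,      # where the scheduler listens
        "DMLC_PS_ROOT_PORT": str(scheduler_port),
    }
    env_prefix = " ".join(f"{k}={v}" for k, v in env.items())
    return f"env {env_prefix} {user_cmd}"

if __name__ == "__main__":
    # Same training script for server and worker; only DMLC_ROLE changes.
    for role in ("server", "worker"):
        print(build_remote_cmd(role, "10.0.0.1", 9091, "python train_mnist.py"))
```

This would match what I see in the printed commands: server and worker receive the same script with different DMLC_ROLE values.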
Therefore, both the server and the worker should run the training script, which in this case is example/image-classification/train_mnist.py. However, when I added a file write inside the main function, only the worker entered it; there was no file output on the server.
So what does the server actually execute? I assume there is some mechanism that diverts the server process into a different code path, where it then processes the gradients sent by the workers, but I cannot trace the corresponding code.
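To make my assumption concrete, here is the kind of mechanism I have in mind (a minimal sketch of my own, with hypothetical names, not the real MXNet code): the process checks DMLC_ROLE early on, and if it is a server it enters a blocking serve loop and never reaches the user's main(), which would explain why my file output only appears on the workers.

```python
# Hedged sketch of the mechanism I suspect exists. maybe_divert_to_server
# and serve_loop are my own placeholder names, written for illustration.
import os


def role_of_this_process():
    """Return the role assigned via the environment ("worker" if unset)."""
    return os.environ.get("DMLC_ROLE", "worker")


def maybe_divert_to_server():
    """If this process is a server, run the serve loop instead of training.

    Returns True when the process was diverted; in a real framework the
    serve loop would block (and the process would exit afterwards), so the
    user's main() and any file output in it would never run on the server.
    """
    if role_of_this_process() == "server":
        serve_loop()
        return True
    return False


def serve_loop():
    # Placeholder: would receive pushed gradients from workers, update
    # the parameters it holds, and answer pull requests.
    pass


if __name__ == "__main__":
    os.environ["DMLC_ROLE"] = "server"
    print(maybe_divert_to_server())  # True: diverted before user code
```

Is something along these lines what actually happens, and if so, where in the source is that check?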
I would appreciate it if anyone could offer some help. Thanks a lot :)