I have been trying distributed training with MXNet. I have successfully created a cluster with two workers, one server, and one scheduler using ssh.
Now I am reading the source code of the distributed training path. From tools/launch.py, the arguments are submitted to ssh.py in dmlc-core/tracker/dmlc_tracker. Then the scheduler is started in dmlc-core/tracker/dmlc_tracker/tracker.py.
After that, a thread is started for each server and worker machine. I have tried to print out the command run on each machine, as in the following image:
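From what I can tell, the launcher distinguishes the roles purely through environment variables on the same command. The sketch below is my own illustration of that idea (the function names are hypothetical, not the actual dmlc-core code), using the DMLC_* variable names that ps-lite/dmlc-core conventionally set:

```python
# Hedged sketch: how a launcher like tools/launch.py could build the command
# it runs (e.g. over ssh) on each node. The same user command is reused on
# every machine; only the DMLC_* environment variables differ per role.
# build_remote_cmd itself is hypothetical, written for illustration only.

def build_remote_cmd(role, scheduler_host, scheduler_port, user_cmd):
    """Build the shell command to run on one node for the given role
    ("scheduler", "server", or "worker")."""
    env = {
        "DMLC_ROLE": role,                       # role of this process
        "DMLC_PS_ROOT_URI": scheduler_host,      # where the scheduler listens
        "DMLC_PS_ROOT_PORT": str(scheduler_port),
    }
    env_prefix = " ".join(f"{k}={v}" for k, v in env.items())
    return f"env {env_prefix} {user_cmd}"

if __name__ == "__main__":
    # Same training script for server and worker; only DMLC_ROLE changes.
    for role in ("server", "worker"):
        print(build_remote_cmd(role, "10.0.0.1", 9091, "python train_mnist.py"))
```

This would match what I see in the printed commands: server and worker receive the same script with different DMLC_ROLE values.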
Therefore, both the server and the worker should run the training script, which in this case is example/image-classification/train_mnist.py. However, when I added a file write inside the main function, only the worker entered it; there was no file output on the server.
So what does the server actually execute? I assume there is some mechanism that diverts the server process into a different code path, where it then processes the gradients sent by the workers, but I cannot trace the corresponding code.
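To make my assumption concrete, here is the kind of mechanism I have in mind (a minimal sketch of my own, with hypothetical names, not the real MXNet code): the process checks DMLC_ROLE early on, and if it is a server it enters a blocking serve loop and never reaches the user's main(), which would explain why my file output only appears on the workers.

```python
# Hedged sketch of the mechanism I suspect exists. maybe_divert_to_server
# and serve_loop are my own placeholder names, written for illustration.
import os


def role_of_this_process():
    """Return the role assigned via the environment ("worker" if unset)."""
    return os.environ.get("DMLC_ROLE", "worker")


def maybe_divert_to_server():
    """If this process is a server, run the serve loop instead of training.

    Returns True when the process was diverted; in a real framework the
    serve loop would block (and the process would exit afterwards), so the
    user's main() and any file output in it would never run on the server.
    """
    if role_of_this_process() == "server":
        serve_loop()
        return True
    return False


def serve_loop():
    # Placeholder: would receive pushed gradients from workers, update
    # the parameters it holds, and answer pull requests.
    pass


if __name__ == "__main__":
    os.environ["DMLC_ROLE"] = "server"
    print(maybe_divert_to_server())  # True: diverted before user code
```

Is something along these lines what actually happens, and if so, where in the source is that check?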
I would appreciate it if anyone could offer some help. Thanks a lot :)