Distributed training questions

Hello again,

Two more questions:

  1. When I use the parameter server mode (based on the official MXNet tutorial; my full code is here) and print the test loss/accuracy etc., the values differ between machines:
Epoch 0: Test_mcc 0.386387: test_Tnmt: 0.671969
Epoch 0: Test_mcc 0.370691: test_Tnmt: 0.693026

Question: is this happening because each worker has a different set of initial weights? Having followed the Horovod distributed training tutorial (I haven't managed to make it work yet), they explicitly mention broadcasting the parameters to all workers. Do we need to manually broadcast the parameters to all workers before training (or give every machine the same seed), or is this taken care of for us? (A minimal sketch of what I mean is after the second question below.)

  2. In the same Horovod tutorial, they mention that a server-to-worker ratio of ~2 gives better scaling performance for parameter server training. Is this universal, i.e. does it apply to most problems/training setups?
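
For reference, here is roughly what I mean in question 1 by "same seed" vs. "broadcast" (a minimal sketch only, assuming Horovod for MXNet is installed; `net` is just a toy placeholder, not my actual model):

```python
import mxnet as mx
import horovod.mxnet as hvd
from mxnet import gluon, init

hvd.init()

# Option A: give every machine the same seed before initializing
mx.random.seed(42)

net = gluon.nn.Dense(2, in_units=10)   # placeholder model for illustration
net.initialize(init.Xavier(), ctx=mx.cpu())

# Option B: initialize normally and broadcast rank 0's weights to all workers,
# as the Horovod tutorial suggests
hvd.broadcast_parameters(net.collect_params(), root_rank=0)
```

Is something equivalent to either option needed (or already done internally) when training with the dist kvstore / parameter server setup?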

Thank you for your time,
Foivos