[gluon] How to do distributed training with an internally implemented PS?

shuokay · May 3, 2018, 2:02pm

I am using our internally implemented parameter server. Can anyone give me an example of how to do distributed training using the Gluon API?

thomelane · May 3, 2018, 5:12pm

You can find a great tutorial for distributed training using Gluon here. And another here.

Although not Gluon-specific, this video gives a good walkthrough of distributed training with MXNet. And another can be found here.

A fully working example of distributed training, used for image classification, can be found here.

thomelane · May 3, 2018, 5:18pm

You’ll see the main ideas are:

Creating a distributed key value store with mxnet.kv.create('dist')
Sampling different batches of the data on each of the workers
Using split_and_load to partition each batch across the devices on the corresponding worker
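
To make the last two points concrete, here is a rough pure-Python sketch of the data partitioning only. It does not call MXNet, and the helper names (shard_indices, split_batch, worker_batches) are made up for illustration. In real Gluon code I'd expect the rank and worker count to come from the `rank` and `num_workers` properties of the key value store, the per-device split to be done by `gluon.utils.split_and_load`, and the store to be passed to `gluon.Trainer` via its `kvstore` argument. Strided sharding is just one possible scheme and is my assumption here.

```python
def shard_indices(num_samples, num_workers, rank):
    """Return the sample indices owned by worker `rank` (strided sharding)."""
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    return list(range(rank, num_samples, num_workers))


def split_batch(batch, num_devices):
    """Split one batch into `num_devices` equal slices, one per device."""
    if len(batch) % num_devices != 0:
        raise ValueError("batch size must be divisible by num_devices")
    step = len(batch) // num_devices
    return [batch[i * step:(i + 1) * step] for i in range(num_devices)]


def worker_batches(num_samples, num_workers, rank, batch_size, num_devices):
    """Yield each full batch for one worker, already split per device."""
    indices = shard_indices(num_samples, num_workers, rank)
    # Incomplete trailing batches are dropped to keep the slices equal.
    for start in range(0, len(indices) - batch_size + 1, batch_size):
        yield split_batch(indices[start:start + batch_size], num_devices)


# Hypothetical setup: 16 samples, 2 workers, 2 devices each, batch size 4.
for rank in range(2):
    for device_slices in worker_batches(16, 2, rank, batch_size=4, num_devices=2):
        print(rank, device_slices)
```

Each worker only ever sees its own shard, and each device on that worker only sees its slice of the batch, so no sample is processed twice within an epoch.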
