[gluon] How to do distributed training with an internally implemented PS?


#1

I am using our internally implemented parameter server. Can anyone give me an example of how to do distributed training using the Gluon API? Specifically:

  1. how to split and load data on different machines;
  2. how to compute gradients on the worker machines;
  3. how to gather gradients from the worker machines and update parameters on the master.

#2

Hi @shuokay,

You can find a great tutorial for distributed training using Gluon here. And another here.

Although not Gluon-specific, this video gives a good walkthrough of distributed training with MXNet. Another can be found here.

A fully working example of distributed training, applied to image classification, can be found here.


#3

You’ll see the main ideas are:

  1. Creating a distributed key value store with mxnet.kv.create('dist')
  2. Sampling different batches of the data on each of the workers
  3. Using split_and_load to partition each batch across the devices on the corresponding worker
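
To make ideas 2 and 3 concrete, here is a minimal sketch in plain Python with no MXNet dependency, so it only illustrates the indexing logic. The helpers shard_indices and split_batch are hypothetical names made up for this post. In a real MXNet job the rank and worker count would come from the key value store (store.rank and store.num_workers), and the per-device split would be done by gluon.utils.split_and_load, which by default expects the batch to divide evenly, unlike the helper below.

```python
def shard_indices(num_samples, rank, num_workers):
    """Return the sample indices assigned to the worker with this rank.

    Hypothetical helper: strided sharding, so worker r takes samples
    r, r + num_workers, r + 2 * num_workers, ...
    """
    return list(range(rank, num_samples, num_workers))


def split_batch(batch, num_devices):
    """Split one batch into num_devices contiguous, nearly equal slices.

    Hypothetical stand-in for the per-device split; earlier devices
    receive one extra sample when the batch does not divide evenly.
    """
    base, extra = divmod(len(batch), num_devices)
    slices, start = [], 0
    for device in range(num_devices):
        size = base + (1 if device < extra else 0)
        slices.append(batch[start:start + size])
        start += size
    return slices


# Assumed toy setup: 10 samples, 2 workers, 2 devices per worker.
num_samples, num_workers, num_devices, batch_size = 10, 2, 2, 4

for rank in range(num_workers):
    # Each worker only ever sees its own shard of the dataset.
    indices = shard_indices(num_samples, rank, num_workers)
    for i in range(0, len(indices), batch_size):
        batch = indices[i:i + batch_size]
        # Each slice would be copied to one device on this worker.
        print("worker", rank, "device slices", split_batch(batch, num_devices))
```

Each worker then computes gradients on its own device slices, and the distributed key value store is what aggregates those gradients across workers.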