Single-node multi-gpu machine


#1

Hi, I looked at this tutorial https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html and it seems that for single-machine, multi-GPU training, Gluon handles parameter synchronization. From https://mxnet.incubator.apache.org/_modules/mxnet/gluon/trainer.html it seems that the default kvstore for Trainer is ‘device’, and from http://mxnet.incubator.apache.org/test/versions/0.10/api/python/kvstore.html I see that in ‘device’ kvstore mode the weight updates are done on the GPUs, peer-to-peer when possible. Does that mean there is no parameter server when training with Gluon in single-machine, multi-GPU mode?


#2

Hi,

You’re correct that the parameter aggregation is done on the GPUs in ‘device’ mode. In single-instance training, a KVStore is still initialized lazily by the Trainer on the first call to step(), so you do have a KVStore regardless (see the Trainer source linked above).

By contrast, in multi-host training you can specify which hosts are workers and which are servers (i.e. host a KVStore shard); a node can be both.
Take a look at https://mxnet.incubator.apache.org/faq/distributed_training.html for multi-host data-parallel examples. You’ll want one of the dist_sync / dist_async / dist_device_sync / dist_device_async KVStore modes, which select between synchronous and asynchronous updates and between CPU and GPU aggregation.
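For reference, a distributed job is typically started with MXNet’s launcher, which spins up the worker and server processes for you. This is a sketch of the command shape only; the host file and training script names (hosts, train.py) are placeholders, and the -n/-s counts are arbitrary:

```shell
# Hypothetical launch: 2 workers, 2 parameter servers over ssh.
# The kvstore mode is chosen by the training script itself
# (e.g. mx.kvstore.create('dist_sync') or a --kv-store flag it defines).
python tools/launch.py -n 2 -s 2 --launcher ssh -H hosts \
    python train.py --kv-store dist_sync
```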

Hope that helps,
Vishaal


#3

Thanks, much clearer. I still have a couple of questions:
(1) In single-instance training, how is the gradient update to the KVStore performed? You say aggregation is done on the GPU, but what if you have multiple GPUs? Does every GPU get gradients from all the other GPUs, or is there a smarter update scheme like Horovod’s ring-style reduction? (https://eng.uber.com/horovod/)
(2) A popular multi-node MXNet deployment is the one featured at https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/, where dedicated, non-training instances are used to host the KVStore. In that setting, since the KVStore is hosted on CPU-only machines, there is no way to use dist_device_async, right?
Thanks again!


#4

Hi,

  1. In local mode with multiple GPUs, there is a KVStore to store the parameters, and the trainer is responsible for doing the aggregation (the update). Gradients are computed in their respective contexts - the GPUs on which their parameter copies live; a gradient is never computed on a GPU in another context. In multi-instance training you can instead do the update on the parameter servers. Note that Horovod is a distributed strategy involving multiple nodes, so it isn’t applicable to local training. There’s a discussion of Horovod integration for MXNet you might be interested in reading: https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
  2. If the parameter servers are hosted on CPU-only instances, then you’re correct: you wouldn’t be able to use the device-side modes (dist_device_sync/dist_device_async). You’d have to use ‘dist_async’ or ‘dist_sync’. The device modes only work if the parameter servers have GPUs.

Vishaal