Single-node, multi-GPU machine


Hi, I looked at this tutorial, and it seems that for single-machine, multi-GPU training, Gluon handles parameter synchronization, and from this it seems that the default kvstore for Trainer is 'device'. From what I see, in 'device' kvstore mode the weight updates are done on the GPU, peer-to-peer when possible. Does that mean there is no parameter server when training with Gluon in single-machine, multi-GPU mode?



You’re correct that parameter aggregation is done on the GPU in ‘device’ mode. In single-instance training, a KVStore is initialized by the Trainer in the step() method, so you do actually have a KVStore regardless. See
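If it helps to picture what that Trainer-managed KVStore is doing, here is a toy sketch in plain Python (not the actual MXNet API; the class and method names are made up for illustration): each device pushes its local gradient for a key, the store sums them, applies the update, and every device pulls back the same weights.

```python
# Toy sketch of KVStore-style aggregation (illustrative only, not MXNet code).

class ToyKVStore:
    """Holds one weight vector per key; push sums gradients and updates."""

    def __init__(self, lr=0.1):
        self.weights = {}
        self.lr = lr

    def init(self, key, value):
        self.weights[key] = list(value)

    def push(self, key, grads_per_device):
        # Aggregate gradients from all devices (conceptually what 'device'
        # mode does on GPU, peer-to-peer when possible), then apply SGD.
        summed = [sum(g) for g in zip(*grads_per_device)]
        w = self.weights[key]
        self.weights[key] = [wi - self.lr * gi for wi, gi in zip(w, summed)]

    def pull(self, key):
        # Every device receives the same updated weights.
        return list(self.weights[key])

kv = ToyKVStore(lr=0.1)
kv.init("w", [1.0, 1.0])
# Two "GPUs" each push their local gradient for the same key:
kv.push("w", [[0.5, 0.5], [0.5, 0.5]])
print(kv.pull("w"))  # weights move from 1.0 toward 0.9
```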

To contrast, in multi-host training you can specify which hosts are workers and which are servers (the ones hosting a KVStore). A node can be both.
Take a look at for multi-host data-parallel examples. You’ll want to use one of the dist_sync/dist_async/dist_device_sync/dist_device_async kvstore modes, which select synchronous vs. asynchronous updates and CPU vs. GPU aggregation.
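For reference, multi-host jobs with these kvstore modes are typically started with MXNet’s tools/launch.py helper. A command-line sketch (the hostfile name and training script here are placeholders, and flags may vary by MXNet version):

```shell
# Launch 2 workers and 2 servers over ssh; 'hosts' and 'train.py' are
# placeholders for your own hostfile and training script.
python tools/launch.py -n 2 -s 2 --launcher ssh -H hosts \
    python train.py --kv-store dist_sync
```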

Hope that helps,


thanks, much clearer. Still have a couple of questions:
(1) in single-instance training, how is the gradient update to the KVStore performed? You say that aggregation is done on the GPU, but what if you have multiple GPUs? Does every GPU get gradients from all other GPUs, or is a smarter update scheme used, like Horovod’s ring-style reduction?
(2) a popular multi-node MXNet deployment is the one featured there, where dedicated, non-training instances are used to host the KVStore. In this setting, since the KVStore is hosted on CPU-only machines, there is no possibility to use dist_device_async, right?
thanks again!



  1. In local mode with multiple GPUs, there is a KVStore that stores the parameters, and the trainer is responsible for doing the aggregation (update). Gradients are computed in their respective contexts, i.e., on the GPUs where their parameters were defined; a gradient wouldn’t be computed on a GPU in another context. In multi-instance training you can do the update on the parameter servers. Note that Horovod is a distributed strategy involving multiple nodes, so it wouldn’t be applicable to local training. There’s a discussion about Horovod integration for MXNet you might be interested in reading:
  2. If the parameter servers are hosted on CPU-only instances, then you’re correct that you wouldn’t be able to use the ‘device’ variants; you’d have to use ‘dist_async’ or ‘dist_sync’. You could only use the ‘device’ variants if the parameter servers had GPUs.
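On the sync vs. async distinction behind those mode names, a toy sketch in plain Python (not MXNet code; the objective f(w) = w²/2 with grad(w) = w is made up for illustration) shows why they diverge: synchronous updates aggregate every worker’s gradient computed at the *same* weights and apply one step, while asynchronous updates apply each worker’s gradient as it arrives, so later gradients see already-updated weights.

```python
# Toy contrast of synchronous vs. asynchronous updates (illustrative only,
# not MXNet code). We minimize f(w) = w**2 / 2, so grad(w) = w.

def grad(w):
    return w

def sync_step(w, n_workers, lr=0.1):
    """dist_sync-style: every worker computes its gradient at the same
    weights, the gradients are aggregated, and one update is applied."""
    total = sum(grad(w) for _ in range(n_workers))
    return w - lr * total

def async_steps(w, n_workers, lr=0.1):
    """dist_async-style: each worker's gradient is applied as it arrives,
    so later workers compute gradients at already-updated weights."""
    for _ in range(n_workers):
        w = w - lr * grad(w)
    return w

print(sync_step(10.0, 3))    # one aggregated step: 10 - 0.1*30 = 7.0
print(async_steps(10.0, 3))  # three sequential steps: ~7.29
```

With identical, fixed gradients the two would coincide; the difference appears exactly because asynchronous workers update against stale or shifting weights, which is the trade-off the dist_async modes accept for throughput.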