Single-node multi-gpu machine


#1

Hi, I looked at this tutorial https://gluon.mxnet.io/chapter07_distributed-learning/multiple-gpus-gluon.html and it seems that for single-machine, multi-GPU training, Gluon handles parameter synchronization. From https://mxnet.incubator.apache.org/_modules/mxnet/gluon/trainer.html it seems that the default kvstore for Trainer is ‘device’, and from http://mxnet.incubator.apache.org/test/versions/0.10/api/python/kvstore.html I see that in ‘device’ kvstore mode the weight updates are done on the GPUs, peer-to-peer when possible. Does that mean there is no parameter server when training with Gluon in single-machine, multi-GPU mode?


#2

Hi,

You’re correct that the parameter aggregation is done on the GPUs in ‘device’ mode. In single-instance training, a KVStore is still initialized lazily by the Trainer on the first call to step(), so you do have a KVStore regardless (see the Trainer source linked above).

By contrast, in multi-host training you can specify which hosts are workers and which are servers (i.e. host a KVStore shard); a node can be both.
Take a look at https://mxnet.incubator.apache.org/faq/distributed_training.html for multi-host data-parallel examples. You’ll want one of the dist_sync / dist_async / dist_device_sync / dist_device_async KVStore modes, which select between synchronous and asynchronous updates and between CPU and GPU aggregation.
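For reference, a distributed job is typically started with MXNet’s launcher, which spins up the worker and server processes for you. This is a sketch of the command shape only; the host file and training script names (hosts, train.py) are placeholders, and the -n/-s counts are arbitrary:

```shell
# Hypothetical launch: 2 workers, 2 parameter servers over ssh.
# The kvstore mode is chosen by the training script itself
# (e.g. mx.kvstore.create('dist_sync') or a --kv-store flag it defines).
python tools/launch.py -n 2 -s 2 --launcher ssh -H hosts \
    python train.py --kv-store dist_sync
```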

Hope that helps,
Vishaal


#3

Thanks, much clearer. I still have a couple of questions:
(1) In single-instance training, how is the gradient update to the KVStore performed? You say aggregation is done on the GPU, but what if you have multiple GPUs? Does every GPU get gradients from all the other GPUs, or is there a smarter update scheme like Horovod’s ring-style reduction? (https://eng.uber.com/horovod/)
(2) A popular multi-node MXNet deployment is the one featured at https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/, where dedicated, non-training instances are used to host the KVStore. In that setting, since the KVStore is hosted on CPU-only machines, there is no way to use dist_device_async, right?
Thanks again!


#4

Hi,

  1. In local mode with multiple GPUs, there is a KVStore to store the parameters, and the trainer is responsible for doing the aggregation (the update). Gradients are computed in their respective contexts - the GPUs on which their parameter copies live; a gradient is never computed on a GPU in another context. In multi-instance training you can instead do the update on the parameter servers. Note that Horovod is a distributed strategy involving multiple nodes, so it isn’t applicable to local training. There’s a discussion of Horovod integration for MXNet you might be interested in reading: https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration
  2. If the parameter servers are hosted on CPU-only instances, then you’re correct: you wouldn’t be able to use the device-side modes (dist_device_sync/dist_device_async). You’d have to use ‘dist_async’ or ‘dist_sync’. The device modes only work if the parameter servers have GPUs.

Vishaal