Gradient compression default?

olivcruche · December 4, 2018, 11:25am

Hi, from https://mxnet.incubator.apache.org/faq/gradient_compression.html I understand that gradient compression is enabled by default in Gluon when kvstore='device'. Does that mean that gradient compression is enabled by default for single-node, multi-GPU training with Gluon?

sad · December 4, 2018, 7:49pm

No, I don’t believe that’s the case. For single-node, multi-GPU training, I believe you have to explicitly pass kvstore='device'; otherwise the default, local, is used.

olivcruche · December 4, 2018, 8:18pm

But the default is device, no? In the code at https://mxnet.incubator.apache.org/_modules/mxnet/gluon/trainer.html I see:

    def __init__(self, params, optimizer, optimizer_params=None, kvstore='device', compression_params=None, update_on_kvstore=None):

sad · December 4, 2018, 8:48pm

Yes, you’re right, the default is device. I just assumed it was local, but I think the answer to your question is still no. I believe this line

“When kvstore is device , the communication between GPUs is compressed.”

is referring to what happens when you enable gradient compression by passing in compression_params, i.e. it details which communication is compressed once compression is enabled; it does not state that compression is enabled by default.
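To make the distinction concrete, here is a minimal sketch of the two ways a Trainer could be set up. It assumes the Trainer signature quoted above and the compression_params format shown in the gradient compression FAQ ({'type': '2bit', 'threshold': 0.5}); the helper names trainer_kwargs and build_trainer, the optimizer choice, and the learning rate are hypothetical and only there for illustration.

```python
def trainer_kwargs(compress=False, threshold=0.5):
    """Build keyword arguments for gluon.Trainer (hypothetical helper).

    With compress=False nothing compression-related is passed, so
    compression_params keeps its default of None.
    """
    kwargs = {"kvstore": "device"}  # same value as the Trainer default
    if compress:
        # Format taken from the gradient compression FAQ.
        kwargs["compression_params"] = {"type": "2bit", "threshold": threshold}
    return kwargs


def build_trainer(net, compress=False):
    """Create a Trainer for an initialized Gluon block (needs MXNet installed)."""
    from mxnet import gluon  # imported lazily so the helper above works without MXNet

    return gluon.Trainer(
        net.collect_params(),
        "sgd",                   # illustrative optimizer choice
        {"learning_rate": 0.1},  # illustrative value
        **trainer_kwargs(compress),
    )
```

Under these assumptions, build_trainer(net) passes no compression_params at all, while build_trainer(net, compress=True) opts in explicitly.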

VishaalKapoor · December 7, 2018, 11:09pm

Hey @olivcruche and @sad, just adding to the discussion, as the question has already been answered (it does not mean that gradient compression is enabled by default on single-node, multi-GPU training; compression is turned on and configured with the compression_params kwarg to Trainer).

The docs in KVStore.py are ambiguous:

    When kvstore is 'local', gradient compression is used to reduce communication
    between multiple devices (gpus). Gradient is quantized on each GPU which
    computed the gradients, then sent to the GPU which merges the gradients. This
    receiving GPU dequantizes the gradients and merges them. Note that this
    increases memory usage on each GPU because of the residual array stored.

    When kvstore is 'dist', gradient compression is used to reduce communication
    from worker to sender. Gradient is quantized on each worker which
    computed the gradients, then sent to the server which dequantizes
    this data and merges the gradients from each worker. Note that this
    increases CPU memory usage on each worker because of the residual array stored.
    Only worker to server communication is compressed in this setting.
    If each machine has multiple GPUs, currently this GPU to GPU or GPU to CPU communication
    is not compressed. Server to worker communication (in the case of pull)
    is also not compressed.

The first paragraph is ambiguous because it mentions a ‘local’ KVStore in the context of GPUs. Here ‘local’ does not mean the KVStore type ‘local’, but rather the type ‘device’. So it looks like ‘local’ should be changed to ‘device’ in the docstring to make it less ambiguous.
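For readers wondering what the "residual array" in the docstring is for, the following is a rough pure-Python sketch of threshold-based 2-bit quantization with error feedback, in the spirit of what the FAQ describes. It is not MXNet's implementation: the function name quantize_2bit, the list-based representation, and the handling of values exactly at the threshold are assumptions made for illustration.

```python
def quantize_2bit(grad, residual, threshold=0.5):
    """Quantize each gradient entry to +threshold, -threshold, or 0.

    `residual` holds the quantization error left over from earlier calls.
    It is updated in place, which is why each sender needs extra memory.
    """
    quantized = []
    for i, g in enumerate(grad):
        acc = residual[i] + g        # add the error carried over from before
        if acc >= threshold:
            q = threshold
        elif acc <= -threshold:
            q = -threshold
        else:
            q = 0.0
        residual[i] = acc - q        # keep whatever was not sent
        quantized.append(q)
    return quantized


# The same small gradient sent twice: entries that were too small to send
# the first time accumulate in the residual and may be sent later.
residual = [0.0, 0.0, 0.0]
first = quantize_2bit([0.7, -0.2, 0.3], residual)   # [0.5, 0.0, 0.0]
second = quantize_2bit([0.7, -0.2, 0.3], residual)  # [0.5, 0.0, 0.5]
```

The receiver only ever sees the three possible values per entry, which is what allows them to be packed into two bits; the residual stays on the sender.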
