Gradient compression default?


#1

Hi, from https://mxnet.incubator.apache.org/faq/gradient_compression.html I understand that gradient compression is enabled by default in gluon when kvstore='device'. Does that mean that gradient compression is enabled by default on single-node, multi-GPU training with gluon?


#2

No, I don’t believe that’s the case. For single-node, multi-GPU training, I believe you have to explicitly pass kvstore='device'; otherwise the default, 'local', is used.


#3

But the default is device, no? In the code at https://mxnet.incubator.apache.org/_modules/mxnet/gluon/trainer.html I see:

    def __init__(self, params, optimizer, optimizer_params=None, kvstore='device', compression_params=None, update_on_kvstore=None):


#4

Yes, you’re right, the default is device. I had assumed it was local, but I think the answer to your question is still no. I believe this line

“When kvstore is device , the communication between GPUs is compressed.”

is referring to what happens once you enable gradient compression by passing in compression_params, i.e. it describes which communication is compressed when compression is enabled; it does not state that compression is enabled by default.
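To make the difference concrete, here is a minimal sketch of the two ways a Trainer could be constructed. The helper names `trainer_kwargs` and `make_trainer` are made up for illustration, and the `{'type': '2bit', 'threshold': 0.5}` values are assumed from the gradient compression FAQ linked above; only `kvstore` and `compression_params` come from the `__init__` signature quoted earlier in the thread.

```python
def trainer_kwargs(compress=False, threshold=0.5):
    """Build keyword arguments for gluon.Trainer (hypothetical helper).

    With compress=False, compression_params is simply not passed, so it
    keeps its default of None from the signature quoted above.
    """
    kwargs = {"kvstore": "device"}
    if compress:
        # Assumed values, taken from the gradient compression FAQ.
        kwargs["compression_params"] = {"type": "2bit", "threshold": threshold}
    return kwargs


def make_trainer(net, compress=False):
    """Create a Trainer for a gluon block (hypothetical helper, needs mxnet)."""
    from mxnet import gluon  # imported lazily so the helper above runs without mxnet

    return gluon.Trainer(
        net.collect_params(),
        "sgd",
        {"learning_rate": 0.1},
        **trainer_kwargs(compress),
    )


default_kwargs = trainer_kwargs()
compressed_kwargs = trainer_kwargs(compress=True)
```

Calling `make_trainer(net)` leaves `compression_params` at its default of `None`, while `make_trainer(net, compress=True)` passes it explicitly.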


#5

Hey @olivcruche and @sad, just adding to the discussion as the question has already been answered: it does not mean that gradient compression is enabled by default on single-node, multi-GPU training. compression_params defaults to None, and the behavior can be changed by passing the compression_params kwarg to Trainer.

The docs in KVStore.py are ambiguous:

    When kvstore is 'local', gradient compression is used to reduce communication
    between multiple devices (gpus). Gradient is quantized on each GPU which
    computed the gradients, then sent to the GPU which merges the gradients. This
    receiving GPU dequantizes the gradients and merges them. Note that this
    increases memory usage on each GPU because of the residual array stored.

    When kvstore is 'dist', gradient compression is used to reduce communication
    from worker to sender. Gradient is quantized on each worker which
    computed the gradients, then sent to the server which dequantizes
    this data and merges the gradients from each worker. Note that this
    increases CPU memory usage on each worker because of the residual array stored.
    Only worker to server communication is compressed in this setting.
    If each machine has multiple GPUs, currently this GPU to GPU or GPU to CPU communication
    is not compressed. Server to worker communication (in the case of pull)
    is also not compressed.

The first paragraph is ambiguous because it mentions a ‘local’ KVStore in the context of GPUs. Here ‘local’ does not mean the kvstore type ‘local’, but rather the kvstore type ‘device’. So it looks like ‘local’ should be changed to ‘device’ in the docstring to remove the ambiguity.
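For anyone wondering what the "residual array" in that docstring is for, the quantize, send, dequantize and merge flow can be sketched in plain Python. This is only an illustration under my own assumptions, not MXNet's implementation: I assume a 2-bit style scheme where each value is sent as +threshold, -threshold or 0, and the function names `quantize_2bit` and `merge` are made up.

```python
def quantize_2bit(grad, residual, threshold=0.5):
    """Quantize one device's gradient; update its residual in place.

    Assumed scheme: add the stored residual to the new gradient, send
    +threshold / -threshold / 0, and keep whatever was not sent.
    """
    quantized = []
    for i, g in enumerate(grad):
        acc = residual[i] + g
        if acc >= threshold:
            q = threshold
        elif acc <= -threshold:
            q = -threshold
        else:
            q = 0.0
        residual[i] = acc - q  # the unsent part is carried to the next step
        quantized.append(q)
    return quantized


def merge(quantized_per_device):
    """Sum the (already dequantized) gradients on the receiving device."""
    return [sum(vals) for vals in zip(*quantized_per_device)]


# Two devices, one residual array per device (this is the extra memory).
residual_gpu0 = [0.0, 0.0, 0.0]
residual_gpu1 = [0.0, 0.0, 0.0]

step1 = merge([
    quantize_2bit([0.7, 0.2, -0.9], residual_gpu0),
    quantize_2bit([0.1, 0.4, -0.3], residual_gpu1),
])

# On the next step the residual pushes small values over the threshold.
step2_gpu1 = quantize_2bit([0.1, 0.4, -0.3], residual_gpu1)
```

In this sketch the second device sends nothing on the first step because all of its values are below the threshold, but its residual makes the second and third values cross the threshold on the next step, which is why the residual has to be kept in memory on each device.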