How to make gradient accumulation work in MXNet?

Hi, I’d like to accumulate gradients and only update the weights every N minibatches, so I can handle effective batch sizes larger than GPU memory. I’m following this forum post, but the details are sparse and I can’t get it to work.

Let’s take an example with N = 3.
To know when to aggregate gradients and update the weights, I maintain a batch_counter that is incremented at every minibatch.

First, I configure the net this way:

for p in net.collect_params().values():
    if p.grad_req != 'null':
        p.grad_req = 'add'

Then at every mini-batch I run this:

if batch_counter == 3:
    trainer.step(3)
    for p in net.collect_params().values():
        p.zero_grad()
    batch_counter = 0
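
For context, here is roughly how those two pieces sit in my training loop (a simplified sketch; net, trainer, loss_fn, train_data and ctx stand in for my actual SSD objects):

batch_counter = 0
for data, label in train_data:
    data = data.as_in_context(ctx)
    label = label.as_in_context(ctx)

    with autograd.record():
        output = net(data)
        loss = loss_fn(output, label)
    loss.backward()  # gradients accumulate because grad_req is 'add'

    batch_counter += 1
    if batch_counter == 3:
        trainer.step(3)
        for p in net.collect_params().values():
            p.zero_grad()
        batch_counter = 0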

Doing this doesn’t train my SSD model: loss and mAP are erratic. When I do this accumulation over just 1 batch (i.e. classic minibatch training, without accumulation), it trains correctly.

Can someone explain how to make gradient accumulation work in MXNet? There really should be a better tutorial for this, given how useful and important the feature is.

I’m tempted to do the same thing that is done for multi-GPU training, e.g. along the lines of:

from mxnet import nd, autograd

# loop through epochs
for e in range(3):  # 3 epochs

    # loop through logical batches (DataLoader built at macro-batch scale)
    for data, label in train_data:

        batch_size = data.shape[0]  # full logical batch size (4x the GPU-level batch here)

        # split into microbatches (GPU-level batches)
        data = nd.split(data, num_outputs=4, axis=0)
        label = nd.split(label, num_outputs=4, axis=0)

        # compute losses, accumulating gradients (grad_req is still 'add' as above)
        for D, L in zip(data, label):

            # copy data to device
            D = D.as_in_context(ctx)
            L = L.as_in_context(ctx)

            with autograd.record():
                output = net(D)
                loss = SCE(output, L)

            # backprop; gradients add up across microbatches
            loss.backward()
            accuracy.update(L, output)

        # update once per logical batch, using the full batch size, then reset gradients
        trainer.step(batch_size)
        for p in net.collect_params().values():
            p.zero_grad()

Thoughts on this approach? Does it look correct?

Hi,

Can you please post your complete code? Looking at an old version of my code, this is the forward/backward step that works for me (multi-GPU training). Note that my models are not pre-trained (so I can’t think of a case where I would see grad_req='null'), and I never had to check whether the initial grad_req is 'null' as you do. Here _nbatch is the total batch size, and _data, _label come from gluon.utils.split_and_load (i.e. they are lists of NDArrays).

delay_rate = 8  # update only once every delay_rate iterations (gradient accumulation)
def forward_backward_step(_iteration, _nbatch, _net, _data, _label):
    with autograd.record():
        # First argument is PREDICTIONS, second is LABELS
        losses = [SomeLossFunction(_net(inputs), labels) for inputs, labels in zip(_data, _label)]

    # This is outside the autograd.record scope
    for l in losses:  # evaluate gradients in each ctx
        l.backward()

    # This updates the parameters after aggregating the gradients across ALL devices. <3 Gluon!
    if _iteration % delay_rate == 0:
        trainer.step(_nbatch * delay_rate)
        for param in _net.collect_params().values():
            param.zero_grad()

    return losses

This is used in something like:

mynet = SomeNetDefinition()  # placeholder for the actual network
Nbatch = batch_per_gpu * len(ctx)  # that is, total batch size across all available GPUs
for idx, (data, label) in enumerate(SomeDataLoader):
    data = gluon.utils.split_and_load(data, ctx)
    label = gluon.utils.split_and_load(label, ctx)

    losses = forward_backward_step(idx, Nbatch, mynet, data, label)
    # do other stuff/monitoring etc.

If you post a working part of your code (even multi-GPU), I can test it and help.
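
In the meantime, here is a minimal, self-contained sketch that combines your grad_req='add' setup with a delayed trainer.step on a toy single-device network, so you can check the mechanics in isolation (the Dense net, L2 loss, random data and hyperparameters are placeholders I made up, not your SSD setup):

import mxnet as mx
from mxnet import nd, autograd, gluon

ctx = mx.cpu()  # or mx.gpu(0)
accumulate = 4  # number of minibatches to accumulate before one update

# toy network, loss and optimizer, just to exercise the mechanics
net = gluon.nn.Dense(1)
net.initialize(ctx=ctx)
loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

# gradients must be summed across backward() calls, not overwritten
for p in net.collect_params().values():
    p.grad_req = 'add'

# random toy data
X = nd.random.normal(shape=(32, 8))
Y = nd.random.normal(shape=(32, 1))
loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, Y), batch_size=8)

for epoch in range(2):
    for i, (data, label) in enumerate(loader):
        data, label = data.as_in_context(ctx), label.as_in_context(ctx)
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()  # adds to the existing gradients because grad_req='add'

        if (i + 1) % accumulate == 0:
            trainer.step(data.shape[0] * accumulate)  # normalize by the total samples seen
            for p in net.collect_params().values():
                p.zero_grad()  # reset accumulated gradients after the update

The key points are grad_req='add' so backward() accumulates instead of overwriting, zero_grad() after every update, and scaling the step by the total number of samples that contributed to the accumulated gradient.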