How to make gradient accumulation work in MXNet?

Hi, I’d like to accumulate gradients and only update the weights every N minibatches, so I can handle effective batch sizes larger than GPU memory. I’m following this forum post, but the details are sparse and I can’t get it to work.

Let’s take an example with N = 3.
To know when to aggregate gradients and update the weights, I maintain a batch_counter that is incremented at every minibatch.

First, I configure the net this way:

for p in net.collect_params().values():
    if p.grad_req != 'null':
        p.grad_req = 'add'

Then at every mini-batch I run this:

if batch_counter == 3:
    trainer.step(3)
    for p in net.collect_params().values():
        p.zero_grad()
    batch_counter = 0
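
For context, here is roughly how those two pieces sit in my training loop (a simplified sketch; net, trainer, loss_fn, train_data and ctx stand in for my actual SSD objects):

batch_counter = 0
for data, label in train_data:
    data = data.as_in_context(ctx)
    label = label.as_in_context(ctx)

    with autograd.record():
        output = net(data)
        loss = loss_fn(output, label)
    loss.backward()  # gradients accumulate because grad_req is 'add'

    batch_counter += 1
    if batch_counter == 3:
        trainer.step(3)
        for p in net.collect_params().values():
            p.zero_grad()
        batch_counter = 0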

Doing this doesn’t train my SSD model: loss and mAP are erratic. When I do this accumulation over just 1 batch (i.e. classic minibatch training, without accumulation), it trains correctly.

Can someone explain how to make gradient accumulation work in MXNet? There really should be a better tutorial for this, given how useful and important the feature is.

I’m tempted to do the same thing that is done for multi-GPU training, e.g. along the lines of:

from mxnet import nd, autograd

# loop through epochs
for e in range(3):  # 3 epochs

    # loop through logical batches (DataLoader built at macro-batch scale)
    for data, label in train_data:

        batch_size = data.shape[0]  # full logical batch size (4x the GPU-level batch here)

        # split into microbatches (GPU-level batches)
        data = nd.split(data, num_outputs=4, axis=0)
        label = nd.split(label, num_outputs=4, axis=0)

        # compute losses, accumulating gradients (grad_req is still 'add' as above)
        for D, L in zip(data, label):

            # copy data to device
            D = D.as_in_context(ctx)
            L = L.as_in_context(ctx)

            with autograd.record():
                output = net(D)
                loss = SCE(output, L)

            # backprop; gradients add up across microbatches
            loss.backward()
            accuracy.update(L, output)

        # update once per logical batch, using the full batch size, then reset gradients
        trainer.step(batch_size)
        for p in net.collect_params().values():
            p.zero_grad()

Thoughts on this approach? Does it look correct?

Hi,

Can you please post your complete code? Looking at an old version of my code, this is the forward/backward step that works for me (multi-GPU training). Note that my models are not pre-trained (so I can’t think of a case where I would see grad_req='null'), and I never had to check whether the initial grad_req is 'null' as you do. Here _nbatch is the total batch size, and _data, _label come from gluon.utils.split_and_load (i.e. they are lists of NDArrays).

delay_rate = 8  # update only once every delay_rate iterations (gradient accumulation)
def forward_backward_step(_iteration, _nbatch, _net, _data, _label):
    with autograd.record():
        # First argument is PREDICTIONS, second is LABELS
        losses = [SomeLossFunction(_net(inputs), labels) for inputs, labels in zip(_data, _label)]

    # This is outside the autograd.record scope
    for l in losses:  # evaluate gradients in each ctx
        l.backward()

    # This updates the parameters after aggregating the gradients across ALL devices. <3 Gluon!
    if _iteration % delay_rate == 0:
        trainer.step(_nbatch * delay_rate)
        for param in _net.collect_params().values():
            param.zero_grad()

    return losses

This is used in something like:

mynet = SomeNetDefinition()  # placeholder for the actual network
Nbatch = batch_per_gpu * len(ctx)  # that is, total batch size across all available GPUs
for idx, (data, label) in enumerate(SomeDataLoader):
    data = gluon.utils.split_and_load(data, ctx)
    label = gluon.utils.split_and_load(label, ctx)

    losses = forward_backward_step(idx, Nbatch, mynet, data, label)
    # do other stuff/monitoring etc.

If you post a working part of your code (even multi-GPU), I can test it and help.
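
In the meantime, here is a minimal, self-contained sketch that combines your grad_req='add' setup with a delayed trainer.step on a toy single-device network, so you can check the mechanics in isolation (the Dense net, L2 loss, random data and hyperparameters are placeholders I made up, not your SSD setup):

import mxnet as mx
from mxnet import nd, autograd, gluon

ctx = mx.cpu()  # or mx.gpu(0)
accumulate = 4  # number of minibatches to accumulate before one update

# toy network, loss and optimizer, just to exercise the mechanics
net = gluon.nn.Dense(1)
net.initialize(ctx=ctx)
loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

# gradients must be summed across backward() calls, not overwritten
for p in net.collect_params().values():
    p.grad_req = 'add'

# random toy data
X = nd.random.normal(shape=(32, 8))
Y = nd.random.normal(shape=(32, 1))
loader = gluon.data.DataLoader(gluon.data.ArrayDataset(X, Y), batch_size=8)

for epoch in range(2):
    for i, (data, label) in enumerate(loader):
        data, label = data.as_in_context(ctx), label.as_in_context(ctx)
        with autograd.record():
            loss = loss_fn(net(data), label)
        loss.backward()  # adds to the existing gradients because grad_req='add'

        if (i + 1) % accumulate == 0:
            trainer.step(data.shape[0] * accumulate)  # normalize by the total samples seen
            for p in net.collect_params().values():
                p.zero_grad()  # reset accumulated gradients after the update

The key points are grad_req='add' so backward() accumulates instead of overwriting, zero_grad() after every update, and scaling the step by the total number of samples that contributed to the accumulated gradient.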