Hi, I’d like to accumulate gradients and update the weights only every N minibatches, to handle effective batch sizes larger than what fits in GPU memory. I’m following this forum post, but details are lacking and I can’t make it work.
Let’s take an example with N = 3.
To know when to aggregate gradients and update the weights, I maintain a batch_counter that is incremented at every batch.
First, I configure the net this way:
```python
# Make every trainable parameter accumulate gradients across backward passes
# instead of overwriting them.
for p in net.collect_params().values():
    if p.grad_req != 'null':
        p.grad_req = 'add'
```
Then at every mini-batch I run this:
```python
if batch_counter == 3:
    trainer.step(3)  # update with the accumulated gradients, normalized by 3
    for p in net.collect_params().values():
        p.zero_grad()  # clear the accumulated gradients
    batch_counter = 0
```
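For reference, here is a minimal sketch of how I’m assembling these pieces in my training loop (the train_loader, loss_fn, and trainer names are placeholders for my actual SSD data pipeline, multi-box loss, and optimizer setup):

```python
from mxnet import autograd

accumulate = 3
batch_counter = 0

for data, label in train_loader:
    with autograd.record():
        output = net(data)
        loss = loss_fn(output, label)
    loss.backward()        # gradients are added into the param grads (grad_req='add')
    batch_counter += 1

    if batch_counter == accumulate:
        trainer.step(accumulate)   # normalize by the number of accumulated batches
        for p in net.collect_params().values():
            p.zero_grad()
        batch_counter = 0
```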
Doing this doesn’t train the SSD: the loss and mAP are erratic. When I run the same code with N = 1 (the classical minibatch setup, without accumulation), it trains correctly.
Can someone explain to me how to make gradient accumulation work in MXNet? There needs to be a better tutorial for this, given how useful and important this feature is.