Difference between loss.backward() and mx.autograd.backward([loss])

Dear all,
I’m having trouble figuring out the difference between using .backward() and mx.autograd.backward() to calculate gradients. The documentation doesn’t explain the difference. When I was optimizing two objective functions, loss1 and loss2, by calling loss1.backward(), loss2.backward(), I got this error:

Check failed: !AGInfo::IsNone(*i) Cannot differentiate node because it is not in a computational graph. You need to set is_recording to true or use autograd.record() to save computational graphs for backward. If you want to differentiate the same graph twice, you need to pass retain_graph=True to backward.

Meanwhile, mx.autograd.backward([loss1, loss2]) works fine.

Any help is appreciated.

As far as I know, when you call autograd.backward, it goes through all the heads you provide, calculates the gradients, and sums them into the grad properties of the respective parameters. After that, the graph is freed.

So it is the same as summing up the losses manually and calling backward on the summed loss:

from mxnet import autograd

with autograd.record():          # record the graph for both losses
    out = model(data1)
    loss1 = loss_fn1(out, label)
    loss2 = loss_fn2(out, label)
    loss = loss1 + loss2         # single head: the sum of both losses

loss.backward()                  # gradients of loss1 + loss2 land in the parameters' grads
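
For comparison, here is the same thing written with the multi-head call from your example (a minimal sketch; model, data1, label, loss_fn1, and loss_fn2 are assumed to be defined as above):

import mxnet as mx
from mxnet import autograd

with autograd.record():
    out = model(data1)
    loss1 = loss_fn1(out, label)
    loss2 = loss_fn2(out, label)

# one call over both heads; the summed gradients end up in the parameters' grad arrays
mx.autograd.backward([loss1, loss2])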

Compared to that, when you call backward separately on the losses, the graph is destroyed by default after the first call, and the second call fails because there is no graph anymore. You can change this behaviour by preserving the graph after the first call: loss1.backward(retain_graph=True).
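
For example, something like this should run both backward passes (a sketch under the same assumptions as the code above):

from mxnet import autograd

with autograd.record():
    out = model(data1)
    loss1 = loss_fn1(out, label)
    loss2 = loss_fn2(out, label)

loss1.backward(retain_graph=True)  # keep the graph so it can be traversed again
loss2.backward()                   # runs, but see the note on overwriting below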

But notice that the call to loss2.backward() will overwrite the gradients calculated from the first loss. If you want them to be accumulated, you need to set model.collect_params().setattr('grad_req', 'add') - this keeps the gradients and sums them up (don’t forget to zero out the gradients after each iteration - https://mxnet.incubator.apache.org/api/python/gluon/gluon.html?highlight=zero_#mxnet.gluon.ParameterDict.zero_grad)
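
Putting it together, a rough sketch of an accumulating training step could look like this (trainer, data_iter, and batch_size are my own placeholder names, not part of your code):

model.collect_params().setattr('grad_req', 'add')  # accumulate instead of overwrite

for data1, label in data_iter:
    with autograd.record():
        out = model(data1)
        loss1 = loss_fn1(out, label)
        loss2 = loss_fn2(out, label)
    loss1.backward(retain_graph=True)
    loss2.backward()                               # gradients of both losses are now summed
    trainer.step(batch_size)
    model.collect_params().zero_grad()             # must zero manually with grad_req='add'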


Thank you very much for such a clear explanation.