Automatic Differentiation




I can’t understand what the head gradient means.
Why pass it nd.array([10, 1., .1, .01])?
What do I get in x.grad if I don’t pass a head gradient to the backward function? Isn’t it dz/dx?


Hi @vermicelli,

I think backward should be applied to y here, not z; that would make more sense to me.

The example should then show a case where you calculate dz/dy manually (possibly without even using MXNet) and still use autograd for dy/dx, so that dz/dx ends up in x.grad, as you pointed out.

Something like this example:

import mxnet as mx

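x = mx.nd.array([0.,1.,2.,3.])
# allocate storage for the gradient of x
x.attach_grad()

with mx.autograd.record():
    y = x * 2

# dy/dz calculated outside of autograd
dydz = mx.nd.array([10, 1., .1, .01])
# pass it to backward as the head gradient
y.backward(dydz)
# thus calculating dz/dx, even though z itself was never recorded by autograd
print(x.grad)

[20.    2.    0.2   0.02]
<NDArray 4 @cpu(0)>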

@mli @smolix please confirm? Quite a complex example for an intro. Are there many use cases of this you’ve seen in the wild?


Thank you for your reply. This makes sense to me. But I think the ‘dy/dz’ in the comment # dy/dz calculated outside of autograd should be ‘dz/dy’. My understanding of your example is that you let MXNet autograd compute dy/dx, which is 2, and tell autograd that you already have the dz/dy part, computed manually as [10, 1., .1, .01]. Autograd then stores dz/dy * dy/dx in x.grad as the final result. Am I right?

So the “head gradient” here just means the gradient of whatever part of the calculation chain is not recorded by autograd.
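
To check that reading without MXNet, here is a minimal pure-Python sketch of the backward step for y = x * 2. The helper name backward_through_double is made up for illustration and is not an MXNet function. It multiplies the head gradient elementwise by the local derivative dy/dx = 2. As far as I know, MXNet falls back to a head gradient of ones when none is passed, which would leave plain dy/dx in x.grad; the second call mimics that case.

```python
def backward_through_double(head_grad):
    """Gradient w.r.t. x for y = x * 2, given head_grad = dz/dy.

    Hypothetical helper for illustration only; not part of MXNet.
    """
    dy_dx = 2.0  # local derivative of y = x * 2, identical for every element
    return [g * dy_dx for g in head_grad]


# dz/dy supplied by hand, as in the example above
dz_dy = [10.0, 1.0, 0.1, 0.01]
grad_with_head = backward_through_double(dz_dy)

# A head gradient of ones leaves just dy/dx
grad_default = backward_through_double([1.0, 1.0, 1.0, 1.0])

print(grad_with_head)
print(grad_default)
```

The first list corresponds to dz/dx = dz/dy * dy/dx, the second to dy/dx alone.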


@vermicelli, I think you are correct. The last example here implies that head_gradient is calculated outside of autograd. I think the example implies that head_gradient is actually the gradient of some other function w(z) that is not shown, i.e. head_gradient is dw/dz. I would put that into a comment in the code to clarify this a bit more. Other than that, I think your understanding is correct.
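
A rough pure-Python sketch of the w(z) idea, under stated assumptions: the recorded part is taken to be y = x * 2 followed by z = y * x, and the missing function is assumed to be a weighted sum w(z) = sum(c * z), chosen only because its gradient dw/dz is exactly the vector c = [10, 1., .1, .01]. Neither choice comes from the thread; both are illustrative. The chain-rule result is cross-checked against finite differences on the full composition w(z(x)).

```python
def z_of_x(x):
    # Assumed recorded computation: y = x * 2, then z = y * x (elementwise)
    return [(xi * 2.0) * xi for xi in x]


def w_of_z(z, weights):
    # Assumed unrecorded function w(z): a weighted sum, so dw/dz = weights
    return sum(c * zi for c, zi in zip(weights, z))


x = [0.0, 1.0, 2.0, 3.0]
head_gradient = [10.0, 1.0, 0.1, 0.01]  # dw/dz for the assumed w above

# Chain rule: dw/dx = dw/dz * dz/dx, where dz/dx = 4 * x for z = 2 * x**2
dw_dx = [g * 4.0 * xi for g, xi in zip(head_gradient, x)]

# Central finite differences on w(z(x)) as an independent check
eps = 1e-6
numeric = []
for i in range(len(x)):
    x_hi = list(x)
    x_lo = list(x)
    x_hi[i] += eps
    x_lo[i] -= eps
    diff = w_of_z(z_of_x(x_hi), head_gradient) - w_of_z(z_of_x(x_lo), head_gradient)
    numeric.append(diff / (2 * eps))

print(dw_dx)
print(numeric)
```

Both lists should agree up to rounding, which is the sense in which passing dw/dz as the head gradient yields dw/dx for the whole chain.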