Automatic Differentiation


Can’t understand the meaning of head gradient.
Why give it a nd.array([10, 1., .1, .01])?
What do I get in x.grad if I don’t pass the head gradient to the backward function? Isn’t it dz/dx?

Hi @vermicelli,

I think backward should be applied to y here not z, that would make more sense to me.

And then the example should show a case where you could calculate dz/dy manually (possibly even not using mxnet), and still be able to use autograd for dy/dx to calculate dz/dx which is stored in x.grad as you pointed out.

Something like this example:

import mxnet as mx

x = mx.nd.array([0.,1.,2.,3.])

with mx.autograd.record():
    y = x * 2

# dy/dz calculated outside of autograd
dydz = mx.nd.array([10, 1., .1, .01])
# thus calculating dz/dx, even though dz/dx was outside of autograd
[20.    2.    0.2   0.02]
<NDArray 4 @cpu(0)>

@mli @smolix please confirm? Quite a complex example for an intro. Are there many use cases of this you’ve seen in the wild?

Thank you for your reply. This makes sense to me. But I think the ‘dy/dz’ in the comment # dy/dz calculated outside of autograd should be ‘dz/dy’. My understanding of your example is that you let the MXNet do the autograd on dy/dx which should be 2, and told autograd you already have the dz/dy part manually which is [10, 1., .1, .01]. Then autograd store the dz/dy * dy/dx in x.grad as the final result. Am I right?

So the “head gradient” here just means the gradient of some calculation chains which don’t get recorded by autograd.

@vermicelli, I think you are correct. The last example here implies that head_gradient is calculated outside of autograd. I think example implies that this head_gradient is actualy gradient of some other function w(z) that is missing. That head_gradient is actually dw/dz. I would put that into comments block in the code just to clarify this piece a bit more. Other than that I think your understanding is correct.


Thank you for this book. This is my first step in DL, and even though I manage to understand, some exercises are really interesting but out of my reach. The 2nd bidder is one of them, even though I’d love to know the answer. Are you providing the solution somewhere, please?

@thomelane @vermicelli I rewrote this chapter, hope it’s clearer now. Preview at

Hi @mli,
In the section Head Gradients, why do we calculate x.grad and y.grad after z.backward()? I guess it should be v.grad. This v.grad will be pass as input while we find du/dx. Correct me if I am wrong.