I can’t understand the meaning of the head gradient.
Why pass it nd.array([10, 1., .1, .01])?
What do I get in x.grad if I don’t pass the head gradient to the
backward function? Isn’t it dz/dx?
I think backward should be applied to y here, not z; that would make more sense to me.
The example should then show a case where you calculate dz/dy manually (possibly without even using MXNet) and still use autograd for dy/dx to obtain dz/dx, which is stored in x.grad, as you pointed out.
Something like this example:
import mxnet as mx
x = mx.nd.array([0.,1.,2.,3.])
x.attach_grad()
with mx.autograd.record():
    y = x * 2
# dy/dz calculated outside of autograd
dydz = mx.nd.array([10, 1., .1, .01])
y.backward(dydz)
# thus calculating dz/dx, even though dz/dx was outside of autograd
x.grad
[20. 2. 0.2 0.02]
<NDArray 4 @cpu(0)>
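The numbers above can be cross-checked without MXNet. This is only a sketch under an assumption: it treats the head gradient as dz/dy and invents a hypothetical downstream function z_of_y (a weighted sum, not part of the original example) whose gradient with respect to y is exactly [10, 1., .1, .01]. The helper names y_of_x, z_of_y and numeric_grad are made up for illustration. It then compares a finite-difference estimate of dz/dx with the chain-rule product dz/dy * dy/dx.

```python
# Head gradient from the example above, read here as dz/dy (an assumption).
head = [10.0, 1.0, 0.1, 0.01]


def y_of_x(x):
    # The step that autograd records in the example: y = x * 2.
    return [2.0 * xi for xi in x]


def z_of_y(y):
    # Hypothetical downstream scalar function, chosen so that dz/dy == head.
    return sum(h * yi for h, yi in zip(head, y))


def numeric_grad(f, x, eps=1e-6):
    # Central finite differences, one input coordinate at a time.
    grads = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


x = [0.0, 1.0, 2.0, 3.0]

# dz/dx estimated numerically through the whole chain x -> y -> z.
dz_dx = numeric_grad(lambda v: z_of_y(y_of_x(v)), x)

# dz/dx from the chain rule: head gradient times dy/dx, where dy/dx is 2.
chain = [h * 2.0 for h in head]

print(dz_dx)
print(chain)
```

Both lists should agree up to floating-point error, and they match the x.grad values printed above.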
Thank you for your reply. This makes sense to me. But I think the ‘dy/dz’ in the comment
# dy/dz calculated outside of autograd
should be ‘dz/dy’. My understanding of your example is that you let MXNet’s autograd compute dy/dx, which is 2, and tell it that you have already calculated the dz/dy part manually as [10, 1., .1, .01]. Autograd then stores dz/dy * dy/dx in x.grad as the final result. Am I right?
So the “head gradient” here just means the gradient of the part of the calculation chain that is not recorded by autograd.
@vermicelli, I think you are correct. The last example here implies that head_gradient is calculated outside of autograd. I think the example implies that head_gradient is actually the gradient of some other function w(z) that is not shown, i.e. head_gradient is dw/dz. I would put that into a comment block in the code to clarify this a bit more. Other than that, I think your understanding is correct.
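On the earlier question of what ends up in x.grad when no head gradient is passed: as far as I know, MXNet seeds the backward pass with ones in that case, so the result is the plain gradient of the recorded step. The sketch below is a hand-written stand-in for that behaviour in plain Python, not MXNet's implementation; the function backward_through_double and its default of ones are assumptions made for illustration.

```python
def backward_through_double(head_grad=None, n=4):
    """Hand-written backward pass for the recorded step y = 2 * x.

    head_grad stands for the gradient flowing in from whatever comes
    after y. When it is omitted, it is seeded with ones, which is
    assumed here to mirror what MXNet does by default.
    """
    if head_grad is None:
        head_grad = [1.0] * n
    dy_dx = 2.0  # derivative of y = 2 * x with respect to x
    return [g * dy_dx for g in head_grad]


# No head gradient: every entry is just dy/dx.
default_grad = backward_through_double()

# Explicit head gradient: each entry of dy/dx is scaled by it.
weighted_grad = backward_through_double([10.0, 1.0, 0.1, 0.01])

print(default_grad)
print(weighted_grad)
```

Seen this way, the head gradient is the starting value that the chain rule multiplies through the recorded operations, so passing [10, 1., .1, .01] rescales each entry of the default result.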