@feevos I’m not aware of any reference, but basically it is just the chain rule.

The point is that what `f.backward(dy)` actually calculates is not the derivative of `f`. It is the derivative of some unknown function that `f` is composed into. In the above case, `f.backward(dy)` must calculate the derivative of `f(x)^2` instead of that of `f` itself. The implementation of `f.backward` does not know what the final function will be. In any case, however, the problem reduces to the derivative of `g(f(x))`, where `g` is the unknown function defined at runtime (`x^2` in the above case, or it could be some complex composition of functions). By the chain rule, that derivative is `g'(f(x)) f'(x)`. The autograd module calculates `dy = g'(f(x))` and gives it to `f.backward`, and the implementation of `backward` returns `g'(f(x)) f'(x)` = `dy f'(x)`. The implementation can calculate this because it knows its own derivative `f'(x)` and is given `dy`.

How does the autograd module calculate `dy`? It is just a recursion. The implementation of `backward` of every operator, including `*` in the above case, takes `dy`. So, for example, let `z(x) = f(g(h(x)))`. Then the autograd module does the following when `z.backward()` is called:

```
dy = f.backward(1)   # 1 * f'(g(h(x)))
dy = g.backward(dy)  # f'(g(h(x))) * g'(h(x))
dy = h.backward(dy)  # f'(g(h(x))) * g'(h(x)) * h'(x) = z'(x)
x.grad = dy
```
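In case it helps, here is a minimal sketch of that recursion in plain Python. It is not how any real autograd module is implemented. The `Op` class, the concrete choices of `f`, `g`, `h`, and the helper `z_and_grad` are all assumptions made up for illustration. Each operator only knows its own derivative and multiplies it by the `dy` it receives.

```python
import math

class Op:
    """Hypothetical scalar operator (illustration only, not a real library API)."""

    def __init__(self, fn, dfn):
        self.fn = fn    # the function itself
        self.dfn = dfn  # its own derivative, the only thing the operator knows
        self.x = None

    def forward(self, x):
        self.x = x      # remember the input for the backward pass
        return self.fn(x)

    def backward(self, dy):
        # dy is the derivative of everything applied after this operator;
        # the chain rule gives dy * f'(x).
        return dy * self.dfn(self.x)

# Assumed example functions: z(x) = exp(sin(x^2))
h = Op(lambda x: x * x, lambda x: 2 * x)
g = Op(math.sin, math.cos)
f = Op(math.exp, math.exp)

def z_and_grad(x):
    """Evaluate z(x) = f(g(h(x))) and its derivative with the recursion above."""
    y = f.forward(g.forward(h.forward(x)))
    dy = f.backward(1.0)  # the outermost operator is seeded with 1
    dy = g.backward(dy)
    dy = h.backward(dy)
    return y, dy
```

Calling `z_and_grad(0.5)` should return a derivative that agrees with the analytic expression `exp(sin(x^2)) * cos(x^2) * 2x` at `x = 0.5`.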

I’m not sure this explains it well. Anyway, the point is that, for any function `f`, what `f.backward` calculates is the derivative of `g(f(x))`, not of `f(x)` itself. Whatever `g` is, the result is `g'(f(x)) f'(x)`, and `g'(f(x))` is calculated by a recursive application of the same rule. The essence is the same for multivariable functions, but we need vectors and matrices instead of just numbers.
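For the multivariable case, a rough sketch under the same caveats: `dy` becomes a vector and each `backward` multiplies it by the transpose of the operator's own Jacobian. The classes `Linear` and `Square` and the helper `grad_sum_of_squares` are hypothetical names for this example only, using plain lists instead of a tensor library.

```python
class Linear:
    """Hypothetical operator y = A x for a fixed matrix A (a list of rows)."""

    def __init__(self, A):
        self.A = A

    def forward(self, x):
        return [sum(a * xi for a, xi in zip(row, x)) for row in self.A]

    def backward(self, dy):
        # The Jacobian of A x is A, so the incoming vector dy is mapped to A^T dy.
        rows, cols = len(self.A), len(self.A[0])
        return [sum(self.A[i][j] * dy[i] for i in range(rows)) for j in range(cols)]

class Square:
    """Hypothetical elementwise square; its Jacobian is diag(2 x)."""

    def forward(self, x):
        self.x = x
        return [xi * xi for xi in x]

    def backward(self, dy):
        return [d * 2 * xi for d, xi in zip(dy, self.x)]

def grad_sum_of_squares(A, x):
    """Gradient of z(x) = sum((A x)^2) with respect to x."""
    lin, sq = Linear(A), Square()
    out = sq.forward(lin.forward(x))
    dy = [1.0] * len(out)  # derivative of the final sum w.r.t. each element
    dy = sq.backward(dy)
    dy = lin.backward(dy)
    return dy

print(grad_sum_of_squares([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))  # [48.0, 68.0]
```

The printed result matches the closed form `2 A^T A x` for this input, so the only change from the scalar case is that the product `dy f'(x)` becomes a matrix-vector product.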