The example in the Python API documentation shows an implementation of sigmoid that supports autograd.
```python
class sigmoid(Function):
    def forward(self, x):
        y = 1 / (1 + mx.nd.exp(-x))
        self.save_for_backward(y)
        return y

    def backward(self, dy):
        # backward takes as many inputs as forward's return value,
        # and returns as many NDArrays as forward's arguments.
        y, = self.saved_tensors
        return y * (1 - y)
```
I think the backward method must return dy * y * (1 - y) instead of y * (1 - y), shouldn't it?
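For reference, here is what the corrected backward looks like with the dy factor included. This is a minimal, framework-agnostic sketch in NumPy (the `Sigmoid` class here is a stand-in for an autograd Function, not MXNet's actual API):

```python
import numpy as np

class Sigmoid:
    """Toy stand-in for an autograd Function (not MXNet's API)."""
    def forward(self, x):
        y = 1.0 / (1.0 + np.exp(-x))
        self.saved = y               # save the output for the backward pass
        return y

    def backward(self, dy):
        y = self.saved
        return dy * y * (1 - y)      # scale the local derivative by the incoming dy

f = Sigmoid()
y = f.forward(np.array([0.0]))
print(f.backward(np.array([1.0])))   # sigmoid'(0) = 0.25
```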
@feevos It is because of the chain rule. For example, the following code produces a wrong result without the dy factor.
```python
x = mx.nd.array([1, 2, 3])
x.attach_grad()
f = sigmoid()
with mx.autograd.record():
    y = f(x)
    z = y * y
z.backward()
print(x.grad)  # [0.28746966 0.18495613 0.08606823] is the right result.
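Those numbers can be checked by hand: for z = sigmoid(x)^2 the chain rule gives dz/dx = 2y * y(1 - y) with y = sigmoid(x). A quick sanity check in plain NumPy (used here only to reproduce the values, this is not MXNet code):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 1.0 / (1.0 + np.exp(-x))   # sigmoid(x)
dy = 2.0 * y                   # derivative of z = y * y with respect to y
grad = dy * y * (1 - y)        # chain rule: dz/dx
print(grad)                    # matches x.grad above (up to float precision)
```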
I opened the issue #9872 for this example in the doc.
Thank you @dotelos, I verified the calculation you propose. Wow! So there is a significant difference between the pen-and-paper derivative and how it is computed in software. In addition, I found that when one calls y.backward(), the value of dy defaults to [1., 1., 1.].
Do you have any good tutorial/reference to recommend (in any deep learning framework) that explicitly describes the differences between the theoretical functional forms of derivatives and how they are implemented inside a software library using a computational graph? I am a bit confused as to what exactly the variable dy represents in the definition of the backward function.
@feevos I’m not aware of any reference. However, basically it is just the chain rule.
The point is that what f.backward(dy) actually calculates is not the derivative of f itself. It is the derivative of some unknown function that f is composed into. In the case above, f.backward(dy) must calculate the derivative of f(x)^2, not of f alone. The implementation of f.backward does not know what the final function will be; however, in any case it reduces to the derivative of g(f(x)), where g is an unknown function defined at runtime (x^2 in the case above, or possibly some complex composition of functions). By the chain rule, that derivative is g'(f(x)) f'(x). The autograd module calculates dy = g'(f(x)) and passes it to f.backward, and the implementation of backward returns g'(f(x)) f'(x) = dy f'(x). It can do this because it knows its own derivative f'(x) and is given dy.

How does the autograd module calculate dy? By recursion: the backward of every operator, including * in the example above, takes a dy. So, for example, let z(x) = f(g(h(x))). Then the autograd module does the following when z.backward() is called.
```python
dy = f.backward(1)
dy = g.backward(dy)
dy = h.backward(dy)
x.grad = dy
```
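The recursion above can be sketched in plain Python: each step multiplies the incoming dy by the operator's own local derivative. This is a toy illustration of the chain-rule bookkeeping, not MXNet internals; all names here are made up:

```python
import math

# Each helper returns (value, local derivative) -- a toy stand-in
# for an operator's forward pass plus the f'(x) its backward knows.
def exp_op(x):
    y = math.exp(x)
    return y, y

def sigmoid_op(x):
    y = 1.0 / (1.0 + math.exp(-x))
    return y, y * (1 - y)

def square_op(x):
    return x * x, 2 * x

# z(x) = f(g(h(x))) with h = exp, g = sigmoid, f = square
x = 0.5
h, dh = exp_op(x)
g, dg = sigmoid_op(h)
f, df = square_op(g)

dy = 1.0       # seed gradient, like the default head gradient of ones
dy = dy * df   # f.backward(dy)
dy = dy * dg   # g.backward(dy)
dy = dy * dh   # h.backward(dy)
grad = dy      # this is what would land in x.grad
print(grad)
```

A finite-difference check on z(x) confirms that threading dy through each backward step reproduces the full derivative.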
I’m not sure that this is explained well. Anyway, the point is that, for any function f, what f.backward calculates is the derivative of g(f(x)), not of f(x) itself. Then, whatever g is, the result is g'(f(x)) f'(x), and g'(f(x)) is calculated by a recursive application of the same rule. The essence is the same for multivariable functions, but we need vectors and matrices instead of just numbers.