Implementation of sigmoid extending mx.autograd.Function


#1

The example in the python API document shows an implementation of sigmoid which supports autograd.

class sigmoid(Function):
    def forward(self, x):
        y = 1 / (1 + mx.nd.exp(-x))
        self.save_for_backward(y)
        return y

    def backward(self, dy):
        # backward takes as many inputs as forward's return value,
        # and returns as many NDArrays as forward's arguments.
        y, = self.saved_tensors
        return y * (1-y)

I think the backward method must return dy * y * (1-y) instead of y * (1-y), shouldn't it?
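To check my claim numerically, here is a small NumPy stand-in for the doc's class (the class and names are mine, not the mxnet API): with the dy factor included, backward matches a numerical derivative of g(sigmoid(x)) for g(y) = y*y.

```python
import numpy as np

class Sigmoid:
    """NumPy stand-in for the doc's sigmoid Function (illustration only)."""
    def forward(self, x):
        y = 1 / (1 + np.exp(-x))
        self.saved = y               # mimics save_for_backward
        return y

    def backward(self, dy):
        y = self.saved
        return dy * y * (1 - y)      # dy carries the upstream gradient

# Chain-rule check: d/dx of sigmoid(x)^2, upstream gradient of y*y is 2*y
f = Sigmoid()
x = np.array([1.0, 2.0, 3.0])
y = f.forward(x)
grad = f.backward(2 * y)

# Numerical derivative of sigmoid(x)^2 for comparison
eps = 1e-6
num = ((1 / (1 + np.exp(-(x + eps)))) ** 2
       - (1 / (1 + np.exp(-(x - eps)))) ** 2) / (2 * eps)
print(np.allclose(grad, num, atol=1e-5))  # True
```

Without the dy factor, backward would return y * (1 - y) regardless of what y feeds into, which cannot be right for a composed function.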


#2

Yes. I think it should include the dy factor. Thank you for pointing this out. Would you like to open a PR to fix it: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L375


#3

Why would one add a dy term? The derivative of sigmoid w.r.t. x is correctly calculated to be y * (1-y), where y = y(x).


#4

@feevos It is because of the chain rule. For example, the following code produces a wrong result without the dy factor.

x = mx.nd.array([1,2,3])
x.attach_grad()
f = sigmoid()
with mx.autograd.record():
    y = f(x)
    z = y * y
z.backward()
print(x.grad) #[0.28746966 0.18495613 0.08606823] is the right result.

I opened issue #9872 for this example in the docs.
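For anyone who wants to check the quoted numbers without mxnet: the analytic chain-rule value 2*y * y*(1-y) (upstream gradient of y*y times the sigmoid derivative) reproduces them in plain NumPy. This is my own sketch, not mxnet code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 1 / (1 + np.exp(-x))        # sigmoid(x)
dz_dx = 2 * y * y * (1 - y)     # d(y*y)/dx = (2*y) * sigmoid'(x)

# Matches the gradient printed by the mxnet example above
print(np.allclose(dz_dx, [0.28746966, 0.18495613, 0.08606823], atol=1e-6))  # True
```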


#5

Thank you @dotelos, I verified the calculation you propose. Wow! So there is a significant difference between the pen-and-paper derivative and how it is computed inside the library. I also found that when one calls y.backward(), the value of dy defaults to [1., 1., 1.].

Do you have a good tutorial or reference to recommend (in any deep learning framework) that explicitly describes the difference between the theoretical form of a derivative and how it is implemented inside a software library using a computational graph? I am a bit confused about what exactly the variable dy represents in the definition of the backward function.


#6

@feevos I’m not aware of any reference. However, it is basically just the chain rule.

The point is that what f.backward(dy) actually calculates is not the derivative of f itself. It is the derivative of some unknown function that f is composed into. In the example above, f.backward(dy) must produce the derivative of f(x)^2, not of f alone.

The implementation of f.backward does not know what the final function will be. In any case, though, it reduces to the derivative of g(f(x)), where g is an unknown function defined at runtime (x^2 above, or some more complex composition of functions). By the chain rule, that derivative is g'(f(x)) f'(x). The autograd module calculates dy = g'(f(x)) and passes it to f.backward, whose implementation returns g'(f(x)) f'(x) = dy f'(x). The implementation can compute this because it knows its own derivative f'(x) and is given dy.

How does the autograd module calculate dy? By recursion: the backward implementation of every operator, including * in the example above, takes a dy. So, for example, let z(x) = f(g(h(x))). Then the autograd module does the following when z.backward() is called.

dy = f.backward(1)
dy = g.backward(dy)
dy = h.backward(dy)
x.grad = dy

I’m not sure I have explained it well. Anyway, the point is that, for any function f, what f.backward calculates is the derivative of g(f(x)), not of f(x) itself. Whatever g turns out to be, the result is g'(f(x)) f'(x), and g'(f(x)) is calculated by a recursive application of the same rule. The essence is the same for multivariable functions, but then we need vectors and matrices instead of plain numbers.
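As a concrete toy illustration of that recursion (my own sketch, not the mxnet internals): take z(x) = f(g(h(x))) with f(u) = u^2, g(u) = sin(u), h(u) = exp(u), and run the backward pass exactly as in the three-line chain above, seeding the outermost operator with 1.

```python
import numpy as np

class Op:
    """Toy operator: forward caches its input, backward multiplies dy by its local derivative."""
    def __init__(self, fn, dfn):
        self.fn, self.dfn = fn, dfn

    def forward(self, x):
        self.x = x
        return self.fn(x)

    def backward(self, dy):
        # dy is the gradient accumulated so far; multiply by this op's own derivative
        return dy * self.dfn(self.x)

h = Op(np.exp, np.exp)                     # h(x) = e^x
g = Op(np.sin, np.cos)                     # g(u) = sin(u)
f = Op(lambda u: u * u, lambda u: 2 * u)   # f(u) = u^2

x = 0.3
z = f.forward(g.forward(h.forward(x)))     # z = f(g(h(x))) = sin(e^x)^2

# The recursion from the post, outermost operator first, seeded with 1:
dy = f.backward(1.0)
dy = g.backward(dy)
x_grad = h.backward(dy)

# Compare with a numerical derivative of z(x)
eps = 1e-6
zfun = lambda t: np.sin(np.exp(t)) ** 2
num = (zfun(x + eps) - zfun(x - eps)) / (2 * eps)
print(np.isclose(x_grad, num, atol=1e-6))  # True
```

The seed of 1.0 for the outermost operator is also why, in the earlier observation, calling y.backward() directly makes dy default to an array of ones.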