Implementation of sigmoid extending mx.autograd.Function


#1

The example in the python API document shows an implementation of sigmoid which supports autograd.

class sigmoid(Function):
    def forward(self, x):
        y = 1 / (1 + mx.nd.exp(-x))
        self.save_for_backward(y)
        return y

    def backward(self, dy):
        # backward takes as many inputs as forward's return value,
        # and returns as many NDArrays as forward's arguments.
        y, = self.saved_tensors
        return y * (1-y)

I think the backward method must return dy * y * (1-y) instead of y * (1-y), shouldn't it?
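To check my claim numerically, here is a small NumPy stand-in for the doc's class (the class and names are mine, not the mxnet API): with the dy factor included, backward matches a numerical derivative of g(sigmoid(x)) for g(y) = y*y.

```python
import numpy as np

class Sigmoid:
    """NumPy stand-in for the doc's sigmoid Function (illustration only)."""
    def forward(self, x):
        y = 1 / (1 + np.exp(-x))
        self.saved = y               # mimics save_for_backward
        return y

    def backward(self, dy):
        y = self.saved
        return dy * y * (1 - y)      # dy carries the upstream gradient

# Chain-rule check: d/dx of sigmoid(x)^2, upstream gradient of y*y is 2*y
f = Sigmoid()
x = np.array([1.0, 2.0, 3.0])
y = f.forward(x)
grad = f.backward(2 * y)

# Numerical derivative of sigmoid(x)^2 for comparison
eps = 1e-6
num = ((1 / (1 + np.exp(-(x + eps)))) ** 2
       - (1 / (1 + np.exp(-(x - eps)))) ** 2) / (2 * eps)
print(np.allclose(grad, num, atol=1e-5))  # True
```

Without the dy factor, backward would return y * (1 - y) regardless of what y feeds into, which cannot be right for a composed function.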


#2

Yes. I think it should include the dy factor. Thank you for pointing this out. Would you like to open a PR to fix it: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/autograd.py#L375


#3

Why would one add a dy term? The derivative of sigmoid w.r.t. x is correctly calculated to be y * (1-y), where y = y(x).


#4

@feevos It is because of the chain rule. For example, the following code produces a wrong result without the dy factor.

x = mx.nd.array([1,2,3])
x.attach_grad()
f = sigmoid()
with mx.autograd.record():
    y = f(x)
    z = y * y
z.backward()
print(x.grad) #[0.28746966 0.18495613 0.08606823] is the right result.

I opened issue #9872 for this example in the docs.
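For anyone who wants to check the quoted numbers without mxnet: the analytic chain-rule value 2*y * y*(1-y) (upstream gradient of y*y times the sigmoid derivative) reproduces them in plain NumPy. This is my own sketch, not mxnet code.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 1 / (1 + np.exp(-x))        # sigmoid(x)
dz_dx = 2 * y * y * (1 - y)     # d(y*y)/dx = (2*y) * sigmoid'(x)

# Matches the gradient printed by the mxnet example above
print(np.allclose(dz_dx, [0.28746966, 0.18495613, 0.08606823], atol=1e-6))  # True
```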


#5

Thank you @dotelos, I verified the calculation you propose. Wow! So there is a significant difference between the pen-and-paper derivative and how it is computed inside the library. I also found that when one calls y.backward(), the value of dy defaults to [1., 1., 1.].

Do you have a good tutorial or reference to recommend (in any deep learning framework) that explicitly describes the difference between the theoretical form of a derivative and how it is implemented inside a software library using a computational graph? I am a bit confused about what exactly the variable dy represents in the definition of the backward function.


#6

@feevos I’m not aware of any reference. However, it is basically just the chain rule.

The point is that what f.backward(dy) actually calculates is not the derivative of f itself. It is the derivative of some unknown function that f is composed into. In the example above, f.backward(dy) must produce the derivative of f(x)^2, not of f alone.

The implementation of f.backward does not know what the final function will be. In any case, though, it reduces to the derivative of g(f(x)), where g is an unknown function defined at runtime (x^2 above, or some more complex composition of functions). By the chain rule, that derivative is g'(f(x)) f'(x). The autograd module calculates dy = g'(f(x)) and passes it to f.backward, whose implementation returns g'(f(x)) f'(x) = dy f'(x). The implementation can compute this because it knows its own derivative f'(x) and is given dy.

How does the autograd module calculate dy? By recursion: the backward implementation of every operator, including * in the example above, takes a dy. So, for example, let z(x) = f(g(h(x))). Then the autograd module does the following when z.backward() is called.

dy = f.backward(1)
dy = g.backward(dy)
dy = h.backward(dy)
x.grad = dy

I’m not sure I have explained it well. Anyway, the point is that, for any function f, what f.backward calculates is the derivative of g(f(x)), not of f(x) itself. Whatever g turns out to be, the result is g'(f(x)) f'(x), and g'(f(x)) is calculated by a recursive application of the same rule. The essence is the same for multivariable functions, but then we need vectors and matrices instead of plain numbers.
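As a concrete toy illustration of that recursion (my own sketch, not the mxnet internals): take z(x) = f(g(h(x))) with f(u) = u^2, g(u) = sin(u), h(u) = exp(u), and run the backward pass exactly as in the three-line chain above, seeding the outermost operator with 1.

```python
import numpy as np

class Op:
    """Toy operator: forward caches its input, backward multiplies dy by its local derivative."""
    def __init__(self, fn, dfn):
        self.fn, self.dfn = fn, dfn

    def forward(self, x):
        self.x = x
        return self.fn(x)

    def backward(self, dy):
        # dy is the gradient accumulated so far; multiply by this op's own derivative
        return dy * self.dfn(self.x)

h = Op(np.exp, np.exp)                     # h(x) = e^x
g = Op(np.sin, np.cos)                     # g(u) = sin(u)
f = Op(lambda u: u * u, lambda u: 2 * u)   # f(u) = u^2

x = 0.3
z = f.forward(g.forward(h.forward(x)))     # z = f(g(h(x))) = sin(e^x)^2

# The recursion from the post, outermost operator first, seeded with 1:
dy = f.backward(1.0)
dy = g.backward(dy)
x_grad = h.backward(dy)

# Compare with a numerical derivative of z(x)
eps = 1e-6
zfun = lambda t: np.sin(np.exp(t)) ** 2
num = (zfun(x + eps) - zfun(x - eps)) / (2 * eps)
print(np.isclose(x_grad, num, atol=1e-6))  # True
```

The seed of 1.0 for the outermost operator is also why, in the earlier observation, calling y.backward() directly makes dy default to an array of ones.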