Derivative of Softmax


#1

Hi,

I do not understand the output of this short code:

import mxnet as mx
from mxnet import nd

a = nd.array([[[[0.1]], [[2]]]])
a.shape
(1, 2, 1, 1)
a.attach_grad()
with mx.autograd.record():
    a_rslt = nd.softmax(a, axis=1)

a_rslt.backward()
a_rslt

[[[[0.13010848]]

[[0.8698916 ]]]]
<NDArray 1x2x1x1 @cpu(0)>
a.grad

[[[[0.]]

[[0.]]]]
<NDArray 1x2x1x1 @cpu(0)>

Why is the gradient of softmax with respect to the input a equal to 0? Intuitively, I read that as “no matter how I change a, it has no impact on the output of softmax.” What is wrong with this interpretation of the code output? I assume that MXNet is computing the gradients correctly.

Thanks in advance for any reply


#2

I’m pretty sure this output is correct. Note that you’re calling backward() on the whole softmax output rather than on a single element. In an autograd system, taking the gradient of multiple outputs at once is equivalent to adding all the output elements together and taking the gradient of that sum.

What’s going on is that if you compute the gradient of each individual output element and add them up, they sum to zero. The intuitive explanation is that softmax forces the outputs to sum to one, so the sum of the outputs is a constant and its gradient with respect to the input is zero: the gradients from the individual outputs cancel each other out. If you write out the math of the softmax gradient, you can make a more convincing argument for that. However, if you simply want to test it, try looking at the gradient from each output separately.
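For reference, here is one way to write out that math. The notation is mine and not from the code above: x is the input vector along the softmax axis and s_i is the i-th softmax output.

```latex
s_i = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad
\frac{\partial s_i}{\partial x_j} = s_i\,(\delta_{ij} - s_j)

\frac{\partial}{\partial x_j} \sum_i s_i
  = \sum_i s_i\,(\delta_{ij} - s_j)
  = s_j - s_j \sum_i s_i
  = s_j - s_j
  = 0
```

Here \delta_{ij} is 1 when i = j and 0 otherwise. The last step uses \sum_i s_i = 1, which is exactly the "outputs sum to one" property.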

Here’s a simplified example:

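import mxnet as mx

a = mx.nd.array([[0.1, 2]])
a.attach_grad()

# Gradient of the first softmax output only
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_0 = sm[(0, 0)]
sm_0.backward()
grad_0 = a.grad.copy()  # copy, because the next backward() overwrites a.grad

# Gradient of the second softmax output only
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_1 = sm[(0, 1)]
sm_1.backward()
grad_1 = a.grad.copy()

print(grad_0)
print(grad_1)
print(grad_0 + grad_1)  # what backward() on the full output would give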

What you get is:

[[ 0.11318026 -0.11318026]]
<NDArray 1x2 @cpu(0)>

[[-0.11318026  0.11318021]]
<NDArray 1x2 @cpu(0)>

[[-7.4505806e-09 -5.2154064e-08]]
<NDArray 1x2 @cpu(0)>

Notice that they cancel each other out (within numerical precision).
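If you want to cross-check these numbers without MXNet, here is a minimal sketch in plain Python. It assumes the standard closed form for the softmax Jacobian, s_i * (delta_ij - s_j); the helper names softmax and softmax_jacobian are made up for this illustration and are not MXNet functions.

```python
import math

def softmax(x):
    """Numerically stable softmax over a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(x):
    """Return J with J[i][j] = d softmax(x)[i] / d x[j] = s_i * (delta_ij - s_j)."""
    s = softmax(x)
    n = len(s)
    return [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
            for i in range(n)]

x = [0.1, 2.0]
J = softmax_jacobian(x)

# Row i is the gradient of output i with respect to the input,
# i.e. the analogue of grad_0 and grad_1 in the MXNet example.
grad_0, grad_1 = J[0], J[1]

# Summing the rows gives the gradient of the sum of all outputs.
grad_sum = [g0 + g1 for g0, g1 in zip(grad_0, grad_1)]

print(grad_0)
print(grad_1)
print(grad_sum)
```

The two rows should agree with the MXNet gradients above to about float32 precision, and their sum should be zero up to rounding error.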