I’m pretty sure this is right. Note that you’re taking the gradient of **all** softmax outputs. In an autograd system, taking the gradient of multiple outputs at once is equivalent to summing all the outputs and taking the gradient of that sum.

What’s going on is that if you look at the gradients from each individual output and sum them up, they sum to zero. The intuitive explanation is that because softmax ensures all *outputs* sum to one, the gradients from the individual outputs cancel each other out. If you write out the math of the softmax gradient, you can make a more convincing argument for that. However, if you simply want to test it, try looking at the gradient from each output.
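To sketch the math: the softmax Jacobian is `d s_i / d a_j = s_i * (delta_ij - s_j)`, and summing it over the output index `i` gives `s_j - s_j * sum_i(s_i) = s_j - s_j = 0` for every input `j`. Here’s a quick plain-NumPy check of that (not MXNet, just to verify the algebra):

```python
import numpy as np

# Softmax of a small input vector.
a = np.array([0.1, 2.0])
s = np.exp(a) / np.exp(a).sum()

# Jacobian of softmax: d s_i / d a_j = s_i * (delta_ij - s_j),
# i.e. diag(s) - outer(s, s).
jacobian = np.diag(s) - np.outer(s, s)

# Summing over the output index i gives zero for every input j,
# which is exactly why the per-output gradients cancel.
print(jacobian.sum(axis=0))  # ~[0, 0]
```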

Here’s a simplified example:

```python
import mxnet as mx

a = mx.nd.array([[0.1, 2]])
a.attach_grad()

# Gradient of the first softmax output w.r.t. the input
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_0 = sm[0, 0]
sm_0.backward()
grad_0 = a.grad.copy()

# Gradient of the second softmax output w.r.t. the input
with mx.autograd.record():
    sm = mx.nd.softmax(a, axis=1)
    sm_1 = sm[0, 1]
sm_1.backward()
grad_1 = a.grad.copy()

print(grad_0)
print(grad_1)
print(grad_0 + grad_1)
```

What you get is:

```
[[ 0.11318026 -0.11318026]]
<NDArray 1x2 @cpu(0)>
[[-0.11318026 0.11318021]]
<NDArray 1x2 @cpu(0)>
[[-7.4505806e-09 -5.2154064e-08]]
<NDArray 1x2 @cpu(0)>
```

Notice that they cancel each other out (within numerical precision).
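You can also verify the claim from the first paragraph without autograd at all: differentiate the *sum* of all softmax outputs by finite differences. Since that sum is identically 1, its gradient is zero. A plain NumPy sketch (central differences, step size is my choice):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # shift for numerical stability
    return e / e.sum()

a = np.array([0.1, 2.0])
eps = 1e-6

# Central finite differences of sum(softmax(a)) w.r.t. each input.
# Backprop through all outputs at once computes this same gradient.
grad = np.array([
    (softmax(a + eps * np.eye(2)[j]).sum()
     - softmax(a - eps * np.eye(2)[j]).sum()) / (2 * eps)
    for j in range(2)
])
print(grad)  # ~[0, 0]
```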