MXNet Forum

Batchnorm gradient


#1

I have a network consisting of only a BatchNorm layer. The gradient I get for batchnorm0_gamma after running a backward pass is different from the one I computed manually. I detailed my work in this notebook:
https://colab.research.google.com/github/x110/DLToolboxImg/blob/master/BatchNormMxnet.ipynb

Please advise.

import mxnet as mx
import numpy as np

X = mx.nd.array([[0.18527887], [-1.23678724]])
Y = mx.nd.array([[2.57767984], [-1.55019435]])

# define the network: a single BatchNorm layer with an MSE head
source = mx.sym.Variable("data")
target = mx.sym.Variable("softmax_label")
network = mx.sym.BatchNorm(source)
network = mx.sym.LinearRegressionOutput(network, target)

input_shapes = {'data': (2, 1), 'softmax_label': (2, 1)}
exe = network.simple_bind(ctx=mx.cpu(), **input_shapes)
arg_arrays = dict(zip(network.list_arguments(), exe.arg_arrays))
x = arg_arrays['data']
t = arg_arrays['softmax_label']

# forward pass
x[:] = X
t[:] = Y
y = exe.forward(is_train=True)

# backward pass
exe.backward()
exe.grad_dict['batchnorm0_beta'], exe.grad_dict['batchnorm0_gamma']

The output I get is:
( [-1.0274856] <NDArray 1 @cpu(0)>, [0.] <NDArray 1 @cpu(0)>)

When I calculate the gradients manually, the output I get is:

xi = x.asnumpy()
a = np.mean(xi)
b = np.var(xi)
xn = (xi - a) / np.sqrt(b + 1e-5)  # normalized input
beta, alpha = exe.arg_dict['batchnorm0_beta'].asnumpy(), exe.arg_dict['batchnorm0_gamma'].asnumpy()
ynorm = alpha * xn + beta
# backward pass, manually
2 * np.mean(ynorm - t.asnumpy()), 2 * np.mean((ynorm - t.asnumpy()) * xn)

(-1.0274856090545654, -2.127872943878174)

The first gradient is the same, but the second is not.
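For reference, both manual expressions come from the chain rule, assuming the effective loss is the batch mean squared error, with $y_i = \gamma \hat{x}_i + \beta$ and $\hat{x}_i$ the normalized input:

\[
L = \frac{1}{N}\sum_i (y_i - t_i)^2, \qquad
\frac{\partial L}{\partial \beta} = \frac{2}{N}\sum_i (y_i - t_i), \qquad
\frac{\partial L}{\partial \gamma} = \frac{2}{N}\sum_i (y_i - t_i)\,\hat{x}_i
\]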


#2

Hi,

The gradients differ because the BatchNorm operator's fix_gamma parameter defaults to True, which keeps gamma fixed at 1 and zeroes its gradient. You need to pass fix_gamma=False to make gamma learnable. See https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.BatchNorm for more info.

Changing your code slightly to include that (and using eps=1e-3 in the manual check, to match BatchNorm's default epsilon of 0.001) gives matching answers:

import mxnet as mx
import numpy as np

X = mx.nd.array([[0.18527887], [-1.23678724]])
Y = mx.nd.array([[2.57767984], [-1.55019435]])

# define the network: BatchNorm with a learnable gamma, plus an MSE head
source = mx.sym.Variable("data")
target = mx.sym.Variable("softmax_label")
network = mx.sym.BatchNorm(source, fix_gamma=False)
network = mx.sym.LinearRegressionOutput(network, target)

input_shapes = {'data': (2, 1), 'softmax_label': (2, 1)}
exe = network.simple_bind(ctx=mx.cpu(), **input_shapes)
arg_arrays = dict(zip(network.list_arguments(), exe.arg_arrays))
x = arg_arrays['data']
t = arg_arrays['softmax_label']

# forward pass
x[:] = X
t[:] = Y
y = exe.forward(is_train=True)

# backward pass
exe.backward()
print(exe.grad_dict['batchnorm0_beta'], exe.grad_dict['batchnorm0_gamma'])

# the same gradients, computed manually
xi = X.asnumpy()
a = np.mean(xi)
b = np.var(xi)
xn = (xi - a) / np.sqrt(b + 1e-3)  # eps matches BatchNorm's default of 0.001
beta, alpha = exe.arg_dict['batchnorm0_beta'].asnumpy(), exe.arg_dict['batchnorm0_gamma'].asnumpy()
ynorm = alpha * xn + beta
print(2 * np.mean(ynorm - t.asnumpy()), 2 * np.mean((ynorm - t.asnumpy()) * xn))

This prints:

( [-1.0274855] <NDArray 1 @cpu(0)>, [-4.123798] <NDArray 1 @cpu(0)>)
(-1.0274854898452759, -4.12379789352417)
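As an extra sanity check, you can approximate the gamma gradient numerically with a central finite difference on the same executor. This is a minimal sketch, not part of the script above; it assumes the effective loss is 0.5 * sum((y - t)^2), i.e. that LinearRegressionOutput backpropagates (y - t), which is consistent with the beta gradient printed above. loss_at is a hypothetical helper name:

def loss_at(gamma_value):
    # set gamma, rerun the forward pass in training mode, and evaluate
    # 0.5 * sum((y - t)^2) from the executor's output (the prediction)
    exe.arg_dict['batchnorm0_gamma'][:] = gamma_value
    out = exe.forward(is_train=True)[0]
    return 0.5 * ((out - t) ** 2).sum().asscalar()

h = 1e-3
g0 = exe.arg_dict['batchnorm0_gamma'].asscalar()
num_grad = (loss_at(g0 + h) - loss_at(g0 - h)) / (2 * h)
exe.arg_dict['batchnorm0_gamma'][:] = g0  # restore the original gamma
print(num_grad)  # should be close to the analytic value of about -4.1238

Since the loss is quadratic in gamma, the central difference should agree with the analytic gradient up to floating-point error.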