MXNet error: Backward Norm Operator not implemented on GPU


Hi All,

I am training a Siamese network with triplet loss, following the FaceNet paper. In the paper (as I understand it; please correct me if I am wrong), the embeddings of the anchor, positive and negative are each divided by their L2 norm, so that the resulting d-dimensional vectors lie on the unit hypersphere. As a result, the maximum distance between two vectors is bounded, and a proper margin can be chosen.

I couldn’t find an L2 norm layer in Gluon, so I implemented one. However, when I started training, I hit the following blocker: “operator _backward_norm is not implemented for gpu”. I googled the exact error and couldn’t find an exact hit. There were some partial hits, and the suggested solution was to re-compile MXNet properly. That doesn’t work for me, as I am using the Amazon Deep Learning AMI for Ubuntu (version 12). As a workaround I can train on CPU, but that is more than 10 times slower than on GPU. I have written the following reproducible code:

from mxnet import autograd
from mxnet import gluon, nd
import mxnet as mx
class L2_Normalize(gluon.HybridBlock):
    def __init__(self, eps=1e-05, **kwargs):
        super(L2_Normalize, self).__init__(**kwargs)
        self.eps = eps
    def hybrid_forward(self, F, x):
        l2_norm = F.reshape(F.norm(x, axis=1), (-1, 1)) + self.eps
        return F.broadcast_div(x, l2_norm)
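net = gluon.nn.HybridSequential()
with net.name_scope():
    # The layers inside this scope were lost from the post; Dense(128) is a placeholder assumption.
    net.add(gluon.nn.Dense(128))
    net.add(L2_Normalize())
net.collect_params().initialize(mx.init.Xavier(), ctx = mx.gpu())
dataset = gluon.data.ArrayDataset(nd.random.normal(shape=(100, 512)), nd.random.normal(shape=(100, 512)), nd.random.normal(shape=(100, 512)))
dataloader = gluon.data.DataLoader(dataset, batch_size=50, num_workers=0)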
triplet_loss = gluon.loss.TripletLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.005})
for d1, d2, d3 in dataloader:
    d1 = d1.as_in_context(mx.gpu())
    d2 = d2.as_in_context(mx.gpu())
    d3 = d3.as_in_context(mx.gpu())
    with autograd.record():
        o1 = net(d1)
        o2 = net(d2)
        o3 = net(d3)
        loss = triplet_loss(o1, o2, o3)
    loss.backward()  # the backward pass is what requires _backward_norm
    trainer.step(d1.shape[0])

Is there any workaround for this problem other than CPU training? Any help is appreciated!
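
On the margin reasoning above: for unit vectors a and b, ||a - b||^2 = 2 - 2 (a . b), so the squared distance always lies in [0, 4]. Here is a minimal pure-Python check of that bound. The helper names are illustrative only and are not part of MXNet; the eps handling mirrors the L2_Normalize block above.

```python
import math
import random

def l2_normalize(v, eps=1e-5):
    # Divide by (L2 norm + eps), as in the L2_Normalize block above.
    norm = math.sqrt(sum(c * c for c in v)) + eps
    return [c / norm for c in v]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

random.seed(0)
dim = 16
max_seen = 0.0
for _ in range(1000):
    a = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
    b = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
    max_seen = max(max_seen, squared_distance(a, b))

# Opposite points on the sphere give the largest possible squared distance.
antipodal = squared_distance(l2_normalize([1.0, 0.0]), l2_normalize([-1.0, 0.0]))
print(max_seen, antipodal)
```

Since every squared distance is at most 4 after normalization, a margin can be picked relative to that fixed range instead of the unbounded scale of raw embeddings.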


Try computing the L2 norm directly from basic operators instead of calling norm. This works when x is a batch of vectors in (N, C) layout; for any other layout, reshape x to (N, C) first:

l2_x = (x*x).sum(axis=1).sqrt()

See the following example:

>>> x = mx.nd.array([[1, 2, 3, 4],[5,6,7,8]])
>>> x

[[ 1.  2.  3.  4.]
 [ 5.  6.  7.  8.]]
<NDArray 2x4 @cpu(0)>
>>> l2_x = (x*x).sum(axis=1).sqrt().expand_dims(axis=1)
>>> y = mx.nd.broadcast_div(x, l2_x)
>>> y

[[ 0.18257418  0.36514837  0.54772252  0.73029673]
 [ 0.37904903  0.45485884  0.53066862  0.60647845]]
<NDArray 2x4 @cpu(0)>

# Norm of the normalized vector:
>>> (y*y).sum(axis=1).sqrt()

[ 1.  1.        ]
<NDArray 2 @cpu(0)>


I replaced your implementation with the built-in L2Normalization operator, and it seems to give almost the same result. Please compare:

import mxnet
from mxnet import nd

a = nd.array([[1,2],[3,4]])

nd.L2Normalization(a, eps=1e-05, mode='instance')


[[ 0.44721317  0.89442635]
 [ 0.5999999   0.79999983]]


L2(a, 1e-05)

[[ 0.44721159  0.89442319]
 [ 0.59999877  0.7999984 ]]

That means you can replace your own L2 norm calculation with the out-of-the-box implementation. Since you are writing a HybridBlock, you can call F.L2Normalization inside hybrid_forward.
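
The small difference between the two outputs appears to come from where eps is applied. As far as I can tell from the operator documentation, L2Normalization in 'instance' mode computes x / sqrt(sum(x^2) + eps), while the custom block computes x / (sqrt(sum(x^2)) + eps). Here is a pure-Python sketch of both formulas (the function names are mine, for illustration only):

```python
import math

def normalize_eps_inside(row, eps=1e-5):
    # eps added under the square root: x / sqrt(sum(x^2) + eps)
    denom = math.sqrt(sum(c * c for c in row) + eps)
    return [c / denom for c in row]

def normalize_eps_outside(row, eps=1e-5):
    # eps added after the square root: x / (sqrt(sum(x^2)) + eps)
    denom = math.sqrt(sum(c * c for c in row)) + eps
    return [c / denom for c in row]

a = [[1.0, 2.0], [3.0, 4.0]]
builtin_like = [normalize_eps_inside(r) for r in a]
custom_like = [normalize_eps_outside(r) for r in a]
for r_in, r_out in zip(builtin_like, custom_like):
    print(r_in, r_out)
```

Both variants agree with the outputs above to about six decimal places, so for training purposes they should be interchangeable.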

Hope it helps.