Hi All,

I am training a Siamese network with triplet loss, following the FaceNet paper. In that paper (according to my understanding, please correct me if I am wrong), the embeddings of the anchor, positive, and negative are each divided by their L2 norm. This is done so that the resulting vectors (of dimension, say, d) live on the d-dimensional unit hypersphere. As a result, the distance between any two embeddings is bounded (at most 2 for unit vectors), and hence a proper margin can be chosen. I couldn't find an L2-norm layer in gluon, so I went ahead and implemented one. However, when I started training, I ran into the following blocker: **"operator _backward_norm is not implemented for gpu"**. I googled the exact error and couldn't find an exact hit; there were some partial hits whose suggested solution was recompiling MXNet properly, but that doesn't work for me as I am using the Amazon Deep Learning AMI for Ubuntu (version 12). As a workaround I can train on the CPU, but that is more than 10 times slower than on the GPU. Here is a minimal reproducible example:

```python
from mxnet import autograd
from mxnet import gluon, nd
import mxnet as mx

class L2_Normalize(gluon.HybridBlock):
    def __init__(self, eps=1e-05, **kwargs):
        super(L2_Normalize, self).__init__(**kwargs)
        self.eps = eps

    def hybrid_forward(self, F, x):
        # Divide each row by its L2 norm (eps avoids division by zero)
        l2_norm = F.reshape(F.norm(x, axis=1), (-1, 1)) + self.eps
        return F.broadcast_div(x, l2_norm)

net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(gluon.nn.Dense(512))
    net.add(L2_Normalize())
net.collect_params().initialize(mx.init.Xavier(), ctx=mx.gpu())
net.hybridize()

# Dummy anchor / positive / negative triplets
dataset = gluon.data.dataset.ArrayDataset(
    nd.random.normal(shape=(100, 512)),
    nd.random.normal(shape=(100, 512)),
    nd.random.normal(shape=(100, 512)))
dataloader = gluon.data.DataLoader(dataset, batch_size=50, num_workers=0)

triplet_loss = gluon.loss.TripletLoss()
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.005})

for d1, d2, d3 in dataloader:
    d1 = d1.as_in_context(mx.gpu())
    d2 = d2.as_in_context(mx.gpu())
    d3 = d3.as_in_context(mx.gpu())
    with autograd.record():
        o1 = net(d1)
        o2 = net(d2)
        o3 = net(d3)
        loss = triplet_loss(o1, o2, o3)
    loss.backward()
```

Is there any workaround for this problem other than CPU training? Any help is appreciated!