How to implement custom loss functions without label assignments (unsupervised)?


the example #5580 helped me a lot in starting to understand the data flow. thanks a lot @pengwangucla @saicoco.
now I want to implement three custom loss functions which not only take an additional parameter (specifically a hyperparameter, not a learned one) but are also independent of the label (the training is unsupervised, and from the perspective of that new layer the loss only depends on a binary transformation of the preceding layer).
W.r.t. this I’ve got 4 questions:

  1. am I right to assume that we can either use mx.operator.CustomOp or mx.operator.NDArrayOp ?

  2. if I compare MXNet’s implementation of the forward()/backward() pass to Caffe’s:
    template <typename Dtype> void CustomLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top)
    def forward(self, is_train, req, in_data, out_data, aux):
    in_data contains the data of one batch (each row corresponds to one sample) and not just one sample, correct?

  3. I’d like to train the network in Python and do the inference in C++ later: what is the way to do this?
    c_predict_api.h, or
    mxnet-cpp/MxNetCpp.h as in
    in other words, will these loss layers be registered for C++ as well, or do we strictly have to rely on NNVM?

  4. I still don’t quite get the example on mxnet.incubator for new_op: if we look at the mathematical definition of the softmax function and its layer, it’s a function of the learned weight parameters w_i and the inputs x_i, so why are there no weights in the forward() pass?


Hey @simomaur
For training, I would recommend using the Gluon API. Implementing a custom loss is straightforward. Have a look at these files for the currently existing losses: incubator-mxnet/python/mxnet/gluon/ You can just inherit from the base Loss class and write your own hybrid_forward. The backward propagation will be computed automatically by autograd. However, that only holds true if you can build your loss from existing operators.

  1. If you use Gluon and your loss can be computed with NDArray operators, you do not need to implement a custom operator; just implement your own Loss or HybridBlock. I would recommend following this tutorial:

  2. For a Hybrid block:

     def hybrid_forward(self, F, x, *args, **kwargs):
         """Overrides to construct symbolic graph for this `Block`.

         x : Symbol or NDArray
             The first input tensor.
         *args : list of Symbol or list of NDArray
             Additional input tensors.
         """

The input x depends entirely on you. The traditional layout for CNNs is NCHW, which means the first dimension of your input x indexes the elements of your batch. But it is totally up to you; you control the input.

  3. If you do the training in Python and the inference in C++, can you elaborate on why you need to compute the loss at inference time in C++?

  4. The softmax function does not have parameters/weights and only depends on the input. Some frameworks have a ‘softmax layer’ which is actually a dense layer with a softmax activation. In this case we are only implementing the softmax function:
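
To make the distinction concrete, here is a minimal framework-free sketch (plain Python, not MXNet code). The names `softmax` and `dense` and the weight values are made up for illustration. The softmax itself takes only its input, while any learned weights sit in a separate dense layer placed in front of it.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats; it has no learned weights."""
    m = max(xs)  # subtracting the max does not change the result but avoids overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense(xs, weights, bias):
    """Fully connected layer: this is where the learned parameters live."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, bias)]

# What some frameworks call a "softmax layer" is dense followed by softmax.
x = [1.0, 2.0]
W = [[0.5, -0.5], [0.0, 1.0], [1.0, 1.0]]  # hypothetical weights
b = [0.0, 0.0, 0.0]                        # hypothetical bias
probs = softmax(dense(x, W, b))
```

In a custom-operator setting only the `softmax` part would go into forward(); the weights belong to whatever layer produces its input.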


@ThomasDelteil thanks a lot for your quick and comprehensive answer.
regarding 2): at some point I will have to use a fork of MXNet called BMXNet, currently based on MXNet 1.0.0. they have implemented optimized GEMM kernels, i.e. specialized layers for binary neural networks (based on the XNORNet/DoReFaNet papers). since they use the Symbol API, Gluon would not be compatible, which is why I arrived at subclassing NDArrayOp/CustomOp. both would work, correct?
regarding 3): actually during inference I don’t need the loss but the binarization of some FullyConnected layer, i.e. a layer that just binarizes during the forward pass (but that one must be available in the underlying C++ implementation as well).
to give you some more insight: I’m aiming to implement the following loss based on Caffe framework
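
A layer that only binarizes in the forward pass could look roughly like the framework-free sketch below (plain Python, not BMXNet or MXNet code). The function names, the threshold of 0.0 and the clip value of 1.0 are assumptions for illustration. Because a hard threshold has zero gradient almost everywhere, the backward pass here uses a straight-through estimator, which is one common surrogate in the binary-network literature and not necessarily what the Caffe layer does.

```python
def binarize_forward(xs, threshold=0.0):
    """Map each activation to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if x > threshold else 0.0 for x in xs]

def binarize_backward(out_grads, xs, clip=1.0):
    """Straight-through estimator (assumed surrogate gradient).

    The incoming gradient is passed through unchanged where |x| <= clip
    and zeroed elsewhere.
    """
    return [g if abs(x) <= clip else 0.0 for g, x in zip(out_grads, xs)]

activations = [-0.3, 0.2, 1.5]
codes = binarize_forward(activations)
grads = binarize_backward([1.0, 2.0, 3.0], activations)
```

At inference time only `binarize_forward` would be needed, so the C++ side would have to provide just that part.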


@simomaur, I see. Have a look at this tutorial to implement a custom operator:
Have you looked into using your Caffe layer directly in MXNet?


@ThomasDelteil will have a look at it for sure. you mean converting the Caffe layers using the scripts provided by the MXNet framework?


@ThomasDelteil the tutorial you linked again (from Gluon) also uses the operator classes provided by the mx.operator module. the sigmoid example completely makes sense; I’ll have a look at the Softmax implementation again. it seems to me that the Gluon and Symbol APIs are somehow interchangeable, as the operators accept the same input when you create your model… or am I missing something Gluon can provide (the Keras Sequential and Model APIs came to my mind as a parallel example)?


Gluon and Symbol API can both use the same custom operators. For using it in the Gluon API follow the guide I linked to. For the Symbol API, to use the custom operator, create a mx.sym.Custom symbol with op_type as the registered name:

mlp = mx.symbol.Custom(data=fc3, name='softmax', op_type='softmax')

taken from (


@ThomasDelteil ok, got it so far. to push it a little bit further and referring to your statement using mx.symbol.Custom(…):
let’s assume I have data batches (images) from a custom iterator (inherited from …). I have augmented the image data by using rotations with specific angles (let’s denote these rotations per sample as alpha={0, 5, 10, 15, 20} in degrees). these are supplied as part of the sample instead of a label and shall be used inside a custom loss layer. how would you access these internally?
in other words, can we do that in MXNet with the Python interface, or do we have to rely on a C++ implementation and register it using NNVM, s.t. after compiling we can use it as:

mlp = mx.symbol.CustomAlphaDependantLoss(data=fc1, name='custom_loss')


@simomaur you don’t need to go to C++, you just need to declare it as an extra input. You can find below an example I wrote for a custom multiplication that takes two input parameters. One with the data and one with the associated factor. I hope that helps

import mxnet as mx
import numpy as np

class CustomMult(mx.operator.CustomOp):
    def forward(self, is_train, req, in_data, out_data, aux):
        # in_data follows the order of list_arguments(): [data, factor]
        y = in_data[0]*in_data[1]
        self.assign(out_data[0], req[0], y)

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # todo implement gradient calculation
        # placeholder only: this copies the output and is NOT the true gradient
        y = out_data[0]
        self.assign(in_grad[0], req[0], y)

# the registered name must match the op_type passed to mx.symbol.Custom below
@mx.operator.register("custom_mult")
class CustomMultProp(mx.operator.CustomOpProp):
    def __init__(self):
        super(CustomMultProp, self).__init__(need_top_grad=False)

    def list_arguments(self):
        # every name listed here becomes one entry of in_data
        return ['data', 'factor']

    def list_outputs(self):
        return ['output']

    def infer_shape(self, in_shape):
        data_shape = in_shape[0]
        mult_shape = in_shape[1]
        output_shape = in_shape[0]
        return [data_shape, mult_shape], [output_shape], []

    def infer_type(self, in_type):
        dtype = in_type[0]
        return [dtype, dtype], [dtype], []

    def create_operator(self, ctx, shapes, dtypes):
        return CustomMult()

data = mx.sym.var('data')
# 'factor' is not passed explicitly, so it becomes a free variable named 'mult_factor'
net = mx.symbol.Custom(data=data, name='mult', op_type='custom_mult')
batch_size = 5
d = mx.nd.array(np.arange(batch_size)).reshape((batch_size, 1, 1))*mx.nd.ones((batch_size, 3, 3))
f = mx.nd.array(np.arange(batch_size)).reshape((batch_size, 1, 1))
c = net.bind(args={'data': d, 'mult_factor': f}, ctx=mx.cpu())
print("Input Data", d)
Input Data 
[[[ 0.  0.  0.]
  [ 0.  0.  0.]
  [ 0.  0.  0.]]

 [[ 1.  1.  1.]
  [ 1.  1.  1.]
  [ 1.  1.  1.]]

 [[ 2.  2.  2.]
  [ 2.  2.  2.]
  [ 2.  2.  2.]]

 [[ 3.  3.  3.]
  [ 3.  3.  3.]
  [ 3.  3.  3.]]

 [[ 4.  4.  4.]
  [ 4.  4.  4.]
  [ 4.  4.  4.]]]
<NDArray 5x3x3 @cpu(0)>
print("Multiplication factor", f)
Multiplication factor 
[[[ 0.]]

 [[ 1.]]

 [[ 2.]]

 [[ 3.]]

 [[ 4.]]]
<NDArray 5x1x1 @cpu(0)>
print("Moving data forward through the symbol\n", c.forward())
Moving data forward through the symbol
 [[[  0.   0.   0.]
   [  0.   0.   0.]
   [  0.   0.   0.]]
  [[  1.   1.   1.]
   [  1.   1.   1.]
   [  1.   1.   1.]]
  [[  4.   4.   4.]
   [  4.   4.   4.]
   [  4.   4.   4.]]
  [[  9.   9.   9.]
   [  9.   9.   9.]
   [  9.   9.   9.]]
  [[ 16.  16.  16.]
   [ 16.  16.  16.]
   [ 16.  16.  16.]]]
 <NDArray 5x3x3 @cpu(0)>]


oh man, I got the example. really kind of you to take the time to come up with an example. thank you very much!
I now also got the relation, i.e. in_data[i] contains the data of whatever you defined as input using named variables, not just strictly ‘data’ and ‘label’.


@ThomasDelteil could you quickly check the link I provided (snippet from Caffe: Loss)?

  1. if I got it right, the forward/backward passes in Caffe only contain one sample at a time, i.e.
    from input std::vector<Blob<Dtype>*> bottom:
    bottom[0] only contains one input sample
    whereas in MXNet:
    in_data[0] contains a batch of samples
  2. now if my assumption is right, they need to save the losses on every call to forward to a member of the class (i.e. in diff_.mutable_cpu_data()[0]) to get the loss over ALL samples.
    so following the tutorial links you provided, how do I assign the loss (a scalar) in MXNet using:
    self.assign(dst=out_data[0], req=req[0], src=?)
    The output usually assigned there as src is a vector (containing the output of a function over every sample input).
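
One way to think about the scalar is as a reduction over the per-sample values of the batch. The sketch below is framework-free (plain Python, not MXNet code) and uses a hypothetical quantization-style loss, namely the squared distance between each activation and its binarized value, averaged over the batch. The function name, the threshold of 0.5 and the loss itself are assumptions for illustration and are not taken from the linked Caffe code.

```python
def quantization_loss(batch, threshold=0.5):
    """Return (scalar loss, per-element gradients) for a batch of activations.

    batch: list of samples, each sample a list of floats.
    The binary code b is treated as a constant when differentiating.
    """
    n = len(batch)
    loss = 0.0
    grads = []
    for sample in batch:
        row = []
        for x in sample:
            b = 1.0 if x > threshold else 0.0  # binary transformation of the input
            diff = x - b
            loss += diff * diff
            row.append(2.0 * diff / n)         # d(loss)/dx for the batch mean
        grads.append(row)
    return loss / n, grads

loss, grads = quantization_loss([[0.9, 0.1], [0.6, 0.4]])
```

In a CustomOp, the gradients would be what backward() assigns to in_grad[0]. For the forward output, the scalar could presumably be wrapped in an array whose shape matches what infer_shape declares for the output (for example a single element), or the per-sample values could be assigned as usual and reduced afterwards. Both options are assumptions to verify against the MXNet version in use.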