Loss function in MXNet C++



I have a very simple question, yet I can’t figure out what to do. Where and how to define the loss function when using mxnet in C++? And what is the default loss function?



See the basics and these examples.


Look at:

Examples are using the SoftmaxOutput.
The SoftmaxOutput is the softmax AND loss layer at the same time.
To implement another loss strategy, you must use MakeLoss.

ref. SoftmaxOutput source is in: src/operator/softmax_output-inl.h
ref. https://stackoverflow.com/questions/48044595/a-weighted-version-of-softmaxoutput-in-mxnet
ref. https://mxnet.incubator.apache.org/tutorials/r/CustomLossFunction.html


@ehsanmok Thanks, but that is indeed what I tried to do, as any newbie would. But I fell in the pitfall described by @alinagithub.

@alinagithub Thank you! That was indeed my problem. The mix between the last layer and the loss function is quite confusing! All other packages tend to separate those two. Why this choice? And why doesn’t the C++ tutorial for mxnet explain the most generic way to create the loss function?


Looking at the C++ examples, it seems that none of them uses the make_loss function. Is that normal?
And how to get the loss computed during the forward pass?


Hi @dmidge,

As I understand it, SoftmaxOutput combines the softmax and the loss in a single layer so that the model can be used for training and inference without any changes. In the forward pass, only the softmax is calculated; in the backward pass, the gradient of the loss is calculated and starts to propagate back. When using the model for inference, only the softmax is calculated, as it’s the only thing that’s needed.

Another reason to combine softmax and loss is for improved numerical stability and reduced memory requirements. Many other frameworks have a similar concept. See Caffe’s equivalent.
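To make the point about the combined layer concrete, here is a minimal standalone C++ sketch. It does not use the MXNet API; the function names (softmax, cross_entropy, fused_gradient) are made up for illustration. It assumes a single sample with an integer class label, and shows that the gradient of softmax followed by cross-entropy reduces to softmax minus the one-hot label, so the backward pass never needs the loss value itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtracting the largest logit before exp()
// avoids overflow without changing the result.
std::vector<double> softmax(const std::vector<double>& logits) {
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Cross-entropy for one sample with an integer class label: -log(p[label]).
double cross_entropy(const std::vector<double>& logits, std::size_t label) {
    return -std::log(softmax(logits)[label]);
}

// Gradient of the cross-entropy with respect to the logits.
// For softmax followed by cross-entropy this is softmax - one_hot(label).
std::vector<double> fused_gradient(const std::vector<double>& logits,
                                   std::size_t label) {
    std::vector<double> grad = softmax(logits);
    grad[label] -= 1.0;
    return grad;
}
```

Calling softmax(z) plays the role of the forward pass and fused_gradient(z, label) the role of the backward pass. If you want the loss value as well, you have to compute it separately, as cross_entropy does here.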


Hi @thomelane,

Thanks for your further explanations! :slight_smile:
But then, it is not really possible to get the value of the loss being computed during the backward pass, is it?
And if I want to split this softmax stage, what would it be?
Symbol lenet = SoftmaxOutput("softmax", fc2, data_label);

For example, if I want to use a different loss function?
I guess I should use Symbol lenet = Softmax("softmax", fc2); for that, but I quickly get a deprecation warning message.
And then, there is the loss function. I guess that there are predefined loss functions. (I’ve seen LinearRegressionOutput or MAERegressionOutput. I don’t think they have other purposes than attaching the loss function, do they?).
And the usage of MakeLoss is a bit foggy to me. I would have thought that it would take three mandatory inputs: the last layer of the network it is applied to, the expected output layer node, and a function to do the computation. Instead it seems to take only one input…

Thanks again! :slight_smile:


The loss function you typically want to use with a softmax output is cross-entropy loss, and if you want to use it, you’d need to implement the loss like this:

cross_entropy = -(label * log(out) + (1 - label) * log(1 - out))
loss = MakeLoss(cross_entropy)

Typically, when calculating the gradient of a block, you need access to the gradient of the upper blocks (based on the chain rule). MakeLoss() acts as the termination of the network and feeds a 1 to the lower layers in the backward() call. If you want to simplify this concept for yourself, just know that you cannot have dangling symbols in your computational graph. They either have to end in a loss (one of the internal loss operators, or MakeLoss() with your own custom loss) or be explicitly excluded from gradient calculation using sym.BlockGrad().
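The idea of a terminal node that feeds a 1 into the backward pass can be sketched in plain C++. This is only an illustration of the chain rule, not MXNet code: TerminalLoss and backward_to_out are hypothetical names, and the grad_scale member is an assumption modelled on the scaling option of the real operator. The loss is the binary cross-entropy for a single prediction out in (0, 1), written with a leading minus sign so that minimizing it improves the prediction.

```cpp
#include <cassert>
#include <cmath>

// Binary cross-entropy for one prediction `out` in (0, 1) and a label in {0, 1}.
double binary_cross_entropy(double label, double out) {
    return -(label * std::log(out) + (1.0 - label) * std::log(1.0 - out));
}

// Local derivative d(cross_entropy) / d(out).
double binary_cross_entropy_grad(double label, double out) {
    return -(label / out - (1.0 - label) / (1.0 - out));
}

// Hypothetical stand-in for a terminal loss node (not the MXNet class).
struct TerminalLoss {
    double grad_scale = 1.0;  // assumed scaling factor, 1 by default

    // Forward is a pass-through: the node's output is the loss value itself.
    double forward(double loss_value) const { return loss_value; }

    // Backward has no upstream gradient to read, so it seeds the chain rule.
    double head_gradient() const { return grad_scale; }
};

// Chain rule: gradient reaching `out` = head gradient * local derivative.
double backward_to_out(const TerminalLoss& node, double label, double out) {
    return node.head_gradient() * binary_cross_entropy_grad(label, out);
}
```

With label = 1 and out = 0.8, the gradient reaching out is -1 / 0.8 = -1.25. Everything below out in the network would keep applying the chain rule from that value.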


Thank you a lot for your additional help @safrooze!

I still have some related questions. Whenever I create a custom loss function, it seems to change the output node that I can fetch after a forward pass. I couldn’t identify what output I get exactly, but it looks like the cross_entropy loss value.
Anyway, how to choose the output layer from the Executor then?