Multiple losses



I’m working on two custom loss functions and was able to implement them as custom layers inheriting from mx.operator.CustomOp. The problem is that I haven’t gotten the backward() passes right yet, so I switched to the Gluon API (which has automatic differentiation) and implemented hybrid_forward().
Each loss works individually. Now I’m trying to combine them and minimize both jointly for a deep neural network model (pretrained VGG16, with the last SoftmaxOutput replaced by a FullyConnected layer with 256 neurons). I successfully loaded the sym/arg_params using mx.gluon.nn.SymbolBlock and initialized the weights (including random initialization for the newly added last layer).

Using Symbol API we can do:

loss1 = mx.sym.Custom(data1, name='loss1', op_type='customloss1')
loss2 = mx.sym.Custom(data2, name='loss2', op_type='customloss2')

combined = mx.sym.Group([loss1, loss2])

and then load and bind it with the Module API.

  1. How can I achieve the same thing using Gluon API?
  2. The default losses (in Gluon API, e.g. L2Loss) are always vectors of losses (equalling batch_size) - can we return a single loss (mean over batch) or any number s.t. autograd still works?

I encountered the following error, when using loss_1.backward() and immediately loss_2.backward() outside of with autograd.record():

mxnet.base.MXNetError: [15:01:00] src/imperative/ Check failed: xs.size() > 0 (0 vs. 0) There are no inputs in computation graph that require gradients.

Might this be related to the fact that loss_1 is of shape (batch_size,1) while loss_2 is of shape (num_neurons_of_last_layer, ) ?


trainer = gluon.Trainer(net_params, optimizer=optimizer, optimizer_params={'learning_rate': lr})
for epoch in range(n_epochs):
    print('Epoch %d started' % epoch)
    for i, batch in enumerate(iterator):
        print('Batch %i: %s' % (i, batch.data[0].shape))
        with autograd.record():
            output = net(batch.data[0])
            first_loss = mq_loss(output, None)
            second_loss = ed_loss(output, None)
    print('Epoch %d finished!' % epoch)


You want to combine your losses so that they can be propagated together. I would suggest combining them into a single scalar, for example:

loss = first_loss.sum() + lambda_ * second_loss.sum()

where lambda_ is a constant of your choosing that balances the effect of each loss in your overall loss term (written with a trailing underscore because lambda is a reserved word in Python). Compute this inside with autograd.record(): and then call loss.backward() once.


thanks a lot. are you sure it needs to be .sum(), not .mean()? according to a paper I’m implementing, the parameters that balance each loss are both 1.0, so in the end it doesn’t balance out anything:
loss = alpha(=1.0)*first_loss.sum() + beta(=1.0)*second_loss.sum()


@simomaur you can use sum() or mean() or whatever makes sense in your specific use case. Since you said your losses do not have the same shape, I didn’t want to assume anything :) mean is sum/n, so it is only a scaling factor away from the other.
Regarding your second point, don’t worry: it is not compulsory to have different weights for each loss, and 1 is as valid a weight as any other if it gives you the expected result.
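To make the sum/mean point concrete, here is a minimal plain-Python sketch (no MXNet involved). The loss values and the helper name combine_losses are made up for illustration; only the idea of reducing each loss to a scalar before adding them comes from the discussion above.

```python
def combine_losses(first, second, alpha=1.0, beta=1.0, reduce="sum"):
    """Reduce two per-element loss lists to scalars and add them with weights."""
    def reduce_fn(values):
        total = sum(values)
        return total / len(values) if reduce == "mean" else total
    return alpha * reduce_fn(first) + beta * reduce_fn(second)

# Hypothetical values: one entry per sample (batch_size = 4) ...
first_loss = [0.5, 1.5, 1.0, 1.0]
# ... and one entry per output neuron (2 neurons), so the shapes differ.
second_loss = [0.25, 0.75]

total_sum = combine_losses(first_loss, second_loss, reduce="sum")    # 4.0 + 1.0
total_mean = combine_losses(first_loss, second_loss, reduce="mean")  # 1.0 + 0.5

# mean() is sum() with each weight divided by that loss's own length
rescaled = combine_losses(first_loss, second_loss, alpha=1 / 4, beta=1 / 2, reduce="sum")
```

Because the two losses have different lengths, switching from sum() to mean() rescales them by different factors, which amounts to choosing different alpha and beta. With weights of 1.0 taken from a paper, it is worth checking which reduction the paper itself uses.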


ah I see ;). meanwhile I managed to run several train iterations. first loss fine so far. what I observed is that for the second loss, the optimization is strange. strange in the following way:
let’s assume I have a batch_size of b with an input dimension of n, ie. cols:8, rows: n for the input NDArray x. in hybrid_forward(self, F, x) I binarize the input x, such that we have b samples of size n, with n(i) either 0/1. what the second loss should minimize is the mean over the number of bits, s.t. the bits will be evenly distributed (across the samples, e.g. first bit(col) is 1 for the first 4 samples, 0 for the other 4). what I get when inspecting the output (say for 8 samples, n=8) is that the cols are
is that a problem because of how the backpropagation in the MXNet backend works, ie. it assumes a loss vector with size (batch_size, )?

def hybrid_forward(self, F, x, **kwargs):
    y = mx.nd.sign(x)
    b = 0.5 * (y + 1)
    mu_n = F.mean(b_n, axis=0)
    loss = F.square(mu_n - 0.5)

To further validate this, I’m overfitting to the data (with only the second loss) by repeatedly feeding in the same 8 samples (equal to batch_size).


I think you are missing a word here.

Also in the code you pasted, b_n is not defined.

You said you want to minimize the mean over the number of bits, as in the mean across samples for a given column, or minimize the mean over the number of bits for each sample?

Because right now you are using F.mean(..., axis=0) which gives you the mean across samples. I think you might want to use F.mean(..., axis=1).

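Using your notation and n=4:

import mxnet as mx

b = 8
n = 4
x = mx.ndarray.random.uniform(shape=(b, n)).round()
print(x)  # output below is from one random draw

[[ 1.  1.  1.  0.]
 [ 1.  0.  1.  0.]
 [ 1.  0.  0.  1.]
 [ 1.  1.  1.  0.]
 [ 1.  0.  0.  1.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  1.]
 [ 1.  0.  0.  0.]]
<NDArray 8x4 @cpu(0)>

print(mx.nd.mean(x, axis=0))  # mean across samples, one value per column

[ 0.75  0.25  0.5   0.5 ]
<NDArray 4 @cpu(0)>

print(mx.nd.mean(x, axis=1))  # mean across bits, one value per sample

[ 0.75  0.5   0.5   0.75  0.5   0.25  0.5   0.25]
<NDArray 8 @cpu(0)>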


yes, you’re correct. I denoted b as the batch_size and b_n as the binary array with cols:b and rows:n.
so the aim is to evenly distribute the bits across the cols (so axis=0 should be correct?), but after several epochs several bins (cols) contain only zeros or only ones.
for 1470 samples (training set), do I need to increase the batch_size and/or the number of bits to draw a valid conclusion about the output?
this is why I tried to use exactly the same 8 training samples to overfit to the data, which would allow us to see whether the optimization function is correct.
again, as noted in DeepBit, the second loss they are optimizing first calculates the mean for each bin (so col!) and then calculates the mean of squared(mu_n - 0.5) over these bins
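As a sanity check of the objective as described here (this is a plain-Python sketch, not DeepBit's reference code and not the Gluon implementation), the function below computes the per-column mean and then the mean of squared(mu_n - 0.5) over the columns. The name bit_balance_loss and the small balanced/collapsed matrices are made up for illustration; the 8x4 matrix is the example posted earlier in this thread.

```python
def bit_balance_loss(bits):
    """Mean over columns of (column_mean - 0.5)^2 for equal-length rows of 0/1 values."""
    n_rows = len(bits)
    n_cols = len(bits[0])
    # Per-column mean across samples (the axis=0 reduction)
    col_means = [sum(row[j] for row in bits) / n_rows for j in range(n_cols)]
    return sum((mu - 0.5) ** 2 for mu in col_means) / n_cols

# 8x4 example from earlier in the thread; column means are 0.75, 0.25, 0.5, 0.5
example = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]
# Hypothetical extremes: every column half ones, versus columns stuck at all 0 / all 1
balanced = [[1, 0], [0, 1], [1, 0], [0, 1]]
collapsed = [[0, 1], [0, 1], [0, 1], [0, 1]]

example_loss = bit_balance_loss(example)
balanced_loss = bit_balance_loss(balanced)
collapsed_loss = bit_balance_loss(collapsed)
```

Under this definition, columns that are all zeros or all ones give the largest possible value (0.25 per column) and perfectly balanced columns give 0, so collapsed columns after training suggest the loss is not actually being driven down, rather than that axis=0 is the wrong axis.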


one remark: