Multiple losses

simomaur · May 26, 2018, 12:54pm

Hi

I’m working on two custom loss functions and was able to implement them as custom layers by inheriting from mx.operator.CustomOp. The problem is that I haven’t got the backward() passes right yet, so I switched to the Gluon API (which has automatic differentiation) and implemented hybrid_forward() instead.
Each loss works on its own. Now I’m trying to combine them and minimize both at once for a deep neural network (a pretrained VGG16 whose final SoftmaxOutput I replaced with a FullyConnected layer of 256 neurons). I successfully loaded the sym/arg_params using mx.gluon.nn.SymbolBlock and initialized the weights (including random initialization for the newly added last layer).

Using Symbol API we can do:

loss1 = mx.sym.Custom(data1, name='loss1', op_type='customloss1')
loss2 = mx.sym.Custom(data2, name='loss2', op_type='customloss2')

combined = mx.sym.Group([loss1, loss2])

and then load and bind it with the Module API.

How can I achieve the same thing using Gluon API?
The default losses in the Gluon API (e.g. L2Loss) always return a vector of losses, one per sample (length batch_size). Can we return a single loss (mean over the batch), or any scalar, such that autograd still works?

I got the following error when calling loss_1.backward() and then immediately loss_2.backward() outside of the with autograd.record() block:

mxnet.base.MXNetError: [15:01:00] src/imperative/imperative.cc:429: Check failed: xs.size() > 0 (0 vs. 0) There are no inputs in computation graph that require gradients.

Might this be related to the fact that loss_1 is of shape (batch_size,1) while loss_2 is of shape (num_neurons_of_last_layer, ) ?

Code:

trainer = gluon.Trainer(net_params, optimizer=optimizer, optimizer_params={'learning_rate': lr})
for epoch in range(n_epochs):
    print('Epoch %d started' % (epoch))
    iterator.reset()
    for i, batch in enumerate(iterator):
        print('Batch %i: %s' % (i, batch.data[0].shape))
        with autograd.record():
            output = net(batch.data[0])
            first_loss = mq_loss(output, None)
            second_loss = ed_loss(output, None)
        first_loss.backward()
        second_loss.backward()
        trainer.step(batch_size=batch_size)
    print('Epoch %d finished!' % (epoch))

ThomasDelteil · May 28, 2018, 6:28pm

You want to combine your losses so that they are backpropagated together. I suggest combining them into a single scalar, for example:

loss = first_loss.sum() + lam * second_loss.sum()
loss.backward()

where lam is a constant of your choosing that balances the contribution of each loss to the overall loss term (it is named lam here because lambda is a reserved word in Python).
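To see why a single backward pass on the combined scalar is enough, here is a minimal sketch in plain Python (no MXNet needed). The two toy losses, the parameter `w`, and the weight `lam` are made up for illustration and are not the losses from this thread. It checks numerically that the gradient of the combined scalar equals the weighted sum of the individual gradients.

```python
# Toy check: d/dw [L1(w) + lam * L2(w)] == dL1/dw + lam * dL2/dw.
# Both losses are hypothetical stand-ins, not the losses from this thread.

def first_loss(w):
    return (w - 3.0) ** 2        # analytic gradient: 2 * (w - 3)

def second_loss(w):
    return (w + 1.0) ** 2        # analytic gradient: 2 * (w + 1)

def numeric_grad(f, w, eps=1e-6):
    # Central finite-difference approximation of df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

lam = 0.5   # weight of the second loss (assumed value)
w = 2.0     # current parameter value (assumed value)

def combined(v):
    return first_loss(v) + lam * second_loss(v)

grad_combined = numeric_grad(combined, w)
grad_separate = numeric_grad(first_loss, w) + lam * numeric_grad(second_loss, w)

print(grad_combined, grad_separate)   # both are approximately 1.0
```

So calling backward() once on the weighted sum should give the same parameter update as accumulating the two gradients separately, which is why a single scalar is the simpler route.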

simomaur · May 30, 2018, 9:51am

Thanks a lot. Are you sure it needs to be .sum(), not .mean()? According to the paper I’m implementing (DeepBit), the parameters that balance the two losses are both 1.0, so in the end they don’t balance anything:
loss = 1.0 * first_loss.sum() + 1.0 * second_loss.sum()  # alpha = beta = 1.0

ThomasDelteil · May 30, 2018, 5:27pm

@simomaur you can use sum() or mean() or whatever makes sense in your specific use case. Since you said your losses do not have the same shape, I didn’t want to assume anything. mean is sum/n, so the two are only a scaling factor apart.
Regarding your second point, don’t worry: it is not compulsory to have different weights for each loss, and 1 is as valid a weight as any other if it gives you the expected result.
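The "only a scaling factor" point can be checked with a small plain-Python sketch. The per-sample loss and the data below are assumptions chosen for illustration. It compares the gradient of the summed loss with the gradient of the mean loss.

```python
# mean = sum / n, so the two gradients differ by exactly the factor n.

def per_sample_losses(w, xs):
    # Hypothetical per-sample loss: squared error of w against each x.
    return [(w - x) ** 2 for x in xs]

def loss_sum(w, xs):
    return sum(per_sample_losses(w, xs))

def loss_mean(w, xs):
    return loss_sum(w, xs) / len(xs)

def numeric_grad(f, w, eps=1e-6):
    # Central finite-difference approximation of df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

xs = [1.0, 2.0, 4.0, 5.0]   # made-up batch of 4 samples
w = 0.0

g_sum = numeric_grad(lambda v: loss_sum(v, xs), w)
g_mean = numeric_grad(lambda v: loss_mean(v, xs), w)

print(g_sum, g_mean, g_sum / g_mean)   # the ratio is approximately len(xs) = 4
```

With plain SGD the choice therefore mostly amounts to rescaling the learning rate. One thing to keep in mind: as far as I know, trainer.step(batch_size) in Gluon already normalizes the gradient by 1/batch_size, so combining .mean() with it scales the gradient down twice.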

simomaur · May 31, 2018, 3:36pm

Ah, I see ;). Meanwhile I managed to run several training iterations. The first loss looks fine so far, but the optimization of the second loss behaves strangely, in the following way:
Let’s assume I have a batch_size of b with an input dimension of n, i.e. cols: 8, rows: n for the input NDArray x. In hybrid_forward(self, F, x) I binarize the input x, so that we have b samples of size n with every entry either 0 or 1. What the second loss should minimize is the mean over the number of bits, such that the bits are evenly distributed across the samples (e.g. the first bit (col) is 1 for the first 4 samples and 0 for the other 4). What I get when inspecting the output (say for 8 samples, n=8) is that the cols are
Is that a problem because of how backpropagation works in the MXNet backend, i.e. does it assume a loss vector of size (batch_size,)?

Code:

def hybrid_forward(self, F, x, **kwargs):
    y = mx.nd.sign(x)
    b = 0.5 * (y + 1)
    mu_n = F.mean(b_n, axis=0)
    loss = F.square(mu_n - 0.5)

To further validate this, I’m overfitting to the data (with only the second loss) by repeatedly feeding in the same 8 samples (equal to batch_size).
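One possible explanation worth checking, offered as a guess rather than something confirmed in this thread: sign() is piecewise constant, so a loss built on top of it is flat with respect to the real-valued input almost everywhere, and no useful gradient can flow back. The plain-Python sketch below (made-up activations, a hypothetical `bit_balance_loss` helper, no MXNet) nudges each input slightly and measures the change in the loss.

```python
# Finite-difference check: a loss built on sign(x) does not react to small changes in x.

def sign(v):
    return (v > 0) - (v < 0)     # -1, 0 or 1

def bit_balance_loss(column):
    # column: real-valued activations of one bit across the batch
    bits = [0.5 * (sign(v) + 1) for v in column]   # map {-1, 1} to {0, 1}
    mu = sum(bits) / len(bits)
    return (mu - 0.5) ** 2

column = [0.7, -1.2, 0.3, 2.1, -0.4, 0.9, 1.5, 0.2]   # made-up activations
eps = 1e-3
base = bit_balance_loss(column)

grads = []
for i in range(len(column)):
    bumped = list(column)
    bumped[i] += eps             # nudge one activation
    grads.append((bit_balance_loss(bumped) - base) / eps)

print(base)    # 0.0625
print(grads)   # all 0.0: the loss is flat around these inputs
```

If this is what happens in the real network, the bit columns would not be pushed towards a 50/50 split by this loss alone, whatever the shape of the loss output.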

ThomasDelteil · May 31, 2018, 4:58pm

I think you are missing a word here.

Also in the code you pasted, b_n is not defined.

You said you want to minimize the mean over the number of bits, as in the mean across samples for a given column, or minimize the mean over the number of bits for each sample?

Because right now you are using F.mean(..., axis=0) which gives you the mean across samples. I think you might want to use F.mean(..., axis=1).

Using your notation, with n=4:

b = 8
n = 4
x = mx.ndarray.random.uniform(shape=(b, n)).round()

x
[[ 1.  1.  1.  0.]
 [ 1.  0.  1.  0.]
 [ 1.  0.  0.  1.]
 [ 1.  1.  1.  0.]
 [ 1.  0.  0.  1.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  1.]
 [ 1.  0.  0.  0.]]
<NDArray 8x4 @cpu(0)>


x.mean(axis=0)
[ 0.75  0.25  0.5   0.5 ]
<NDArray 4 @cpu(0)>

x.mean(axis=1)
[ 0.75  0.5   0.5   0.75  0.5   0.25  0.5   0.25]
<NDArray 8 @cpu(0)>

simomaur · June 1, 2018, 9:27am

Yes, you’re correct. I denoted b as the batch_size and b_n as the binary array with cols: b and rows: n.
So the aim is to distribute the bits evenly across the cols (so axis=0 should be correct?), but after several epochs several bins (cols) contain only zeros or only ones.
For 1470 samples (the training set), do I need to increase the batch_size and/or the number of bits to draw a valid conclusion about the output?
This is why I tried to overfit on exactly the same 8 training samples, which would allow us to see whether the optimization function is correct.
Again, as noted in DeepBit, the second loss they optimize first calculates the mean for each bin (so col!) and then calculates the mean of squared(mu_n - 0.5) over these bins.
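For reference, the computation described here (mean per column, then mean of the squared deviation from 0.5) can be written out in plain Python on the 8x4 example matrix from earlier in the thread. This is only a sketch of the arithmetic, not the Gluon implementation.

```python
# Bit-balance loss: mean per column (bit), then mean of (mu - 0.5)^2 over the columns.
# x is the 8x4 binary example posted earlier in the thread (8 samples, 4 bits).
x = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]

n_samples = len(x)
n_bits = len(x[0])

# Mean over the batch for each bit, i.e. the equivalent of mean(axis=0).
col_means = [sum(row[j] for row in x) / n_samples for j in range(n_bits)]

# Mean over the bits of the squared deviation from 0.5 gives a single scalar.
loss = sum((mu - 0.5) ** 2 for mu in col_means) / n_bits

print(col_means)   # [0.75, 0.25, 0.5, 0.5]
print(loss)        # 0.03125
```

Only the first two columns contribute here, since the last two are already balanced at 0.5.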

simomaur · June 5, 2018, 9:59pm

one remark:
