Is it possible to backpropagate twice?

When we have 2 output layers and 2 loss functions, we can add the losses together and backpropagate. Is it possible to run backpropagation twice, once for each loss, instead of summing them up and calling loss.backward() on the total?

Yes, you can. You need to pass retain_graph=True to the backward() call of the first loss; otherwise the computational graph is cleared and you get an error when you call backward() on the second loss. I’m imagining that you’d also want the gradients of the two losses to be summed together. In that case, you need to set the grad_req parameter to ‘add’ and manually call zero_grad() on the parameters after the optimization step:

# Set grad_req to 'add' so gradients accumulate with each backward() call
for p in net.collect_params().values():
    p.grad_req = 'add'

# Inside the training loop, after the backward() calls on both losses:
    # Update the weights using the accumulated gradients
    trainer.step(batch_size)

    # Reset the accumulated gradients before the next batch
    for p in net.collect_params().values():
        p.zero_grad()
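
Putting those pieces together, here is a minimal end-to-end sketch of the idea. The toy two-headed network, the random data, and the SGD settings below are placeholders of my own for illustration, not something from this thread:

from mxnet import autograd, gluon, nd

# Toy network: a shared trunk with two output heads
trunk = gluon.nn.Dense(32, activation='relu')
head1 = gluon.nn.Dense(10)
head2 = gluon.nn.Dense(5)
for block in (trunk, head1, head2):
    block.initialize()

# Gather all parameters and accumulate gradients across backward() calls
params = trunk.collect_params()
params.update(head1.collect_params())
params.update(head2.collect_params())
for p in params.values():
    p.grad_req = 'add'

trainer = gluon.Trainer(params, 'sgd', {'learning_rate': 0.1})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

batch_size = 16
data = nd.random.uniform(shape=(batch_size, 20))
label1 = nd.cast(nd.random.uniform(shape=(batch_size,)) * 10, dtype='int32')
label2 = nd.cast(nd.random.uniform(shape=(batch_size,)) * 5, dtype='int32')

with autograd.record():
    features = trunk(data)
    loss1 = loss_fn(head1(features), label1)
    loss2 = loss_fn(head2(features), label2)

# Backpropagate the two losses separately; retain the graph for the second call
loss1.backward(retain_graph=True)
loss2.backward()

# The update sees the summed gradients because grad_req is 'add'
trainer.step(batch_size)

# Reset the accumulated gradients before the next batch
for p in params.values():
    p.zero_grad()

In a real training loop, the record/backward/step/zero_grad part would simply be repeated for every batch.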

Thank you so much for your response! I’ll try this and update whenever I have results.

Is it possible to backpropagate different losses to different layers? For example, my model has 2 output layers, and I want to use loss_1 to optimize output_1 and loss_2 to optimize output_2. How can I do this? Or is this simply impossible? Thank you very much!

You can have as many losses as you want, attached to as many branches of your network as you want. You simply calculate each loss under the autograd.record() scope and call loss1.backward(retain_graph=True) followed by loss2.backward(). Once all the backward() calls have been made, you can call trainer.step().


Thank you for your response! How would I attach one loss to one specific layer of the network?

I’m quite confused about your question. Is this what you’re looking for?

from mxnet import autograd, gluon, nd

# Shared convolutional trunk
net_base = gluon.nn.HybridSequential()
with net_base.name_scope():
    net_base.add(gluon.nn.Conv2D(channels=256, kernel_size=3, layout='NCHW', use_bias=False, activation='relu'))
    net_base.add(gluon.nn.Conv2D(channels=256, kernel_size=3, layout='NCHW', use_bias=False, activation='relu'))

# First output head: 10-class classifier
net1 = gluon.nn.Dense(10)

# Second output head: 100-class classifier
net2 = gluon.nn.HybridSequential()
with net2.name_scope():
    net2.add(gluon.nn.Dense(2048, activation='relu'))
    net2.add(gluon.nn.Dense(100, activation='relu'))

net_base.initialize()
net1.initialize()
net2.initialize()

# Collect the parameters of all three blocks into a single ParameterDict
net_params = net_base.collect_params()
net_params.update(net1.collect_params())
net_params.update(net2.collect_params())

# Accumulate gradients across the two backward() calls
for p in net_params.values():
    p.grad_req = 'add'

trainer = gluon.Trainer(net_params, optimizer='sgd')

ce_loss1 = gluon.loss.SoftmaxCELoss()
ce_loss2 = gluon.loss.SoftmaxCELoss()

data = nd.random.uniform(shape=(16, 3, 100, 100))
label1 = nd.cast(nd.random.uniform(shape=(16,)) * 10, dtype='int32')
label2 = nd.cast(nd.random.uniform(shape=(16,)) * 100, dtype='int32')
with autograd.record():
    out1 = net1(net_base(data))
    out2 = net2(net_base(data))
    loss1 = ce_loss1(out1, label1)
    loss2 = ce_loss2(out2, label2)
# Backpropagate each loss separately; keep the graph alive for the second call
loss1.backward(retain_graph=True)
loss2.backward()
trainer.step(batch_size=16)
# Manually zero the gradients for processing the next batch
for p in net_params.values():
    p.zero_grad()
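
If you want to convince yourself that both losses contribute to the shared trunk, you can inspect a gradient buffer right after the two backward() calls (i.e. before the zero_grad() loop). This check is just an illustration of mine, not part of the recipe above:

# With grad_req set to 'add', the gradient buffer of a shared net_base
# parameter holds the summed contribution of loss1 and loss2 at this point
shared_param = list(net_base.collect_params().values())[0]
print(nd.norm(shared_param.grad()))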

Yeah, sorry, I was being silly. Thank you for being patient and helpful.

Glad I could help 🙂 Also, I fixed the example above to set grad_req to ‘add’.
