About stale gradient

#1

mxnet version: 1.4.0
operating system: linux

Dear all:
I have set ‘grad_req’ to ‘add’ for every Parameter, hoping to accumulate gradients over N batches to work around memory limitations. This is my code:
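For context, the accumulation pattern can be sketched framework-agnostically; in the NumPy toy below (the quadratic loss, N, and learning rate are all illustrative, not the poster's actual code), gradients sum into a buffer across batches, mirroring what grad_req=‘add’ does in Gluon, and one averaged update is applied every N batches:

```python
import numpy as np

# Toy objective: f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([4.0, -2.0])
lr, N = 0.1, 4                     # step once every N micro-batches
grad_acc = np.zeros_like(w)        # plays the role of grad_req='add' buffers

for batch in range(8):
    grad = w.copy()                # "backward pass" for this micro-batch
    grad_acc += grad               # accumulate instead of overwrite
    if (batch + 1) % N == 0:
        w -= lr * (grad_acc / N)   # average over N batches, then update
        grad_acc[:] = 0.0          # zero the buffers, like zeroing grads
```

Note that with grad_req=‘add’ you must zero the gradient buffers yourself after each step, otherwise stale sums leak into the next accumulation window.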

However, I encounter this Warning when I run the training code:

UserWarning: Gradient of Parameter bn0_moving_mean on context gpu(0) has not been updated by backward since last step. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient

My question is: is it my code that leads to this warning? If so, how can I modify my code? If not, what else could cause this warning?

Thanks

#2

And now I am very frustrated because my training loss just doesn’t go down

Any help or advice will be appreciated, thank you guys

#3

This warning means that some parameters in your computation have no effect on the final output, i.e. if those parameters were changed, the final output would not change.
Sometimes this is intentional, but sometimes it indicates a bug (and I think that’s the case with your code), so mxnet shows this message to alert you. You can suppress it by passing ignore_stale_grad=True, i.e. trainer.step(batch_size, ignore_stale_grad=True)

I don’t know how you are defining backbone_net, so I can’t exactly say where the bug is.

And your loss is not going down because your learning_rate is way too high.

#4

Hi, my friend, the loss finally goes down after about 1000 iterations,

but I still cannot figure out the warning problem. Here is how I define my Backbone_net:
I use the gluon.SymbolBlock interface in mxnet, i.e. I load the pretrained feature-extraction network (symbol.json & .params) directly

Thank you for replying, I really appreciate it!

#5

The above code looks okay to me. How do you define gn.Attach_arcloss?

Check out this notebook that explains how to debug gluon code; I think it might be helpful.

BTW welcome to the community.

#6

Try loading model like
gluon.nn.SymbolBlock.imports("./models/r50eir/model-symbol.json", ['data'], "./models/r50eir/model-0001.params", ctx=ctx_)

#7

Thank you bro!!! Here is how I define gn.Attach_arcloss: in this HybridBlock, I just attach an arcloss layer to the backbone feature-extractor net
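For readers unfamiliar with the layer, an arcloss (ArcFace-style) head adds an angular margin m to the true-class angle before scaling the logits. A NumPy sketch of that computation (the function name and the s and m values are illustrative; this is not the poster's actual block):

```python
import numpy as np

def arc_margin_logits(embeddings, weights, labels, s=64.0, m=0.5):
    """ArcFace-style margin: add angular margin m to the true-class angle,
    then scale all logits by s.
    embeddings: (B, D) features, weights: (C, D) class centers, labels: (B,)."""
    # L2-normalize so dot products become cosines of angles
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)       # (B, C) cosine similarities
    theta = np.arccos(cos)
    target = np.zeros(cos.shape, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    # only the true-class logit is penalized with the extra margin
    return s * np.where(target, np.cos(theta + m), cos)
```

The margin makes the true-class logit harder to satisfy, which is what pushes embeddings of the same identity closer together on the hypersphere.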

#9

I made the change to gluon.SymbolBlock.imports, but the warning still exists

#10

Anyone got any ideas??? T_T

#11

Sorry brother. I will try to find a solution.

#12

Thank you soooooo much bro, you are my hero!!

#13

Hey, bro!!! GOOD NEWS!!! I think I figured out the problem!!! The ‘grad_req’ of ‘bn0_moving_mean’ is ‘null’, which means it does not need a gradient:


However, I wrote this line in my training code:

which forcibly changes all ‘grad_req’ to ‘add’ and hence leads to the warning. After I modified my training code like this:

The warning is gone :grin::grin::grin:
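The fix described above can be sketched as follows; the Param class is just a stand-in for Gluon parameter objects, and the parameter names are illustrative. The point is to skip parameters whose grad_req is already ‘null’ instead of blanket-setting everything to ‘add’:

```python
class Param:
    """Minimal stand-in for a gluon Parameter with a grad_req attribute."""
    def __init__(self, name, grad_req):
        self.name = name
        self.grad_req = grad_req

# toy parameter dict, mimicking what net.collect_params() returns
params = {
    'conv0_weight':    Param('conv0_weight', 'write'),
    'bn0_gamma':       Param('bn0_gamma', 'write'),
    'bn0_moving_mean': Param('bn0_moving_mean', 'null'),  # running stat, no gradient
}

# The buggy version set grad_req = 'add' unconditionally; the fix leaves
# 'null' parameters alone, so they are never reported as stale.
for p in params.values():
    if p.grad_req != 'null':
        p.grad_req = 'add'
```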

#14

Finally… Well done. I think your loss will go down a bit faster now.

#15

Yeah… maybe, but I don’t think it will make a difference, because ‘moving_mean’ and ‘moving_var’ in a batchnorm layer don’t need gradients, so ‘null’ or ‘add’ won’t make a difference… Emmm… am I right?

#16

Yep, you’re absolutely right. The moving statistics are not learned through the loss at all; they are running averages updated during the forward pass and used at inference, so they never receive gradients.
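For concreteness, the running statistics the posts refer to are maintained by an exponential moving average during the forward pass, which is why they carry no gradient. A NumPy sketch of the update (the momentum value is illustrative):

```python
import numpy as np

def update_running_stats(moving_mean, moving_var, batch, momentum=0.9):
    """Exponential-moving-average update used for BatchNorm running stats.
    No gradient is involved: the stats are blended in during forward."""
    batch_mean = batch.mean(axis=0)
    batch_var = batch.var(axis=0)
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * moving_var + (1.0 - momentum) * batch_var
    return new_mean, new_var
```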

#17

Yeah bro, nice to have you on this issue, thank you for the company!!!
Looking forward to sharing opinions with you again!!!

#18

Thanks, I appreciate it.