How to diagnose NaN values when training a model

#1

Hello, I am quite new to neural networks in general. I am currently participating in a reinforcement learning challenge to train a Bomberman agent. I am quite happy with the agent's performance so far; during the last training rounds it made quick progress. But now I am at a point where, no matter what I do, every training round ends up producing NaN values, and I am not sure how to diagnose this further.

Just in case you’re interested, the agent code is here: https://github.com/cs224/bomberman_rl/tree/master/agent_code/agent_011_shred
There is also a visualization of the network structure here: https://nbviewer.jupyter.org/github/cs224/bomberman_rl/blob/master/agent_code/agent_011_shred/0000-network-structure-visualization.ipynb?flush_cache=true
And a youtube video of the current state of the agent is here: https://youtu.be/bC2APj4xf_0

I already tried different learning rates, different optimizers (Adam, SGD, …) and different loss functions (L2, HuberLoss). But no matter what, whenever I run a training round now I end up with NaN values in the “output = self.model(x_nf, x_nchw)” step here: https://github.com/cs224/bomberman_rl/blob/master/agent_code/agent_011_shred/model/model_base_mx.py#L783
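For reference, the kind of fail-fast check I could put right after that call would look roughly like this (just a sketch: it pulls the output to numpy and raises as soon as anything goes bad):

```python
import numpy as np

output = self.model(x_nf, x_nchw)
# Copy the output off the device and fail fast on NaN/Inf entries
out = output.asnumpy()
if np.isnan(out).any() or np.isinf(out).any():
    raise ValueError("NaN/Inf detected in model output")
```

But that only tells me *that* it happened, not *why*.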

Any ideas or suggestions on how to go forward? The only next idea I have is to replace the ELU activation with ReLU, but that would require retraining the whole network from scratch.

Thanks!
Christian

#2

The NaN values could be caused by exploding gradients. Debugging your code with conditional breakpoints may help you understand what is causing them. Here is a great tutorial on how to debug MXNet models: https://www.youtube.com/watch?v=6-dOoJVw9_0
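For example, a conditional breakpoint that only fires in the iteration where NaNs first appear could look roughly like this (a sketch with placeholder names for the network and the batch):

```python
import numpy as np

def contains_nan(nd_array):
    # Copy the NDArray to host memory and test every entry
    return np.isnan(nd_array.asnumpy()).any()

# Inside the training loop:
output = net(data)
if contains_nan(output):
    import pdb; pdb.set_trace()  # drop into the debugger only when NaNs show up
```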

#3

Thank you very much for the link to the video. I actually watched it earlier already, and it helped me debug network structure issues. But I do not see how it would help me decide whether I am running into an “exploding gradients” problem. Going through all the layers of my network by hand and looking at the gradients is infeasible for larger networks.

I was wondering whether there are some “statistical checks” I could run, e.g. based on the distribution of parameters or gradients. I also had a look at mxboard (https://github.com/awslabs/mxboard) and started to use it. There is also an example that shows how to log the gradients of the network: https://github.com/awslabs/mxboard/blob/master/examples/mnist/train_mnist.py#L134
But without knowing what looks “normal” vs. what looks “abnormal” I am not able to make much use of these graphs.
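What I have in mind is something like the following: after loss.backward(), compute the gradient norm of every parameter and flag outliers, on the assumption that healthy norms stay within a stable range while exploding gradients grow by orders of magnitude from one iteration to the next. A rough sketch (assuming a Gluon block `net` on a single context):

```python
import mxnet as mx

# After loss.backward(): print each parameter's gradient norm and flag suspicious ones
for name, param in net.collect_params().items():
    if param.grad_req == 'null':
        continue
    g = param.grad()                       # gradient NDArray on the parameter's context
    g_norm = mx.nd.norm(g).asscalar()
    if g_norm != g_norm or g_norm > 1e3:   # g_norm != g_norm is only true for NaN
        print('suspicious gradient in %s: norm=%g' % (name, g_norm))
```

But I am unsure what thresholds would be reasonable.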

Even if I detected an “exploding gradients” problem in a given layer, I would not know what to do about it. Is there some “cookbook” that walks you through a whole “exploding gradients” problem, from looking at the parameters through to resolving it?
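For example, is the standard fix simply to clip the gradients, roughly like this (a sketch with placeholder hyperparameters; `clip_gradient` clips each gradient element-wise in the optimizer), or is there more to it?

```python
from mxnet import gluon

# Element-wise gradient clipping via the optimizer settings
trainer = gluon.Trainer(net.collect_params(), 'adam',
                        {'learning_rate': 1e-4, 'clip_gradient': 1.0})
```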

Thanks a lot!
Christian