How to diagnose NaN values when training a model

Hello, I am quite new to neural networks in general. I am currently participating in a reinforcement learning challenge to train a Bomberman agent. I am quite happy with the performance of my agent so far. During the last training rounds it made quick progress. But now I am at a point where, no matter what I do, the agent ends up producing NaN values whenever I run a training round. I am not sure how to diagnose this further.

Just in case you’re interested, the agent code is here: https://github.com/cs224/bomberman_rl/tree/master/agent_code/agent_011_shred
There is also a visualization of the network structure here: https://nbviewer.jupyter.org/github/cs224/bomberman_rl/blob/master/agent_code/agent_011_shred/0000-network-structure-visualization.ipynb?flush_cache=true
And a YouTube video of the current state of the agent is here: https://youtu.be/bC2APj4xf_0

I have already tried different learning rates, different optimizers (Adam, SGD, …) and different loss functions (L2, HuberLoss). But no matter what I change, whenever I now run a training round I end up with NaN values in the “output = self.model(x_nf, x_nchw)” step here: https://github.com/cs224/bomberman_rl/blob/master/agent_code/agent_011_shred/model/model_base_mx.py#L783

Any ideas or suggestions on how to move forward? The only other idea I have is to replace the ELU activation with ReLU, but this would require retraining the whole network from scratch.

Thanks!
Christian

The NaN values could be caused by exploding gradients. Debugging your code with conditional breakpoints may help to understand what is causing the NaN values. Here is a great tutorial on how to debug MXNet models: https://www.youtube.com/watch?v=6-dOoJVw9_0
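To make the conditional-breakpoint idea a bit more concrete, one option is to scan the inputs, parameters and gradients after each step and stop at the first array that contains NaN or Inf. The following is only a minimal, library-agnostic sketch in plain Python: `first_non_finite` and the layer names are hypothetical, and it assumes each array has already been flattened to an iterable of floats (for an MXNet NDArray, something like `arr.asnumpy().ravel()`).

```python
import math

def first_non_finite(named_arrays):
    """Return (name, index, value) of the first NaN/Inf entry, or None.

    named_arrays: iterable of (name, flat iterable of floats) pairs,
    in the order the values are produced (inputs, weights, gradients, ...).
    """
    for name, values in named_arrays:
        for i, v in enumerate(values):
            if not math.isfinite(v):
                return name, i, v
    return None

# Hypothetical example data standing in for real layer values.
snapshot = [
    ("conv0_weight", [0.1, -0.3, 0.7]),
    ("dense0_weight", [1.5, float("inf"), 0.2]),
    ("output", [float("nan"), 0.0]),
]

hit = first_non_finite(snapshot)
if hit is not None:
    name, index, value = hit
    # A conditional breakpoint could go here, e.g. `breakpoint()`.
    print(f"first non-finite value in {name}[{index}]: {value}")
```

Checking the input batch first, then the parameters, then the gradients tends to narrow down whether the bad value arrives with the data, is already stored in the weights, or first shows up in the backward pass.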

Thank you very much for the link to the video. I had actually watched it earlier, and it helped me debug network structure issues. But I do not see how it would help me decide whether I am running into an “exploding gradients” problem. Going through all layers of my network by hand and looking at the gradients is infeasible for larger networks.

I was wondering whether there are some “statistical checks” that I could run, e.g. based on the distribution of parameters or gradients. I also had a look at mxboard: https://github.com/awslabs/mxboard and started to use it. There is also an example that shows how to log the gradients in the network: https://github.com/awslabs/mxboard/blob/master/examples/mnist/train_mnist.py#L134
But without knowing what looks “normal” vs. what looks “abnormal” I am not able to make a lot of use out of these graphs.

Even if I detected an “exploding gradients” problem in a given layer, I would not know what to do about it. Is there some “cookbook” that walks you through a whole “exploding gradients” problem, from looking at the parameters through to resolving the problem?

Thanks a lot!
Christian
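
On the “statistical checks” question: one commonly used summary is the L2 norm of each layer's gradient plus the global norm over all layers, logged every step. A jump of several orders of magnitude shortly before the NaN appears would point towards exploding gradients, and the usual first remedy is to rescale the gradients so that the global norm stays below a threshold. The sketch below is plain Python over flat lists of floats, not MXNet code; `grad_norms`, `clip_by_global_norm`, the layer names and the threshold of 10.0 are all assumptions for illustration. In MXNet Gluon, `gluon.utils.clip_global_norm` performs the clipping step on a list of NDArrays.

```python
import math

def grad_norms(named_grads):
    """Return ({name: L2 norm}, global L2 norm) for flat gradient lists."""
    per_layer = {name: math.sqrt(sum(g * g for g in grads))
                 for name, grads in named_grads.items()}
    global_norm = math.sqrt(sum(n * n for n in per_layer.values()))
    return per_layer, global_norm

def clip_by_global_norm(named_grads, max_norm):
    """Rescale all gradients so their global L2 norm is at most max_norm."""
    _, global_norm = grad_norms(named_grads)
    if not math.isfinite(global_norm):
        raise ValueError("gradients already contain NaN or Inf")
    # Scale is 1.0 (no change) when the norm is already below max_norm.
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return {name: [g * scale for g in grads]
            for name, grads in named_grads.items()}

# Hypothetical gradients: one layer is far larger than the other.
grads = {
    "conv0_weight": [0.01, -0.02],
    "dense0_weight": [300.0, -400.0],
}
per_layer, total = grad_norms(grads)
for name, norm in per_layer.items():
    print(f"{name}: {norm:.4g}")

clipped = clip_by_global_norm(grads, max_norm=10.0)
_, clipped_total = grad_norms(clipped)
print(f"global norm before: {total:.4g}, after: {clipped_total:.4g}")
```

Because the per-layer norms are computed anyway, they can be written to mxboard as scalars; what counts as “abnormal” is then mostly a matter of comparing each layer against its own history rather than against a fixed number.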