My training code using hybrid blocks works fine on the CPU. However, when I switch it to run on the GPU, I get "nan" for the value of the loss.
I investigated a bit deeper and narrowed the issue down to the squaring operation (i.e., **2) on the GPU. Here is the value of (outY - batchY)**2 on the GPU:
[[ nan         5.48341084  nan         ...,  1.42631352  nan         0.01532905]
 [ 2.85896707  nan         nan         ...,  nan         0.68346757  nan       ]]
So, the squaring operation is introducing the "nan"s on the GPU. I would have expected some small numerical differences between CPU and GPU due to precision, but I am surprised to see "nan"s when squaring these otherwise normal numbers.
Actually, it is now clear what the bug is: squaring negative numbers with the "**2" operator produces "nan"s on the GPU. Squaring any tensor that contains negative values on the GPU should reproduce the issue. I was able to work around it by implementing squaring as "x * x" instead of "x ** 2".
Could this be specific to the platform / MXNet version / CUDA version / GPU?
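My guess (purely an assumption, not confirmed against the MXNet/CUDA sources) is that the GPU kernel implements pow via the exp/log identity, which is undefined for negative bases even when the exponent is an integer. That failure mode is easy to demonstrate with NumPy:

```python
import numpy as np

x = np.array([-2.0, 3.0])

# pow computed via the identity x**y = exp(y * log(x)):
# log of a negative number is nan, so the result is nan for x < 0.
with np.errstate(invalid="ignore"):
    via_explog = np.exp(2 * np.log(x))

direct = x ** 2  # direct squaring is fine for any sign

print(via_explog)  # [nan  9.]
print(direct)      # [4.  9.]
```

If the GPU `**` path does something like this, it would explain why `x * x` avoids the problem.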
These are what I have:
Platform: Windows
MXNet version: whatever gets installed by "pip install mxnet-cu90" into an Anaconda distribution.
CUDA: 9.0
GPU: GeForce GTX 1080