GPU version producing nan for loss

My training code using hybrid blocks works fine on the CPU. However, when I switch it to run on the GPU, I get “nan” for the value of the loss.

Any idea about the cause/fix?

Here is the relevant part of code:

with autograd.record():            
    outY = network(*args)
    totalLoss = nd.mean((outY - batchY)**2) + network.cost()
totalLoss.backward()
trainer.step(1)
lossValue = totalLoss.asnumpy()

The “lossValue” comes out as nan on the GPU but is a valid number on the CPU.
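A check along these lines makes the difference easy to see (a rough sketch reusing the variables from the snippet above; np.isnan is plain NumPy):

import numpy as np

if np.isnan(lossValue).any():
    diff = outY - batchY
    print(diff.asnumpy())         # finite on both CPU and GPU
    print((diff ** 2).asnumpy())  # contains nan values on the GPU only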

I investigated a bit deeper and narrowed the issue down to the squaring operation (i.e., **2) on the GPU. Here is the value of (outY - batchY) with the CPU:

[[-0.52970028  2.34094644 -2.03111124 ...,  1.19498491 -0.29986447
   0.12365768]
 [ 1.69178033 -1.94142985 -1.00713968 ..., -0.72571653  0.82778275
  -1.31922102]
....

Value of (outY - batchY)**2 with the CPU:

[[  2.80582398e-01   5.48003006e+00   4.12541294e+00 ...,   1.42798889e+00
    8.99187028e-02   1.52912224e-02]
 [  2.86212063e+00   3.76914978e+00   1.01433039e+00 ...,   5.26664495e-01
    6.85224295e-01   1.74034405e+00]
...

Value of (outY - batchY) with the GPU:

[[-0.52780288  2.34166837 -2.03344011 ...,  1.1942836  -0.29883277
   0.12381052]
 [ 1.69084799 -1.94114697 -1.00002015 ..., -0.72588837  0.82672089
  -1.32586646]

Value of (outY - batchY)**2 with the GPU:

[[        nan  5.48341084         nan ...,  1.42631352         nan
   0.01532905]
 [ 2.85896707         nan         nan ...,         nan  0.68346757
          nan]

So the squaring operation is introducing the “nan”s on the GPU. I would have expected some small differences between CPU and GPU due to precision issues, but I am surprised to see “nan”s from squaring these “normal” numbers; for example, (-2.03344011)**2 is only about 4.13, nowhere near the limits of float32, so overflow cannot explain it.

This is with CUDA version 9.0 on Windows.

Any idea why this is happening?

Can you provide a reproducer script and file a GitHub issue? This looks like a bug.

Actually, it is clear what the bug is: squaring negative numbers with the “**2” operation is producing “nan”s. Doing something like the code above should reproduce the issue. I was able to work around it by implementing the squaring as “x * x” instead of “x ** 2”.
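Applied to the training snippet above, the workaround looks roughly like this (same variables as before):

with autograd.record():
    outY = network(*args)
    diff = outY - batchY
    # square with an explicit multiply instead of **2 to avoid the nan on the GPU
    totalLoss = nd.mean(diff * diff) + network.cost()
totalLoss.backward()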

It seems to be more involved.
The simple code below does not reproduce the problem on a p2 instance.
import mxnet as mx
from mxnet import nd
from mxnet import gpu

def run(x):
    return [x**2]

x = nd.array([[-0.52780288,  2.34166837, -2.03344011,  1.1942836,  -0.29883277,  0.12381052],
              [ 1.69084799, -1.94114697, -1.00002015, -0.72588837,  0.82672089, -1.32586646]])
y = x**2              # square on the CPU
print(y)
x1 = x.copyto(gpu(0))
y1 = run(x1)          # square on the GPU
print(y1)
(mxnet_p36) ubuntu@ip-172-31-94-140:~$ python3 test.py

[[ 0.2785759 5.48341084 4.13487864 1.42631328 0.08930103 0.01532905]
[ 2.85896683 3.76805162 1.00004029 0.52691394 0.68346745 1.75792181]]
<NDArray 2x6 @cpu(0)>
[
[[ 0.2785759 5.48341036 4.13487864 1.4263134 0.08930103 0.01532905]
[ 2.85896683 3.76805162 1.00004029 0.52691394 0.68346745 1.75792181]]
<NDArray 2x6 @gpu(0)>]

Can you provide more details?

Could it be specific to the platform/MXNet version/CUDA version/GPU?

This is what I have:
Platform: Windows
MXNet version: whatever gets installed by “pip install mxnet-cu90” into an Anaconda distribution
CUDA: 9.0
GPU: GeForce GTX 1080

I can confirm this issue on a Windows platform.

Platform: Windows 7
MXNet version: 1.1.0 (from “pip install mxnet-cu80”)
CUDA: 8.0
GPU: Quadro M1000M
Python version: 3.5.2

Here is a simple example: https://gist.github.com/rrblogdatascience/08d922d32ff0e8c4f4de84dc8e9dd666