Defining y = 1/(1+exp(x)) and using auto differentiation to compute dy/dx easily produces NaN when x is large; see the example below.
I think I understand why (infinity divided by infinity in the backward pass?), but it is a real hassle for many kinds of loss calculation, since the derivative of y itself is well defined even for fairly large x.
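To illustrate the mechanism, here is a small numpy sketch (an assumption about what autograd does internally, not mxnet code) of why the forward pass survives but the backward pass does not: in float32, exp(x) overflows to inf for x above roughly 88.7, so y = 1/inf is a clean 0, while the chain rule for y = 1/(1+e) needs e/(1+e)^2, which is inf/inf:

```python
import numpy as np

# float32 exp overflows for x greater than ~88.7, so exp(100.) is already inf
x = np.float32(100.)
with np.errstate(over="ignore", invalid="ignore"):
    e = np.exp(x)              # inf
    y = 1.0 / (1.0 + e)        # 1/inf = 0, so the forward pass looks fine
    # the chain rule for y = 1/(1+e) involves e/(1+e)^2, i.e. inf/inf
    dy = -e / (1.0 + e) ** 2   # inf/inf -> nan
print(e, y, dy)
```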
I am using mxnet 1.5.1.
****** simple code ***************
import numpy as np
import mxnet as mx
from mxnet import nd, autograd
x = nd.array([-100000., 0., 1., 100.])
x.attach_grad()
with autograd.record():
    y = 1. / (1. + nd.exp(x))  # indented so it is recorded by autograd
y.backward()
dx = x.grad.asnumpy()
print(dx)
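One numerically stable workaround is to use the identity 1/(1+exp(x)) = sigmoid(-x): the stable sigmoid never exponentiates a positive argument, and its analytic derivative -sigmoid(-x)*(1-sigmoid(-x)) never forms inf/inf. Below is a minimal numpy sketch of that formulation (in mxnet the analogous change would be y = nd.sigmoid(-x), which lets the built-in op handle the backward pass):

```python
import numpy as np

def sigmoid(z):
    # numerically stable sigmoid: only ever calls exp on a non-positive value
    z = np.asarray(z, dtype=np.float64)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # exp(-z) <= 1 here
    ez = np.exp(z[~pos])                        # exp(z) < 1 here
    out[~pos] = ez / (1.0 + ez)
    return out

x = np.array([-100000., 0., 1., 100.])
y = sigmoid(-x)            # same value as 1/(1+exp(x)), without overflow
dy = -y * (1.0 - y)        # analytic derivative, finite for every x
print(dy)
```

With this formulation the gradient at x = 100 is a tiny negative number instead of NaN, and the extreme inputs just saturate to 0.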