Defining y = 1/(1+exp(x)) and computing dy/dx with automatic differentiation easily produces NaN when x is large; see the example below.

While I think I understand why (infinity divided by infinity?), this is a hassle for many kinds of loss calculation. The derivative of y is well defined even for fairly large x.
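
To make the "well defined" claim concrete, the derivative can be written out by hand. This is only a sketch of the math, not a statement about how MXNet evaluates the graph internally; the float32 overflow threshold assumes the default dtype of `nd.array`.

```latex
\[
y = \frac{1}{1 + e^{x}}, \qquad
\frac{dy}{dx} = -\frac{e^{x}}{\left(1 + e^{x}\right)^{2}} = -\,y\,(1 - y).
\]
% For large x the closed form -y(1-y) tends smoothly to 0.
% Evaluated literally in float32, e^{x} overflows to +inf once x exceeds
% roughly 88.7, so the middle expression becomes inf/inf, which is NaN.
```

The rightmost form uses only y, which stays in [0, 1], so it never needs the overflowing intermediate.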

I am using mxnet 1.5.1.

# ****** simple code ***************

import numpy as np

import mxnet as mx

from mxnet import nd, autograd

# x = 100 is large enough for exp(x) to overflow in float32
x = nd.array([-100000., 0., 1., 100.])

x.attach_grad()

with autograd.record():

    y = 1. / (1. + nd.exp(x))

y.backward()

dx = x.grad.asnumpy()

print(dx)
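
For comparison, here is a minimal pure-Python sketch of a numerically stable way to evaluate the same function and its derivative. It does not use MXNet; the names `stable_y` and `stable_dy_dx` are hypothetical helpers introduced only for illustration. The idea is to never exponentiate a positive number, and to use dy/dx = -y(1-y) instead of the raw quotient.

```python
import math


def stable_y(x):
    """y = 1 / (1 + exp(x)), evaluated without overflowing exp()."""
    if x >= 0:
        # Rewrite as exp(-x) / (1 + exp(-x)); exp(-x) <= 1 here.
        z = math.exp(-x)
        return z / (1.0 + z)
    # For x < 0, exp(x) < 1, so the direct form is safe.
    return 1.0 / (1.0 + math.exp(x))


def stable_dy_dx(x):
    """dy/dx = -y * (1 - y), which only involves values in [0, 1]."""
    y = stable_y(x)
    return -y * (1.0 - y)


if __name__ == "__main__":
    # Same inputs as the MXNet example above.
    for x in [-100000.0, 0.0, 1.0, 100.0]:
        print(x, stable_dy_dx(x))
```

Since y here equals sigmoid(-x), replacing the expression with `nd.sigmoid(-x)` inside `autograd.record()` may avoid the NaN in MXNet as well, assuming the sigmoid operator computes its gradient from its output; I have not verified this on 1.5.1.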