An illegal memory access

I have been using MXNet (1.6.0) for face recognition. During otherwise normal training, it unexpectedly reported the following error after 2 epochs:

Traceback (most recent call last):
 File "train_0723.py", line 455, in <module>
   main()
 File "train_0723.py", line 451, in main
   train_net(args)
 File "train_0723.py", line 445, in train_net
   epoch_end_callback=epoch_cb)
 File "/home/user1/recognition/parall_module_local_v1_gluon_group.py", line 573, in fit
   self.update()
 File "/home/user1/recognition/parall_module_local_v1_gluon_group.py", line 406, in update
   mx.nd.waitall()
 File "/home/user1/miniconda3/lib/python3.7/site-packages/mxnet/ndarray/ndarray.py", line 200, in waitall
   check_call(_LIB.MXNDArrayWaitAll())
 File "/home/user1/miniconda3/lib/python3.7/site-packages/mxnet/base.py", line 255, in check_call
   raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:32:38] /home/ubuntu/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess: CUDA: an illegal memory access was encountered
Stack trace:
 [bt] (0) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x6b41eb) [0x7f76131a51eb]
 [bt] (1) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37b2742) [0x7f76162a3742]
 [bt] (2) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37e3515) [0x7f76162d4515]
 [bt] (3) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37bf6d1) [0x7f76162b06d1]
 [bt] (4) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37c2c10) [0x7f76162b3c10]
 [bt] (5) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37c2ea6) [0x7f76162b3ea6]
 [bt] (6) /home/user1/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x37bde84) [0x7f76162aee84]
 [bt] (7) /home/user1/miniconda3/bin/../lib/libstdc++.so.6(+0xc8421) [0x7f76aca9d421]
 [bt] (8) /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f76bb1f0609]

I haven't found any clue to solving this error after googling. The only change I have made is decreasing my batch_size from 400 to 360, and I am not sure whether the error will occur again, so I am still worried about it.

@Karl Do you have a repro script?
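
If the failure is hard to reproduce, it may help to run with synchronous execution first. MXNet executes operators asynchronously, so the traceback above only shows where the error was noticed (`mx.nd.waitall()`), not necessarily which operator caused it. Below is a minimal sketch, assuming the `MXNET_ENGINE_TYPE` and `CUDA_LAUNCH_BLOCKING` environment variables behave as documented for your build. The helper name `enable_sync_debug_env` is hypothetical, not part of MXNet, and training will likely run much slower with these settings.

```python
import os


def enable_sync_debug_env(env=None):
    """Set environment variables that request synchronous execution.

    Hypothetical helper for debugging only. Call it BEFORE `import mxnet`,
    because the engine type is read when the library is loaded.
    """
    env = os.environ if env is None else env
    # Ask MXNet to use its synchronous engine instead of the threaded one.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Ask the CUDA runtime to make kernel launches blocking.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    enable_sync_debug_env()
    # import mxnet as mx   # import only after the variables are set
    # ... then run the usual training entry point, e.g. main() in train_0723.py
    print(os.environ["MXNET_ENGINE_TYPE"], os.environ["CUDA_LAUNCH_BLOCKING"])
```

With these set, the error, if it recurs, should be raised closer to the operator that triggered it, which may make it easier to cut the training script down to a small repro.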