Inconsistent results on GPU

MXNet: 1.6.0 (cu102) & 1.5.1 (cu101)
GPUs: RTX 2080 & Tesla K40m
Models used: VGG16, Resnet18_v1, Mobilenet1_0

Hello,
I trained the models listed above from scratch (not pre-trained). When they are trained on the CPU, I get consistent results (the sum of all weights is identical across runs) since the random seed is fixed. However, to my surprise, the results are inconsistent when the same models are trained on the GPU.

On the GPU, the only model I can get a reproducible result from is VGG16, and only with MXNET_CUDNN_AUTOTUNE_DEFAULT=0 set; the results for the others (Resnet18_v1, Mobilenet1_0) still vary from run to run.
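
For clarity, that VGG16 result only holds with the autotuning flag set at the very top of the script (this is the same line that appears in the full test code below):

import os
# Turn off cuDNN convolution autotuning so the algorithm choice does not
# change from run to run; with this set, VGG16 reproduces for me, the other models still do not.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx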

Additionally, I found that mobilenet (gluon model_zoo mobilenet.py) produces the same results in each of the following cases (a minimal isolation sketch follows the list):

  1. Comment out Conv2D (line 50), or
  2. Comment out BatchNorm (line 51), or
  3. Comment out RELU6 (line 53), or
  4. Comment out _add_conv_dw (line 126)
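
To narrow this down outside of the model_zoo file, I also use a minimal isolation sketch of my own (not the model_zoo code itself; the plain ReLU below is a stand-in for the RELU6 block), which rebuilds roughly the Conv2D -> BatchNorm -> activation sequence that _add_conv() in gluon/model_zoo/vision/mobilenet.py sets up. Running this script twice and comparing the printed sums shows whether this single block is already non-deterministic on the GPU:

import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

mx.random.seed(5)
np.random.seed(5)
ctx = mx.gpu()

# Roughly the same Conv2D -> BatchNorm -> activation sequence as _add_conv()
# in gluon/model_zoo/vision/mobilenet.py (plain ReLU instead of RELU6).
block = gluon.nn.HybridSequential()
with block.name_scope():
    block.add(gluon.nn.Conv2D(32, kernel_size=3, strides=2, padding=1, use_bias=False))
    block.add(gluon.nn.BatchNorm(scale=True))
    block.add(gluon.nn.Activation('relu'))
block.initialize(ctx=ctx)

# One training step on random input.
x = mx.nd.random.uniform(shape=(256, 3, 32, 32), ctx=ctx)
trainer = gluon.Trainer(block.collect_params(), 'sgd', {'learning_rate': 0.01})
with autograd.record():
    out = block(x)
    loss = out.sum()
loss.backward()
trainer.step(x.shape[0])

# Print the weight sum; if this differs between two identical runs,
# the non-determinism already appears in this single conv/BN block.
total = mx.nd.zeros(1, ctx)
for p in block.collect_params().values():
    total += p.data().sum()
print(total)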

My test code for this issue is below.

import os
# Environment settings intended to make GPU runs reproducible:
# disable cuDNN autotuning and use a single copy of the GPU RNG / cuDNN dropout state.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
os.environ['MXNET_GPU_CUDNN_DROPOUT_STATE_COPY'] = '1'
os.environ['MXNET_GPU_PARALLEL_RAND_COPY'] = '1'

import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

def trans(data, label):
    # Scale pixels to [0, 1] and move channels first: HWC uint8 -> CHW float32
    return mx.nd.transpose(data.astype(np.float32), (2, 0, 1)) / 255, label.astype(np.uint8)

def sum_of_weight(net):
    # Sum every parameter of the network into a single scalar;
    # this is the value I compare between runs.
    params = net.collect_params()
    accu = mx.nd.zeros(1, ctx)
    for key in params.keys():
        accu += mx.nd.sum(params[key].data())
    print(accu)

def test_model(net, dataloader):
    trainer = gluon.Trainer(net.collect_params(), optimizer="sgd", optimizer_params={'learning_rate': 0.01})
    loss = gluon.loss.SoftmaxCrossEntropyLoss()

    # Train for only a few batches; that is already enough to see the divergence.
    stop = 5
    for i, (data, label) in enumerate(dataloader):
        if i == stop:
            break

        with autograd.record():
            losses = loss(net(data.copyto(ctx)), label.copyto(ctx))
        losses.backward()
        trainer.step(batch_size)

    sum_of_weight(net)



mx.random.seed(5)
np.random.seed(5)
batch_size = 256
ctx = mx.gpu()

# Build and initialise the network, run one dummy forward pass to create the
# parameters, then load CIFAR-10.
net = mx.gluon.model_zoo.vision.mobilenet1_0() # resnet18_v1, vgg16
net.initialize(ctx=ctx)
net(mx.nd.zeros((1, 3, 32, 32), ctx=ctx))
data = gluon.data.DataLoader(gluon.data.vision.CIFAR10(train=True, transform=trans), batch_size)


test_model(net, data)
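
For comparison, the CPU runs that do reproduce use exactly this script, rerun with only the context switched; a sketch of the changed lines (everything else, including the seeds and the trans/test_model definitions above, stays the same):

# CPU control: identical setup, only the context differs.
ctx = mx.cpu()

net = mx.gluon.model_zoo.vision.mobilenet1_0()
net.initialize(ctx=ctx)
net(mx.nd.zeros((1, 3, 32, 32), ctx=ctx))
data = gluon.data.DataLoader(gluon.data.vision.CIFAR10(train=True, transform=trans), batch_size)

test_model(net, data)  # prints the same weight sum on every run here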

If the seed is completely fixed, the results of this code should be the same on every run. However, on the GPU, the printed weight sum keeps changing. Please help me solve this problem. Thank you.