Gluon val accuracy VERY different on CPU than GPU?

Hi,

I’m training a resnet18v2 from gluon model zoo on FashionMNIST.

On ctx = mx.gpu(0), the val accuracy is 88%

On ctx = mx.cpu(), the val accuracy is 30%

Training accuracy is about the same for both. I literally don’t change anything in the code apart from the ctx above.

The platform is a SageMaker p3.2xl notebook with MXNet 1.4.0.

Below are the print statements from my training loop:

On GPU:

Epoch 0 Acc 0.7936166666666666
17.39394474029541

Epoch 1 Acc 0.8770833333333333
14.829176187515259

Epoch 2 Acc 0.8936
14.931753396987915

('accuracy', 0.8829)
CPU times: user 2min 6s, sys: 1min 41s, total: 3min 47s
Wall time: 1min 23s

On CPU:

Epoch 0 Acc 0.7714333333333333
148.0548324584961

Epoch 1 Acc 0.8716333333333334
147.9010841846466

Epoch 2 Acc 0.8872
147.93051600456238

('accuracy', 0.3004)

Since training accuracy is the same on CPU and GPU, I suspect the issue is in the validation function, which is this one:

import mxnet as mx
from mxnet import gluon

def test(ctx, net, test_data):
    """Return the (name, value) accuracy of net over test_data on ctx."""
    metric = mx.metric.Accuracy()

    for batch in test_data:
        # move this batch's data and labels to ctx (a one-element list per batch)
        data = gluon.utils.split_and_load(data=batch[0], ctx_list=[ctx], batch_axis=0)
        label = gluon.utils.split_and_load(data=batch[1], ctx_list=[ctx], batch_axis=0)

        # populate prediction (outputs) and actuals (label) for this batch
        outputs = [net(x) for x in data]
        metric.update(label, outputs)

    return metric.get()

What is going on?

Very strange! Can you try with identical models (i.e. the same parameters), one with params on mx.cpu() and the other on mx.gpu(0)? I don’t see anything obviously wrong with your test function.

It seems that there is a bug in the BatchNorm operator when training on CPU :no_mouth:
https://github.com/apache/incubator-mxnet/issues/14357
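For anyone wondering how a BatchNorm problem could leave training accuracy intact while validation accuracy collapses, here is a rough pure-Python sketch. It is not MXNet's implementation, and ToyBatchNorm and its update_running flag are made up for illustration. The thread does not say how the bug actually manifests; the sketch simply assumes the running statistics end up wrong. It relies on one thing that does hold for Gluon: in training mode BatchNorm normalizes with the current batch's statistics, while outside training mode (as in the test function above) it uses the stored running mean and variance.

```python
# Toy 1-D batch normalization in pure Python (illustration only, not MXNet code).

def batch_stats(batch):
    """Return (mean, population variance) of a list of floats."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return mean, var


def normalize(batch, mean, var, eps=1e-5):
    """Shift and scale each value using the given mean and variance."""
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


class ToyBatchNorm:
    """Hypothetical layer: tracks running statistics like a BatchNorm layer does."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.running_mean = 0.0
        self.running_var = 1.0

    def forward_train(self, batch, update_running=True):
        # Training mode: normalize with the statistics of this batch.
        mean, var = batch_stats(batch)
        if update_running:
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        return normalize(batch, mean, var)

    def forward_eval(self, batch):
        # Inference mode: normalize with the stored running statistics.
        return normalize(batch, self.running_mean, self.running_var)


batch = [40.0, 45.0, 50.0, 55.0, 60.0]

healthy = ToyBatchNorm()
broken = ToyBatchNorm()  # assumed failure mode: running stats never updated

for _ in range(200):
    train_out_healthy = healthy.forward_train(batch)
    train_out_broken = broken.forward_train(batch, update_running=False)

eval_out_healthy = healthy.forward_eval(batch)
eval_out_broken = broken.forward_eval(batch)

print("train outputs identical:", train_out_healthy == train_out_broken)
print("healthy eval:", eval_out_healthy)
print("broken eval: ", eval_out_broken)
```

Both layers produce the same outputs in training mode, so training accuracy would look fine either way. Only the inference-mode outputs differ, and the later layers would then see inputs on a scale they were never trained on. If that is what is happening here, comparing the BatchNorm running_mean and running_var parameters of a CPU-trained and a GPU-trained net would be a reasonable next check.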
