Hi,
I’m training a resnet18v2 from the Gluon model zoo on FashionMNIST.
With ctx = mx.gpu(0), the validation accuracy is 88%.
With ctx = mx.cpu(), the validation accuracy is 30%.
Training accuracy is about the same for both. I literally don’t change anything in the code apart from the ctx above.
The platform is a SageMaker p3.2xl notebook with MXNet 1.4.0.
Below is the output of the print statements in my training loop:
On GPU:
Epoch 0 Acc 0.7936166666666666
17.39394474029541
Epoch 1 Acc 0.8770833333333333
14.829176187515259
Epoch 2 Acc 0.8936
14.931753396987915
('accuracy', 0.8829)
CPU times: user 2min 6s, sys: 1min 41s, total: 3min 47s
Wall time: 1min 23s
On CPU:
Epoch 0 Acc 0.7714333333333333
148.0548324584961
Epoch 1 Acc 0.8716333333333334
147.9010841846466
Epoch 2 Acc 0.8872
147.93051600456238
('accuracy', 0.3004)
Since training accuracy is the same on CPU and GPU, I suspect the issue is in the validation function, which is this one:
def test(ctx, net, test_data):
    metric = mx.metric.Accuracy()
    for batch in test_data:
        data = gluon.utils.split_and_load(data=batch[0], ctx_list=[ctx], batch_axis=0)
        label = gluon.utils.split_and_load(data=batch[1], ctx_list=[ctx], batch_axis=0)
        # populate prediction (outputs) and actuals (label) for this batch
        outputs = []
        for x in data:
            outputs.append(net(x))
        metric.update(label, outputs)
    return metric.get()
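One cross-check that might help narrow this down is to compute accuracy by hand, bypassing mx.metric.Accuracy, and see whether the hand-computed number also drops on CPU. Below is a minimal sketch of such a check. The helper name manual_accuracy is hypothetical (it is not part of MXNet), it assumes the network outputs one row of class scores per sample, and the demo values are made up for illustration only.

```python
def manual_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label.

    scores: sequence of per-sample class-score rows (lists or array rows).
    labels: sequence of integer class ids, one per sample.
    """
    correct = 0
    total = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row (plain argmax).
        pred = max(range(len(row)), key=lambda i: row[i])
        correct += int(pred == int(label))
        total += 1
    # Return 0.0 for an empty input instead of dividing by zero.
    return correct / total if total else 0.0


# Made-up scores for 3 samples and 2 classes, for illustration only.
demo_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
demo_labels = [1, 0, 0]
demo_accuracy = manual_accuracy(demo_scores, demo_labels)
print(demo_accuracy)
```

Inside the loop of test, this could presumably be fed net(x).asnumpy() and the matching label batch converted with .asnumpy(). If the hand-computed accuracy matches the metric on both contexts, the metric is probably not the culprit and the forward pass itself would be the next thing to compare.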
What is going on?