Training loss never changes but accuracy oscillates


#1

I am using MXNet to train a VQA model. The input is a (6244,) vector and the output is a single label.

During training, the loss never changes while the accuracy oscillates in a small range. The first 5 epochs are:

Epoch 1. Loss: 2.7262569132562255, Train_acc 0.06867348986554285
Epoch 2. Loss: 2.7262569132562255, Train_acc 0.06955649207304837
Epoch 3. Loss: 2.7262569132562255, Train_acc 0.06853301224162152
Epoch 4. Loss: 2.7262569132562255, Train_acc 0.06799116997792494
Epoch 5. Loss: 2.7262569132562255, Train_acc 0.06887417218543046

This is a multi-class classification problem, with each answer label standing for a class, so I use softmax as the final layer and cross-entropy as the loss. The relevant code is below.

So why does the loss never change? I take it directly from cross_entropy.

import numpy as np
import mxnet as mx
from mxnet import gluon, autograd

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
loss = gluon.loss.SoftmaxCrossEntropyLoss()

epochs = 10
moving_loss = 0.
best_eva = 0
for e in range(epochs):
    for i, batch in enumerate(data_train):
        data1 = batch.data[0].as_in_context(ctx)
        data2 = batch.data[1].as_in_context(ctx)
        data = [data1, data2]
        label = batch.label[0].as_in_context(ctx)
        with autograd.record():
            output = net(data)
            cross_entropy = loss(output, label)
            cross_entropy.backward()
        trainer.step(data[0].shape[0])
        
        moving_loss = np.mean(cross_entropy.asnumpy()[0])

    train_accuracy = evaluate_accuracy(data_train, net)
    print("Epoch %s. Loss: %s, Train_acc %s" % (e, moving_loss, train_accuracy))

The eval function is as follows:

def evaluate_accuracy(data_iterator, net, ctx=mx.cpu()):
    metric = mx.metric.Accuracy()
    data_iterator.reset()
    for i, batch in enumerate(data_iterator):
        with autograd.record():
            data1 = batch.data[0].as_in_context(ctx)
            data2 = batch.data[1].as_in_context(ctx)
            data = [data1, data2]
            label = batch.label[0].as_in_context(ctx)
            output = net(data)

        metric.update([label], [output])
    return metric.get()[1]

#2

Hi.

You’re doing your accuracy evaluation within the autograd.record() scope in your evaluate_accuracy function. That’s throwing off your network gradients for the next optimization step. Take out the with autograd.record() line in the function and you should see your loss start to converge. Also there’s no need to call data_iterator.reset() in your eval function.

Feel free to post an update if that doesn’t solve your issue or you run into other issues.


#3

I tried the following variations and got the output below.

If I keep data_iterator.reset() and remove the with autograd.record() in evaluation, the loss does not change and the accuracy becomes zero:

Epoch 1. Loss: 6.835763931274414, Train_acc 0.0
Epoch 2. Loss: 6.835763931274414, Train_acc 0.0
Epoch 3. Loss: 6.835763931274414, Train_acc 0.0
Epoch 4. Loss: 6.835763931274414, Train_acc 0.0
Epoch 5. Loss: 6.835763931274414, Train_acc 0.0

If I remove both data_iterator.reset() and the with autograd.record() in evaluation, the loss does not change and the accuracy becomes nan:

Epoch 1. Loss: 6.835763931274414, Train_acc nan
Epoch 2. Loss: 6.835763931274414, Train_acc nan
Epoch 3. Loss: 6.835763931274414, Train_acc nan
Epoch 4. Loss: 6.835763931274414, Train_acc nan
Epoch 5. Loss: 6.835763931274414, Train_acc nan

The with autograd.record() in the training step cannot be removed, otherwise an error occurs.

One thing I have to add: the official VQA demo, https://gluon.mxnet.io/chapter08_computer-vision/visual-question-answer.html, uses with autograd.record() in the evaluation step and its loss still converges. I tested its code and got the following output:

Epoch 1. Loss: 2.0590806428121624, Train_acc 0.4791814630681818
Epoch 2. Loss: 1.7539432328664892, Train_acc 0.5143821022727273
Epoch 3. Loss: 1.4294043381950257, Train_acc 0.5496271306818182
Epoch 4. Loss: 1.1836000213868916, Train_acc 0.5796431107954545
Epoch 5. Loss: 1.1122687829740323, Train_acc 0.6065488873106061

I think I have to go back and study what autograd.record actually does.


#4

I went back to the official tutorial, and it says autograd.record is something that holds the gradient. Doesn’t that conflict with your statement that it is throwing off the gradients?


#5

Hi, I meant that you should remove the with autograd.record() in the evaluation function ONLY, not in the training loop. You don’t need to record gradients to calculate accuracy, but you do need them in the training part to perform backprop and take an optimization step.

Although, I see that the tutorial also uses autograd in its evaluation function, so that’s probably not your issue. It looks like you’re missing the data_train.reset() line in your training loop, though.


#7

Yes, I think I did exactly what you said; see the third post in this thread. I did not originally specify that I only removed the autograd in evaluation, so I have added that clarification to the post. Sorry for the confusion.


#8

I see. And did you try adding data_train.reset() to your training loop, like in the example? The example has:

for e in range(epochs):
    data_train.reset()
    for i, batch in enumerate(data_train):

#11

OHHHH… Thank you so much, I think this is what caused the problem.

By the way, could you explain why this problem occurs if we do not reset the dataset?


#12

The model has begun to converge; all the problems were caused by the missing reset().

One more question: how does training continue if we do not reset at the beginning of each epoch? By the end of the previous epoch, the iterator has already consumed the last batch of the data, so how can training continue without throwing an error?


#13

My guess is that training does not continue after the first epoch, and that is exactly why moving_loss never changes. This can easily be tested by printing something inside the training loop.
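The behavior can be mimicked with a plain Python iterator, as an analogy: iterating over an exhausted iterator simply yields zero items, so the inner loop body never runs and no error is raised. A minimal sketch (plain Python, no MXNet needed):

```python
# A spent iterator yields nothing on re-iteration; the inner loop body
# is simply skipped, with no error raised. This mimics a data iterator
# that is never reset() between epochs.
data_train = iter(range(3))  # stand-in for a data iterator with 3 batches

batches_seen = []
for epoch in range(3):
    count = 0
    for batch in data_train:  # exhausted after the first epoch
        count += 1
    batches_seen.append(count)

print(batches_seen)  # [3, 0, 0]: only the first epoch actually trains
```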


#15

This is reasonable, but then why does the accuracy oscillate? One more observation: after the first epoch, the subsequent epochs still take about the same amount of time as the first one.