Periodic Loss Value when training with “step” learning rate policy


#1

When training a deep CNN, a common approach is to use SGD with momentum and a “step” learning rate policy (e.g. the learning rate is set to 0.1, 0.01, 0.001, … at different stages of training). But I encountered an unexpected phenomenon when training with this strategy under MXNet.

That is, the training loss value is periodic: https://user-images.githubusercontent.com/26757001/31327825-356401b6-ad04-11e7-9aeb-3f690bc50df2.png

The above is the training loss at a fixed learning rate of 0.01, where the loss decreases normally. https://user-images.githubusercontent.com/26757001/31327872-8093c3c4-ad04-11e7-8fbd-327b3916b278.png

However, at the second stage of training (with lr 0.001), the loss goes up and down periodically, and the period is exactly one epoch.

So I thought it might be a problem with data shuffling, but that cannot explain why it does not happen in the first stage. I used ImageRecordIter as the DataIter and reset it after every epoch. Is there anything I missed or set incorrectly?

import mxnet as mx

train_iter = mx.io.ImageRecordIter(
    path_imgrec=recPath,            # path to the .rec record file
    data_shape=dataShape,           # e.g. (3, height, width)
    batch_size=batchSize,
    last_batch_handle='discard',    # drop the final incomplete batch of each epoch
    shuffle=True,                   # shuffle the records
    rand_crop=True,                 # random cropping augmentation
    rand_mirror=True)               # random horizontal flipping
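
As a sanity check on the shuffling hypothesis, something along these lines could be run (just a debugging sketch, not part of my training script): it compares the labels of the first batch across two consecutive epochs, which should almost always differ if the iterator really reshuffles on reset.

import numpy as np

train_iter.reset()
labels_epoch1 = next(train_iter).label[0].asnumpy()   # first batch of one epoch
train_iter.reset()
labels_epoch2 = next(train_iter).label[0].asnumpy()   # first batch of the next epoch
print("first batch identical across epochs:",
      np.array_equal(labels_epoch1, labels_epoch2))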

The code for training and loss evaluation:

while True:
    train_iter.reset()                            # start a new epoch
    for i, databatch in enumerate(train_iter):
        globalIter += 1
        mod.forward(databatch, is_train=True)
        mod.update_metric(metric, databatch.label)
        if globalIter % 100 == 0:                 # log the averaged loss every 100 iterations
            loss = metric.get()[1]
            metric.reset()
        mod.backward()
        mod.update()

Actually the loss can converge, but it takes too long. I have suffered from this problem for a long time, on different networks and different datasets. I didn't have this problem when using Caffe. Is this due to a difference in implementation?
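
For reference, the “step” policy can also be wired into the optimizer with MXNet's built-in MultiFactorScheduler instead of changing the learning rate by hand between stages; here is a minimal sketch (the step counts, momentum, and weight decay values are placeholders, not my actual settings):

import mxnet as mx

steps_per_epoch = 1000                      # placeholder
lr_sched = mx.lr_scheduler.MultiFactorScheduler(
    step=[30 * steps_per_epoch, 60 * steps_per_epoch],  # iterations at which lr is scaled
    factor=0.1)                                         # 0.1 -> 0.01 -> 0.001

mod.init_optimizer(optimizer='sgd',
                   optimizer_params={'learning_rate': 0.1,
                                     'momentum': 0.9,          # placeholder
                                     'wd': 0.0005,             # placeholder
                                     'lr_scheduler': lr_sched})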


Originally asked here: stackoverflow/questions/46704238/mxnetperiodic-loss-value-when-training-with-step-learning-rate-policy



#2

Can you reproduce this with a simple MNIST example? Can you post the entire script? (The indentation seems to be a bit off in what you posted above.)
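
Something like the following skeleton would do, just to show the kind of self-contained script that is easy to rerun (the network, hyper-parameters, and schedule are placeholders):

import mxnet as mx

mnist = mx.test_utils.get_mnist()                      # downloads MNIST on first use
train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'],
                               batch_size=100, shuffle=True)

data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(mx.sym.flatten(data), num_hidden=128)
net = mx.sym.Activation(net, act_type='relu')
net = mx.sym.FullyConnected(net, num_hidden=10)
net = mx.sym.SoftmaxOutput(net, name='softmax')

mod = mx.mod.Module(net, context=mx.cpu())
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.001,      # placeholder, e.g. the second-stage lr
                          'momentum': 0.9,
                          'wd': 0.0005},
        eval_metric='ce',                              # cross-entropy loss
        batch_end_callback=mx.callback.Speedometer(100, 100),
        num_epoch=10)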


#3

What is your momentum parameter value? Is it the default of 0.9? And what is your batch size?


#4

What is your weight decay setting? I wonder whether the weight decay is too large: once the lr is reduced, the weight decay takes over the learning and causes the loss to shoot up. Once the loss is high enough, the normal loss gradient is large enough to overcome the weight decay again; combined with momentum, this may oscillate.
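
A rough illustration of that argument (the numbers below are made up, and the exact update formula may differ from MXNet's SGD implementation): near a minimum the data gradient shrinks, while the weight-decay term wd * w stays proportional to the weight magnitude, so it can start to dominate the update.

import numpy as np

wd = 5e-4                               # an assumed weight-decay setting
w = np.full(1000, 0.05)                 # made-up per-weight magnitude for some layer
decay_term = wd * np.abs(w)             # contribution of weight decay to the update

for grad_scale in (1e-2, 1e-5):         # "early in a stage" vs. "near a minimum"
    grad = np.full_like(w, grad_scale)  # stand-in for the data gradient
    print("gradient ~ %g -> weight-decay term / gradient term = %.3g"
          % (grad_scale, decay_term.mean() / np.abs(grad).mean()))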