When training a deep CNN, a common approach is to use **SGD with momentum and a "step" learning rate policy** (e.g. the learning rate is set to 0.1, 0.01, 0.001, … at successive stages of training). But I encountered an unexpected phenomenon when training with this strategy under MXNet.
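
For reference, here is a minimal sketch of the kind of setup I mean, using MXNet's `MultiFactorScheduler` (the step boundaries, learning rate, and momentum here are illustrative placeholders, not my actual values; `mod` is the `Module` from the code below):

```
import mxnet as mx

# A "step" policy sketch: multiply the learning rate by `factor` each time
# the iteration count passes a boundary in `step` (values are placeholders).
lr_sched = mx.lr_scheduler.MultiFactorScheduler(step=[60000, 120000], factor=0.1)
mod.init_optimizer(optimizer='sgd',
                   optimizer_params={'learning_rate': 0.1,
                                     'momentum': 0.9,
                                     'lr_scheduler': lr_sched})
```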

That is, a periodic training loss value:
https://user-images.githubusercontent.com/26757001/31327825-356401b6-ad04-11e7-9aeb-3f690bc50df2.png

The above is the training loss at a fixed learning rate of 0.01, where the loss decreases normally.
https://user-images.githubusercontent.com/26757001/31327872-8093c3c4-ad04-11e7-8fbd-327b3916b278.png

However, at the second stage of training (with lr 0.001), the loss goes up and down periodically, and the period is exactly one epoch.

So I thought it might be a problem with data shuffling, but that cannot explain why it doesn't happen in the first stage. I use `ImageRecordIter` as the `DataIter` and reset it after every epoch. Is there anything I missed or set mistakenly?

```
train_iter = mx.io.ImageRecordIter(
    path_imgrec=recPath,             # path to the .rec file
    data_shape=dataShape,            # e.g. (3, H, W)
    batch_size=batchSize,
    last_batch_handle='discard',     # drop the final incomplete batch
    shuffle=True,                    # shuffle record order
    rand_crop=True,                  # random-crop augmentation
    rand_mirror=True)                # random horizontal flipping
```
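
As a quick sanity check on the shuffling hypothesis, something like the following sketch (not part of my original code) could tell whether `reset()` actually produces a different batch order each epoch. It fingerprints the label order of the first batch; labels are unaffected by `rand_crop`/`rand_mirror`, so identical digests across epochs would mean each epoch replays the same record order:

```
import hashlib

# Compare a fingerprint of the first batch's labels across several epochs;
# identical digests => the data order repeats every epoch.
for epoch in range(3):
    train_iter.reset()
    batch = train_iter.next()
    digest = hashlib.md5(batch.label[0].asnumpy().tobytes()).hexdigest()
    print('epoch %d: first-batch label md5 = %s' % (epoch, digest))
```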

The code for training and loss evaluation:

```
while True:
    train_iter.reset()                         # start a new epoch
    for i, databatch in enumerate(train_iter):
        globalIter += 1
        mod.forward(databatch, is_train=True)
        mod.update_metric(metric, databatch.label)
        if globalIter % 100 == 0:
            loss = metric.get()[1]             # average loss over the last 100 batches
            metric.reset()
        mod.backward()
        mod.update()
```
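
To make the periodicity concrete, one could fold the logged losses by epoch and overlay them (a sketch; `lossLog` is a hypothetical list collecting the loss value recorded every 100 iterations above, and `batchesPerEpoch` the number of batches per epoch). Peaks that land at the same within-epoch position every epoch would point to data ordering rather than to the optimizer:

```
import numpy as np
import matplotlib.pyplot as plt

# lossLog: hypothetical list of loss values recorded every 100 iterations;
# batchesPerEpoch: number of batches in one epoch of the training set.
losses = np.asarray(lossLog)
pointsPerEpoch = batchesPerEpoch // 100        # loss points logged per epoch

# Overlay each epoch's curve; aligned peaks across epochs suggest the same
# batches are "hard" every epoch, i.e. a repeating data order.
for e in range(len(losses) // pointsPerEpoch):
    plt.plot(losses[e * pointsPerEpoch:(e + 1) * pointsPerEpoch],
             label='epoch %d' % e)
plt.xlabel('position within epoch (x100 iterations)')
plt.ylabel('training loss')
plt.legend()
plt.show()
```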

Actually the loss does converge, but it takes too long. I've suffered from this problem for a long time, on different networks and different datasets. I didn't have this problem when using Caffe. Is this due to an implementation difference?

Originally asked here: stackoverflow/questions/46704238/mxnetperiodic-loss-value-when-training-with-step-learning-rate-policy