Learning rate doesn't decrease after resuming training in MXNet's Gluon example


#1

Hi everyone,
My training stopped for some reason, and now that I want to resume it, the learning rate won't change!
I am following the Classification example here.
I believe I set all the parameters correctly. They are as follows:

DTYPE=float16 
BATCHSIZE=384 
WORKER=20
EPOCH=187
CHECKPOINT=params_model_mixup/0.3399-imagenet-186-best.states
PARAMS=params_model_mixup/0.3399-imagenet-186-best.params

python train_imagenet.py \
  --rec-train /media/void/SSD/ImageNet_DataSet/train/rec_train/train.rec --rec-train-idx /media/void/SSD/ImageNet_DataSet/train/rec_train/train.idx \
  --rec-val /media/void/SSD/ImageNet_DataSet/train/rec_val/val.rec --rec-val-idx /media/void/SSD/ImageNet_DataSet/train/rec_val/val.idx \
  --model model --mode hybrid \
  --lr 0.4 --lr-mode cosine --num-epochs 200 --batch-size $BATCHSIZE --num-gpus 1 -j $WORKER \
  --use-rec --dtype $DTYPE --warmup-epochs 0 --no-wd --label-smoothing --mixup \
  --save-dir params_model_mixup \
  --logging-file model_mixup.log --resume-states $CHECKPOINT --resume-params $PARAMS --resume-epoch $EPOCH 

As you can see below, the learning rate stays the same from batch to batch:

|Epoch[187] Batch [49]|Speed: 492.394147 samples/sec|rmse=0.019614|lr=0.004371|
|Epoch[187] Batch [99]|Speed: 603.372949 samples/sec|rmse=0.019578|lr=0.004371|
|Epoch[187] Batch [149]|Speed: 604.314057 samples/sec|rmse=0.019593|lr=0.004371|

What am I missing here?
Any help is greatly appreciated.


#2

You don’t seem to be doing anything obviously wrong here. Set a breakpoint (using import pdb; pdb.set_trace() or otherwise) on line 359 of the train_imagenet.py script to confirm that the learning rate scheduler is actually being updated on every batch:

lr_scheduler.update(i, epoch)
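
To know what to look for at that breakpoint, it may help to estimate how much the learning rate should move between the logged batches. The snippet below is only a rough sketch of a plain per-iteration cosine decay, not the scheduler code from the script. The function name cosine_lr, the final_lr argument, and the iterations-per-epoch figure (about 1.28M ImageNet training images divided by the batch size of 384) are assumptions made for illustration.

```python
import math


def cosine_lr(base_lr, epoch, batch, iters_per_epoch, num_epochs, final_lr=0.0):
    """Plain cosine decay evaluated per iteration (illustrative, hypothetical helper)."""
    t = epoch * iters_per_epoch + batch      # iterations completed so far
    total = num_epochs * iters_per_epoch     # iterations in the whole run
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t / total))


# Assumption: roughly 1,281,167 training images and batch size 384.
ITERS_PER_EPOCH = 1281167 // 384

for batch in (49, 99, 149):
    lr = cosine_lr(0.4, epoch=187, batch=batch,
                   iters_per_epoch=ITERS_PER_EPOCH, num_epochs=200)
    print("Epoch[187] Batch [%d] lr=%.6f" % (batch, lr))
```

Under these assumptions the printed value shrinks slightly from one log line to the next, so a learning rate that is identical to six decimal places across 100 batches suggests the update call is either not reached or is not affecting the value being logged. The absolute numbers need not match the real scheduler, since warmup handling and the exact iteration count may differ.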