Issue with training Faster R-CNN using Adam

I tried to train my model using Adam by modifying the code as follows:

# create an Adam optimizer instance (default settings)
adam_optimizer = mx.optimizer.Adam()
optimizer_params = {'wd': 0.0,
                    'learning_rate': lr,
                    'lr_scheduler': lr_scheduler,
                    'rescale_grad': (1.0 / batch_size)}
#train
mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
        batch_end_callback=batch_end_callback, kvstore=args.kvstore,
        optimizer=adam_optimizer, optimizer_params=optimizer_params,
        arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch,
        num_epoch=end_epoch)

But the training loss at the beginning became very large (Train-RPNAcc=0.843564, RPNLogLoss=1.250399, RPNL1Loss=30.644248, RCNNAcc=0.818452, RCNNLogLoss=2.148393, RCNNL1Loss=90.618619). If I switch back to SGD, the loss values are normal (Train-RPNAcc=0.904762, RPNLogLoss=0.297273, RPNL1Loss=1.345441, RCNNAcc=0.849330, RCNNLogLoss=0.461176, RCNNL1Loss=1.354734).
It seems that the pretrained model is not loaded correctly when I use Adam.
Why is that?
Thanks a lot!

If the only change you’ve made between the two training runs is the optimizer, that would have no impact on loading the pre-trained parameters. Adam does require a different set of hyper-parameters, including the learning rate, than plain SGD. For example, rescale_grad has little effect with Adam because of how the optimization algorithm normalizes its updates. I would try reducing the learning_rate by a factor of 10 or more until you see proper convergence behavior and let Adam “adapt”.
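
As a minimal sketch (untested) of what that could look like, assuming the same lr, lr_scheduler, batch_size, and other fit() arguments from your snippet above, and passing the optimizer by name so the params go to it:

# cut the learning rate by 10x for Adam, as suggested above
optimizer_params = {'wd': 0.0,
                    'learning_rate': lr * 0.1,
                    'lr_scheduler': lr_scheduler,
                    'rescale_grad': (1.0 / batch_size)}

mod.fit(train_data, eval_metric=eval_metrics, epoch_end_callback=epoch_end_callback,
        batch_end_callback=batch_end_callback, kvstore=args.kvstore,
        optimizer='adam', optimizer_params=optimizer_params,
        arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch,
        num_epoch=end_epoch)

If that first reduction is not enough, keep lowering the learning rate (or the factor used above) until the losses stabilize.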

I have tried reducing the learning_rate to 1e-6, but it still doesn’t work.
Is Adam not suited to this task? It works fine with Caffe.
I’m so confused.