About optimizer (MXNet version 1.2.0)

According to the published paper, nadam seems to be one of the best optimization methods, but when I tried the nadam optimizer with the MNIST example, weird things occurred: accuracy would descend quickly and finally dropped to about 0.2. Why would this optimizer return an even worse result? Can that be normal?
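For reference, roughly the only change I made to the tutorial was the optimizer passed to Module.fit; a sketch (names follow the MXNet MNIST tutorial, so treat this as illustrative):

# Sketch: the stock MNIST tutorial's fit call with the optimizer
# switched to 'nadam'; mlp_model, train_iter, val_iter as in the tutorial
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',  # the tutorial uses 'sgd' here
              eval_metric='acc',
              num_epoch=10)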

Here are the test results:
using ctx=cpu(), optimizer='nadam'

INFO:root:Epoch[0] Batch [100]  Speed: 22101.11 samples/sec     accuracy=0.249307
INFO:root:Epoch[0] Batch [200]  Speed: 27835.34 samples/sec     accuracy=0.244600
INFO:root:Epoch[0] Batch [300]  Speed: 27830.02 samples/sec     accuracy=0.272900
INFO:root:Epoch[0] Batch [400]  Speed: 27835.30 samples/sec     accuracy=0.254700
INFO:root:Epoch[0] Batch [500]  Speed: 26670.45 samples/sec     accuracy=0.199500
INFO:root:Epoch[0] Train-accuracy=0.194444
INFO:root:Epoch[0] Time cost=2.359
INFO:root:Epoch[0] Validation-accuracy=0.208100
INFO:root:Epoch[1] Batch [100]  Speed: 21338.35 samples/sec     accuracy=0.190990
INFO:root:Epoch[1] Batch [200]  Speed: 7712.44 samples/sec      accuracy=0.191300
INFO:root:Epoch[1] Batch [300]  Speed: 6215.06 samples/sec      accuracy=0.208300
INFO:root:Epoch[1] Batch [400]  Speed: 6810.12 samples/sec      accuracy=0.199500
INFO:root:Epoch[1] Batch [500]  Speed: 6810.13 samples/sec      accuracy=0.202300
INFO:root:Epoch[1] Train-accuracy=0.194747
INFO:root:Epoch[1] Time cost=7.764
INFO:root:Epoch[1] Validation-accuracy=0.211700
INFO:root:Epoch[2] Batch [100]  Speed: 6883.36 samples/sec      accuracy=0.194455
INFO:root:Epoch[2] Batch [200]  Speed: 6809.94 samples/sec      accuracy=0.195800
INFO:root:Epoch[2] Batch [300]  Speed: 6668.24 samples/sec      accuracy=0.201100
INFO:root:Epoch[2] Batch [400]  Speed: 6668.24 samples/sec      accuracy=0.194900
INFO:root:Epoch[2] Batch [500]  Speed: 6810.11 samples/sec      accuracy=0.197300
INFO:root:Epoch[2] Train-accuracy=0.192727
INFO:root:Epoch[2] Time cost=8.842
INFO:root:Epoch[2] Validation-accuracy=0.208300
INFO:root:Epoch[3] Batch [100]  Speed: 6809.95 samples/sec      accuracy=0.191485
INFO:root:Epoch[3] Batch [200]  Speed: 6662.77 samples/sec      accuracy=0.194300
INFO:root:Epoch[3] Batch [300]  Speed: 6810.12 samples/sec      accuracy=0.207700
INFO:root:Epoch[3] Batch [400]  Speed: 6810.28 samples/sec      accuracy=0.182800
INFO:root:Epoch[3] Batch [500]  Speed: 6810.12 samples/sec      accuracy=0.195300
INFO:root:Epoch[3] Train-accuracy=0.193131
INFO:root:Epoch[3] Time cost=8.827
INFO:root:Epoch[3] Validation-accuracy=0.209400
INFO:root:Epoch[4] Batch [100]  Speed: 6883.33 samples/sec      accuracy=0.198119
INFO:root:Epoch[4] Batch [200]  Speed: 6810.21 samples/sec      accuracy=0.198100
INFO:root:Epoch[4] Batch [300]  Speed: 6809.96 samples/sec      accuracy=0.202300
INFO:root:Epoch[4] Batch [400]  Speed: 6738.59 samples/sec      accuracy=0.201000
INFO:root:Epoch[4] Batch [500]  Speed: 6809.96 samples/sec      accuracy=0.197300
INFO:root:Epoch[4] Train-accuracy=0.194747
INFO:root:Epoch[4] Time cost=8.826
INFO:root:Epoch[4] Validation-accuracy=0.211400
INFO:root:Epoch[5] Batch [100]  Speed: 6738.41 samples/sec      accuracy=0.203168
INFO:root:Epoch[5] Batch [200]  Speed: 6688.91 samples/sec      accuracy=0.196500
INFO:root:Epoch[5] Batch [300]  Speed: 6738.44 samples/sec      accuracy=0.203700
INFO:root:Epoch[5] Batch [400]  Speed: 6810.14 samples/sec      accuracy=0.201000
INFO:root:Epoch[5] Batch [500]  Speed: 6810.12 samples/sec      accuracy=0.197100
INFO:root:Epoch[5] Train-accuracy=0.194646
INFO:root:Epoch[5] Time cost=8.868
INFO:root:Epoch[5] Validation-accuracy=0.190300
INFO:root:Epoch[6] Batch [100]  Speed: 6668.08 samples/sec      accuracy=0.202772
INFO:root:Epoch[6] Batch [200]  Speed: 6810.12 samples/sec      accuracy=0.201900
INFO:root:Epoch[6] Batch [300]  Speed: 6810.11 samples/sec      accuracy=0.201300
INFO:root:Epoch[6] Batch [400]  Speed: 6810.13 samples/sec      accuracy=0.203800
INFO:root:Epoch[6] Batch [500]  Speed: 6810.12 samples/sec      accuracy=0.199600
INFO:root:Epoch[6] Train-accuracy=0.188485
INFO:root:Epoch[6] Time cost=8.826
INFO:root:Epoch[6] Validation-accuracy=0.191500
INFO:root:Epoch[7] Batch [100]  Speed: 6809.97 samples/sec      accuracy=0.192178
INFO:root:Epoch[7] Batch [200]  Speed: 6810.14 samples/sec      accuracy=0.195700
INFO:root:Epoch[7] Batch [300]  Speed: 6810.12 samples/sec      accuracy=0.192200
INFO:root:Epoch[7] Batch [400]  Speed: 6521.27 samples/sec      accuracy=0.198400
INFO:root:Epoch[7] Batch [500]  Speed: 6599.68 samples/sec      accuracy=0.191200
INFO:root:Epoch[7] Train-accuracy=0.194242
INFO:root:Epoch[7] Time cost=8.979
INFO:root:Epoch[7] Validation-accuracy=0.193200
INFO:root:Epoch[8] Batch [100]  Speed: 6605.41 samples/sec      accuracy=0.197525
INFO:root:Epoch[8] Batch [200]  Speed: 6527.83 samples/sec      accuracy=0.197900
INFO:root:Epoch[8] Batch [300]  Speed: 6627.89 samples/sec      accuracy=0.200800
INFO:root:Epoch[8] Batch [400]  Speed: 6668.57 samples/sec      accuracy=0.202500
INFO:root:Epoch[8] Batch [500]  Speed: 6599.34 samples/sec      accuracy=0.190000
INFO:root:Epoch[8] Train-accuracy=0.195051
INFO:root:Epoch[8] Time cost=9.073
INFO:root:Epoch[8] Validation-accuracy=0.208400
INFO:root:Epoch[9] Batch [100]  Speed: 6532.14 samples/sec      accuracy=0.200891
INFO:root:Epoch[9] Batch [200]  Speed: 6466.19 samples/sec      accuracy=0.200800
INFO:root:Epoch[9] Batch [300]  Speed: 6599.30 samples/sec      accuracy=0.199600
INFO:root:Epoch[9] Batch [400]  Speed: 6738.63 samples/sec      accuracy=0.205600
INFO:root:Epoch[9] Batch [500]  Speed: 6338.11 samples/sec      accuracy=0.198600
INFO:root:Epoch[9] Train-accuracy=0.197071
INFO:root:Epoch[9] Time cost=9.123
INFO:root:Epoch[9] Validation-accuracy=0.209500

using ctx=gpu(), optimizer='nadam'

INFO:root:Epoch[0] Batch [100]  Speed: 14225.57 samples/sec     accuracy=0.296238
INFO:root:Epoch[0] Batch [200]  Speed: 14548.91 samples/sec     accuracy=0.412700
INFO:root:Epoch[0] Batch [300]  Speed: 14548.88 samples/sec     accuracy=0.434700
INFO:root:Epoch[0] Batch [400]  Speed: 14548.88 samples/sec     accuracy=0.423600
INFO:root:Epoch[0] Batch [500]  Speed: 14548.89 samples/sec     accuracy=0.423300
INFO:root:Epoch[0] Train-accuracy=0.424747
INFO:root:Epoch[0] Time cost=4.140
INFO:root:Epoch[0] Validation-accuracy=0.418100
INFO:root:Epoch[1] Batch [100]  Speed: 14548.89 samples/sec     accuracy=0.432772
INFO:root:Epoch[1] Batch [200]  Speed: 14548.89 samples/sec     accuracy=0.425300
INFO:root:Epoch[1] Batch [300]  Speed: 14225.57 samples/sec     accuracy=0.383200
INFO:root:Epoch[1] Batch [400]  Speed: 14548.90 samples/sec     accuracy=0.319800
INFO:root:Epoch[1] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.299300
INFO:root:Epoch[1] Train-accuracy=0.300707
INFO:root:Epoch[1] Time cost=4.155
INFO:root:Epoch[1] Validation-accuracy=0.271500
INFO:root:Epoch[2] Batch [100]  Speed: 14548.91 samples/sec     accuracy=0.275149
INFO:root:Epoch[2] Batch [200]  Speed: 14548.89 samples/sec     accuracy=0.325900
INFO:root:Epoch[2] Batch [300]  Speed: 14548.89 samples/sec     accuracy=0.357300
INFO:root:Epoch[2] Batch [400]  Speed: 14548.91 samples/sec     accuracy=0.347500
INFO:root:Epoch[2] Batch [500]  Speed: 14225.57 samples/sec     accuracy=0.361400
INFO:root:Epoch[2] Train-accuracy=0.342626
INFO:root:Epoch[2] Time cost=4.140
INFO:root:Epoch[2] Validation-accuracy=0.354200
INFO:root:Epoch[3] Batch [100]  Speed: 14548.91 samples/sec     accuracy=0.357426
INFO:root:Epoch[3] Batch [200]  Speed: 14548.88 samples/sec     accuracy=0.202400
INFO:root:Epoch[3] Batch [300]  Speed: 14548.88 samples/sec     accuracy=0.206400
INFO:root:Epoch[3] Batch [400]  Speed: 14548.89 samples/sec     accuracy=0.199300
INFO:root:Epoch[3] Batch [500]  Speed: 14548.92 samples/sec     accuracy=0.192800
INFO:root:Epoch[3] Train-accuracy=0.194141
INFO:root:Epoch[3] Time cost=4.124
INFO:root:Epoch[3] Validation-accuracy=0.193400
INFO:root:Epoch[4] Batch [100]  Speed: 14548.91 samples/sec     accuracy=0.192574
INFO:root:Epoch[4] Batch [200]  Speed: 13916.31 samples/sec     accuracy=0.192300
INFO:root:Epoch[4] Batch [300]  Speed: 14548.87 samples/sec     accuracy=0.206800
INFO:root:Epoch[4] Batch [400]  Speed: 14548.91 samples/sec     accuracy=0.203000
INFO:root:Epoch[4] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.200900
INFO:root:Epoch[4] Train-accuracy=0.194444
INFO:root:Epoch[4] Time cost=4.187
INFO:root:Epoch[4] Validation-accuracy=0.197000
INFO:root:Epoch[5] Batch [100]  Speed: 14548.88 samples/sec     accuracy=0.199604
INFO:root:Epoch[5] Batch [200]  Speed: 14225.58 samples/sec     accuracy=0.195100
INFO:root:Epoch[5] Batch [300]  Speed: 14225.59 samples/sec     accuracy=0.202500
INFO:root:Epoch[5] Batch [400]  Speed: 14548.88 samples/sec     accuracy=0.204800
INFO:root:Epoch[5] Batch [500]  Speed: 14548.89 samples/sec     accuracy=0.202400
INFO:root:Epoch[5] Train-accuracy=0.197071
INFO:root:Epoch[5] Time cost=4.155
INFO:root:Epoch[5] Validation-accuracy=0.206700
INFO:root:Epoch[6] Batch [100]  Speed: 14225.60 samples/sec     accuracy=0.196733
INFO:root:Epoch[6] Batch [200]  Speed: 14225.57 samples/sec     accuracy=0.195900
INFO:root:Epoch[6] Batch [300]  Speed: 14548.88 samples/sec     accuracy=0.205500
INFO:root:Epoch[6] Batch [400]  Speed: 14548.91 samples/sec     accuracy=0.206000
INFO:root:Epoch[6] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.204300
INFO:root:Epoch[6] Train-accuracy=0.194444
INFO:root:Epoch[6] Time cost=4.171
INFO:root:Epoch[6] Validation-accuracy=0.208300
INFO:root:Epoch[7] Batch [100]  Speed: 14225.60 samples/sec     accuracy=0.202772
INFO:root:Epoch[7] Batch [200]  Speed: 13620.24 samples/sec     accuracy=0.201000
INFO:root:Epoch[7] Batch [300]  Speed: 14225.58 samples/sec     accuracy=0.207600
INFO:root:Epoch[7] Batch [400]  Speed: 13916.32 samples/sec     accuracy=0.205500
INFO:root:Epoch[7] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.200200
INFO:root:Epoch[7] Train-accuracy=0.192323
INFO:root:Epoch[7] Time cost=4.249
INFO:root:Epoch[7] Validation-accuracy=0.191600
INFO:root:Epoch[8] Batch [100]  Speed: 14548.90 samples/sec     accuracy=0.198713
INFO:root:Epoch[8] Batch [200]  Speed: 14225.56 samples/sec     accuracy=0.196500
INFO:root:Epoch[8] Batch [300]  Speed: 13916.36 samples/sec     accuracy=0.204200
INFO:root:Epoch[8] Batch [400]  Speed: 14548.87 samples/sec     accuracy=0.201800
INFO:root:Epoch[8] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.197700
INFO:root:Epoch[8] Train-accuracy=0.194949
INFO:root:Epoch[8] Time cost=4.202
INFO:root:Epoch[8] Validation-accuracy=0.208800
INFO:root:Epoch[9] Batch [100]  Speed: 14548.89 samples/sec     accuracy=0.202376
INFO:root:Epoch[9] Batch [200]  Speed: 14225.59 samples/sec     accuracy=0.201400
INFO:root:Epoch[9] Batch [300]  Speed: 14548.89 samples/sec     accuracy=0.205800
INFO:root:Epoch[9] Batch [400]  Speed: 14225.57 samples/sec     accuracy=0.203800
INFO:root:Epoch[9] Batch [500]  Speed: 14548.89 samples/sec     accuracy=0.200600
INFO:root:Epoch[9] Train-accuracy=0.193939
INFO:root:Epoch[9] Time cost=4.171
INFO:root:Epoch[9] Validation-accuracy=0.206100

Several days ago I ran into the same thing, but I put it down to an incorrect learning rate: in this example, just deleting the learning rate setting made things better. But now the problem is back.
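By "deleting the learning rate" I mean dropping the explicit learning-rate override so nadam falls back to its default of 0.001; roughly this, assuming the tutorial's optimizer_params={'learning_rate': 0.1} was still being passed:

# Before (assumed): learning rate carried over from the tutorial's SGD
# setup, which is far too high for nadam
mlp_model.fit(train_iter, eval_data=val_iter, optimizer='nadam',
              optimizer_params={'learning_rate': 0.1},
              eval_metric='acc', num_epoch=10)

# After: no optimizer_params, so nadam uses its default learning rate (0.001)
mlp_model.fit(train_iter, eval_data=val_iter, optimizer='nadam',
              eval_metric='acc', num_epoch=10)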
Here it is again, with batch_size=10000, ctx=mx.gpu(), optimizer='nadam', using the default learning rate:

......
INFO:root:Epoch[25] Validation-accuracy=0.988000
INFO:root:Epoch[26] Train-accuracy=0.991017
INFO:root:Epoch[26] Time cost=0.984
INFO:root:Epoch[26] Validation-accuracy=0.986100
INFO:root:Epoch[27] Train-accuracy=0.987683
INFO:root:Epoch[27] Time cost=0.969
INFO:root:Epoch[27] Validation-accuracy=0.976800
INFO:root:Epoch[28] Train-accuracy=0.786283
INFO:root:Epoch[28] Time cost=0.984
INFO:root:Epoch[28] Validation-accuracy=0.106700
INFO:root:Epoch[29] Train-accuracy=0.103833
INFO:root:Epoch[29] Time cost=0.984
INFO:root:Epoch[29] Validation-accuracy=0.089200
INFO:root:Epoch[30] Train-accuracy=0.097083
INFO:root:Epoch[30] Time cost=0.984
INFO:root:Epoch[30] Validation-accuracy=0.100900
INFO:root:Epoch[31] Train-accuracy=0.102583
INFO:root:Epoch[31] Time cost=0.984
......

@Neutron can you share your entire training code so I can try to reproduce it and give you tips to solve your issue?

It does look like your learning rate could be too high and the training is diverging.
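If so, one easy thing to try is passing a smaller learning rate through optimizer_params, for example (the value here is just a guess to test the hypothesis):

# Hypothetical fix: shrink nadam's learning rate (default is 0.001)
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',
              optimizer_params={'learning_rate': 1e-4},
              eval_metric='acc',
              num_epoch=10)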

I'm sorry to reply so late. Perhaps because I switched from mxnet-cu91 to mxnet-cu92 (with the newest CUDA 9.2 installed), I cannot reproduce that result anymore.
But the code below easily shows that the nadam optimizer is not stable.

import mxnet as mx
mnist = mx.test_utils.get_mnist()

# Fix the seed
mx.random.seed(1)

# Set the compute context: GPU if available, otherwise CPU
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

batch_size = 10000
train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)

data = mx.sym.var('data')
# Flatten the data from 4-D shape into 2-D (batch_size, num_channel*width*height)
data = mx.sym.flatten(data=data)

# The first fully-connected layer and the corresponding activation function
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128)
act1 = mx.sym.Activation(data=fc1, act_type="relu")

# The second fully-connected layer and the corresponding activation function
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=64)
act2 = mx.sym.Activation(data=fc2, act_type="relu")

# MNIST has 10 classes
fc3  = mx.sym.FullyConnected(data=act2, num_hidden=10)
# Softmax with cross entropy loss
mlp  = mx.sym.SoftmaxOutput(data=fc3, name='softmax')

import logging
logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout
# create a trainable module on compute context
mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
mlp_model.fit(train_iter,  # train data
              eval_data=val_iter,  # validation data
              optimizer='nadam',  # use nadam to train
              eval_metric='acc',  # report accuracy during training
              batch_end_callback=mx.callback.Speedometer(batch_size, 2),  # log progress every 2 batches (20,000 samples)
              num_epoch=1000)  # train for at most 1000 dataset passes

test_iter = mx.io.NDArrayIter(mnist['test_data'], None, batch_size)
prob = mlp_model.predict(test_iter)
assert prob.shape == (10000, 10)

test_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
# predict accuracy of mlp
acc = mx.metric.Accuracy()
mlp_model.score(test_iter, acc)
print(acc)
assert acc.get()[1] > 0.96, "Achieved accuracy (%f) is lower than expected (0.96)" % acc.get()[1]

Running it, I get:

...
INFO:root:Epoch[24] Validation-accuracy=0.959500
INFO:root:Epoch[25] Batch [2]   Speed: 1280312.58 samples/sec   accuracy=0.964833
INFO:root:Epoch[25] Batch [4]   Speed: 640161.17 samples/sec    accuracy=0.964200
INFO:root:Epoch[25] Train-accuracy=0.960800
INFO:root:Epoch[25] Time cost=0.094
INFO:root:Epoch[25] Validation-accuracy=0.932800
INFO:root:Epoch[26] Batch [2]   Speed: 640151.40 samples/sec    accuracy=0.880433
INFO:root:Epoch[26] Batch [4]   Speed: 640151.40 samples/sec    accuracy=0.605650
INFO:root:Epoch[26] Train-accuracy=0.626700
INFO:root:Epoch[26] Time cost=0.078
INFO:root:Epoch[26] Validation-accuracy=0.699800
INFO:root:Epoch[27] Batch [2]   Speed: 639741.32 samples/sec    accuracy=0.795633
INFO:root:Epoch[27] Batch [4]   Speed: 640146.52 samples/sec    accuracy=0.940700
...

Here we can see that the accuracy drops significantly in epoch 26.

The nadam optimizer comes with some parameters that might need tweaking for better results:

    beta1 : float, optional
        Exponential decay rate for the first moment estimates.
    beta2 : float, optional
        Exponential decay rate for the second moment estimates.
    epsilon : float, optional
        Small value to avoid division by 0.
    schedule_decay : float, optional
        Exponential decay rate for the momentum schedule

Default values:

learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, schedule_decay=0.004
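
You can override any of these through optimizer_params when calling fit; a sketch (the values are illustrative, not recommendations):

# Sketch: overriding nadam's hyper-parameters via Module.fit
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',
              optimizer_params={'learning_rate': 0.001,
                                'beta1': 0.9,
                                'beta2': 0.999,
                                'epsilon': 1e-6,  # a larger epsilon can improve stability
                                'schedule_decay': 0.004},
              eval_metric='acc',
              num_epoch=10)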

Note that the learning rate does scale the final update step, so if training diverges, lowering learning_rate from its default of 0.001 is a reasonable first thing to try.
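
For reference, here is a minimal NumPy sketch of the Nadam update rule as published (Dozat, "Incorporating Nesterov Momentum into Adam"), showing where each hyper-parameter, including the learning rate, enters. It is an illustration of the algorithm, not MXNet's exact code:

import numpy as np

def nadam_step(weight, grad, m, v, m_schedule, t,
               lr=0.001, beta1=0.9, beta2=0.999,
               epsilon=1e-8, schedule_decay=0.004):
    # t is the 1-indexed update count; m, v start at 0 and m_schedule at 1.0
    # "warming" momentum schedule: beta1 is annealed over time
    mu_t  = beta1 * (1.0 - 0.5 * 0.96 ** (t * schedule_decay))
    mu_t1 = beta1 * (1.0 - 0.5 * 0.96 ** ((t + 1) * schedule_decay))
    m_schedule = m_schedule * mu_t  # running product of the mu_t values

    # exponential moving averages of the gradient and the squared gradient
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2

    # bias correction plus the Nesterov look-ahead on the first moment
    g_prime = grad / (1.0 - m_schedule)
    m_prime = m / (1.0 - m_schedule * mu_t1)
    v_prime = v / (1.0 - beta2 ** t)
    m_bar   = (1.0 - mu_t) * g_prime + mu_t1 * m_prime

    # the learning rate scales the final step
    weight = weight - lr * m_bar / (np.sqrt(v_prime) + epsilon)
    return weight, m, v, m_schedule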