About optimizer (MXNet version 1.2.0)


#1

According to the published paper, nadam seems to be the best optimization method, but when I tried the nadam optimizer with the mnist example, weird things happened: accuracy descended quickly and finally dropped to about 0.2. I wonder why the optimizer returns an even worse result. Could this be normal?
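
For reference, this is roughly how I enable the optimizer (a minimal sketch; mlp, train_iter and val_iter are defined exactly as in the full script I post further down in this thread):

import mxnet as mx

# Sketch: the only change relative to the stock mnist example is
# passing 'nadam' instead of the default 'sgd' to Module.fit.
mlp_model = mx.mod.Module(symbol=mlp, context=mx.cpu())
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',  # the optimizer under test
              eval_metric='acc',
              num_epoch=10)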

Here are the test results.
using ctx=cpu() optimizer='nadam'

INFO:root:Epoch[0] Batch [100]  Speed: 22101.11 samples/sec     accuracy=0.249307
INFO:root:Epoch[0] Batch [200]  Speed: 27835.34 samples/sec     accuracy=0.244600
INFO:root:Epoch[0] Batch [300]  Speed: 27830.02 samples/sec     accuracy=0.272900
INFO:root:Epoch[0] Batch [400]  Speed: 27835.30 samples/sec     accuracy=0.254700
INFO:root:Epoch[0] Batch [500]  Speed: 26670.45 samples/sec     accuracy=0.199500
INFO:root:Epoch[0] Train-accuracy=0.194444
INFO:root:Epoch[0] Time cost=2.359
INFO:root:Epoch[0] Validation-accuracy=0.208100
INFO:root:Epoch[1] Batch [100]  Speed: 21338.35 samples/sec     accuracy=0.190990
INFO:root:Epoch[1] Batch [200]  Speed: 7712.44 samples/sec      accuracy=0.191300
INFO:root:Epoch[1] Batch [300]  Speed: 6215.06 samples/sec      accuracy=0.208300
INFO:root:Epoch[1] Batch [400]  Speed: 6810.12 samples/sec      accuracy=0.199500
INFO:root:Epoch[1] Batch [500]  Speed: 6810.13 samples/sec      accuracy=0.202300
INFO:root:Epoch[1] Train-accuracy=0.194747
INFO:root:Epoch[1] Time cost=7.764
INFO:root:Epoch[1] Validation-accuracy=0.211700
INFO:root:Epoch[2] Batch [100]  Speed: 6883.36 samples/sec      accuracy=0.194455
INFO:root:Epoch[2] Batch [200]  Speed: 6809.94 samples/sec      accuracy=0.195800
INFO:root:Epoch[2] Batch [300]  Speed: 6668.24 samples/sec      accuracy=0.201100
INFO:root:Epoch[2] Batch [400]  Speed: 6668.24 samples/sec      accuracy=0.194900
INFO:root:Epoch[2] Batch [500]  Speed: 6810.11 samples/sec      accuracy=0.197300
INFO:root:Epoch[2] Train-accuracy=0.192727
INFO:root:Epoch[2] Time cost=8.842
INFO:root:Epoch[2] Validation-accuracy=0.208300
INFO:root:Epoch[3] Batch [100]  Speed: 6809.95 samples/sec      accuracy=0.191485
INFO:root:Epoch[3] Batch [200]  Speed: 6662.77 samples/sec      accuracy=0.194300
INFO:root:Epoch[3] Batch [300]  Speed: 6810.12 samples/sec      accuracy=0.207700
INFO:root:Epoch[3] Batch [400]  Speed: 6810.28 samples/sec      accuracy=0.182800
INFO:root:Epoch[3] Batch [500]  Speed: 6810.12 samples/sec      accuracy=0.195300
INFO:root:Epoch[3] Train-accuracy=0.193131
INFO:root:Epoch[3] Time cost=8.827
INFO:root:Epoch[3] Validation-accuracy=0.209400
INFO:root:Epoch[4] Batch [100]  Speed: 6883.33 samples/sec      accuracy=0.198119
INFO:root:Epoch[4] Batch [200]  Speed: 6810.21 samples/sec      accuracy=0.198100
INFO:root:Epoch[4] Batch [300]  Speed: 6809.96 samples/sec      accuracy=0.202300
INFO:root:Epoch[4] Batch [400]  Speed: 6738.59 samples/sec      accuracy=0.201000
INFO:root:Epoch[4] Batch [500]  Speed: 6809.96 samples/sec      accuracy=0.197300
INFO:root:Epoch[4] Train-accuracy=0.194747
INFO:root:Epoch[4] Time cost=8.826
INFO:root:Epoch[4] Validation-accuracy=0.211400
INFO:root:Epoch[5] Batch [100]  Speed: 6738.41 samples/sec      accuracy=0.203168
INFO:root:Epoch[5] Batch [200]  Speed: 6688.91 samples/sec      accuracy=0.196500
INFO:root:Epoch[5] Batch [300]  Speed: 6738.44 samples/sec      accuracy=0.203700
INFO:root:Epoch[5] Batch [400]  Speed: 6810.14 samples/sec      accuracy=0.201000
INFO:root:Epoch[5] Batch [500]  Speed: 6810.12 samples/sec      accuracy=0.197100
INFO:root:Epoch[5] Train-accuracy=0.194646
INFO:root:Epoch[5] Time cost=8.868
INFO:root:Epoch[5] Validation-accuracy=0.190300
INFO:root:Epoch[6] Batch [100]  Speed: 6668.08 samples/sec      accuracy=0.202772
INFO:root:Epoch[6] Batch [200]  Speed: 6810.12 samples/sec      accuracy=0.201900
INFO:root:Epoch[6] Batch [300]  Speed: 6810.11 samples/sec      accuracy=0.201300
INFO:root:Epoch[6] Batch [400]  Speed: 6810.13 samples/sec      accuracy=0.203800
INFO:root:Epoch[6] Batch [500]  Speed: 6810.12 samples/sec      accuracy=0.199600
INFO:root:Epoch[6] Train-accuracy=0.188485
INFO:root:Epoch[6] Time cost=8.826
INFO:root:Epoch[6] Validation-accuracy=0.191500
INFO:root:Epoch[7] Batch [100]  Speed: 6809.97 samples/sec      accuracy=0.192178
INFO:root:Epoch[7] Batch [200]  Speed: 6810.14 samples/sec      accuracy=0.195700
INFO:root:Epoch[7] Batch [300]  Speed: 6810.12 samples/sec      accuracy=0.192200
INFO:root:Epoch[7] Batch [400]  Speed: 6521.27 samples/sec      accuracy=0.198400
INFO:root:Epoch[7] Batch [500]  Speed: 6599.68 samples/sec      accuracy=0.191200
INFO:root:Epoch[7] Train-accuracy=0.194242
INFO:root:Epoch[7] Time cost=8.979
INFO:root:Epoch[7] Validation-accuracy=0.193200
INFO:root:Epoch[8] Batch [100]  Speed: 6605.41 samples/sec      accuracy=0.197525
INFO:root:Epoch[8] Batch [200]  Speed: 6527.83 samples/sec      accuracy=0.197900
INFO:root:Epoch[8] Batch [300]  Speed: 6627.89 samples/sec      accuracy=0.200800
INFO:root:Epoch[8] Batch [400]  Speed: 6668.57 samples/sec      accuracy=0.202500
INFO:root:Epoch[8] Batch [500]  Speed: 6599.34 samples/sec      accuracy=0.190000
INFO:root:Epoch[8] Train-accuracy=0.195051
INFO:root:Epoch[8] Time cost=9.073
INFO:root:Epoch[8] Validation-accuracy=0.208400
INFO:root:Epoch[9] Batch [100]  Speed: 6532.14 samples/sec      accuracy=0.200891
INFO:root:Epoch[9] Batch [200]  Speed: 6466.19 samples/sec      accuracy=0.200800
INFO:root:Epoch[9] Batch [300]  Speed: 6599.30 samples/sec      accuracy=0.199600
INFO:root:Epoch[9] Batch [400]  Speed: 6738.63 samples/sec      accuracy=0.205600
INFO:root:Epoch[9] Batch [500]  Speed: 6338.11 samples/sec      accuracy=0.198600
INFO:root:Epoch[9] Train-accuracy=0.197071
INFO:root:Epoch[9] Time cost=9.123
INFO:root:Epoch[9] Validation-accuracy=0.209500

using ctx=gpu() optimizer='nadam'

INFO:root:Epoch[0] Batch [100]  Speed: 14225.57 samples/sec     accuracy=0.296238
INFO:root:Epoch[0] Batch [200]  Speed: 14548.91 samples/sec     accuracy=0.412700
INFO:root:Epoch[0] Batch [300]  Speed: 14548.88 samples/sec     accuracy=0.434700
INFO:root:Epoch[0] Batch [400]  Speed: 14548.88 samples/sec     accuracy=0.423600
INFO:root:Epoch[0] Batch [500]  Speed: 14548.89 samples/sec     accuracy=0.423300
INFO:root:Epoch[0] Train-accuracy=0.424747
INFO:root:Epoch[0] Time cost=4.140
INFO:root:Epoch[0] Validation-accuracy=0.418100
INFO:root:Epoch[1] Batch [100]  Speed: 14548.89 samples/sec     accuracy=0.432772
INFO:root:Epoch[1] Batch [200]  Speed: 14548.89 samples/sec     accuracy=0.425300
INFO:root:Epoch[1] Batch [300]  Speed: 14225.57 samples/sec     accuracy=0.383200
INFO:root:Epoch[1] Batch [400]  Speed: 14548.90 samples/sec     accuracy=0.319800
INFO:root:Epoch[1] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.299300
INFO:root:Epoch[1] Train-accuracy=0.300707
INFO:root:Epoch[1] Time cost=4.155
INFO:root:Epoch[1] Validation-accuracy=0.271500
INFO:root:Epoch[2] Batch [100]  Speed: 14548.91 samples/sec     accuracy=0.275149
INFO:root:Epoch[2] Batch [200]  Speed: 14548.89 samples/sec     accuracy=0.325900
INFO:root:Epoch[2] Batch [300]  Speed: 14548.89 samples/sec     accuracy=0.357300
INFO:root:Epoch[2] Batch [400]  Speed: 14548.91 samples/sec     accuracy=0.347500
INFO:root:Epoch[2] Batch [500]  Speed: 14225.57 samples/sec     accuracy=0.361400
INFO:root:Epoch[2] Train-accuracy=0.342626
INFO:root:Epoch[2] Time cost=4.140
INFO:root:Epoch[2] Validation-accuracy=0.354200
INFO:root:Epoch[3] Batch [100]  Speed: 14548.91 samples/sec     accuracy=0.357426
INFO:root:Epoch[3] Batch [200]  Speed: 14548.88 samples/sec     accuracy=0.202400
INFO:root:Epoch[3] Batch [300]  Speed: 14548.88 samples/sec     accuracy=0.206400
INFO:root:Epoch[3] Batch [400]  Speed: 14548.89 samples/sec     accuracy=0.199300
INFO:root:Epoch[3] Batch [500]  Speed: 14548.92 samples/sec     accuracy=0.192800
INFO:root:Epoch[3] Train-accuracy=0.194141
INFO:root:Epoch[3] Time cost=4.124
INFO:root:Epoch[3] Validation-accuracy=0.193400
INFO:root:Epoch[4] Batch [100]  Speed: 14548.91 samples/sec     accuracy=0.192574
INFO:root:Epoch[4] Batch [200]  Speed: 13916.31 samples/sec     accuracy=0.192300
INFO:root:Epoch[4] Batch [300]  Speed: 14548.87 samples/sec     accuracy=0.206800
INFO:root:Epoch[4] Batch [400]  Speed: 14548.91 samples/sec     accuracy=0.203000
INFO:root:Epoch[4] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.200900
INFO:root:Epoch[4] Train-accuracy=0.194444
INFO:root:Epoch[4] Time cost=4.187
INFO:root:Epoch[4] Validation-accuracy=0.197000
INFO:root:Epoch[5] Batch [100]  Speed: 14548.88 samples/sec     accuracy=0.199604
INFO:root:Epoch[5] Batch [200]  Speed: 14225.58 samples/sec     accuracy=0.195100
INFO:root:Epoch[5] Batch [300]  Speed: 14225.59 samples/sec     accuracy=0.202500
INFO:root:Epoch[5] Batch [400]  Speed: 14548.88 samples/sec     accuracy=0.204800
INFO:root:Epoch[5] Batch [500]  Speed: 14548.89 samples/sec     accuracy=0.202400
INFO:root:Epoch[5] Train-accuracy=0.197071
INFO:root:Epoch[5] Time cost=4.155
INFO:root:Epoch[5] Validation-accuracy=0.206700
INFO:root:Epoch[6] Batch [100]  Speed: 14225.60 samples/sec     accuracy=0.196733
INFO:root:Epoch[6] Batch [200]  Speed: 14225.57 samples/sec     accuracy=0.195900
INFO:root:Epoch[6] Batch [300]  Speed: 14548.88 samples/sec     accuracy=0.205500
INFO:root:Epoch[6] Batch [400]  Speed: 14548.91 samples/sec     accuracy=0.206000
INFO:root:Epoch[6] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.204300
INFO:root:Epoch[6] Train-accuracy=0.194444
INFO:root:Epoch[6] Time cost=4.171
INFO:root:Epoch[6] Validation-accuracy=0.208300
INFO:root:Epoch[7] Batch [100]  Speed: 14225.60 samples/sec     accuracy=0.202772
INFO:root:Epoch[7] Batch [200]  Speed: 13620.24 samples/sec     accuracy=0.201000
INFO:root:Epoch[7] Batch [300]  Speed: 14225.58 samples/sec     accuracy=0.207600
INFO:root:Epoch[7] Batch [400]  Speed: 13916.32 samples/sec     accuracy=0.205500
INFO:root:Epoch[7] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.200200
INFO:root:Epoch[7] Train-accuracy=0.192323
INFO:root:Epoch[7] Time cost=4.249
INFO:root:Epoch[7] Validation-accuracy=0.191600
INFO:root:Epoch[8] Batch [100]  Speed: 14548.90 samples/sec     accuracy=0.198713
INFO:root:Epoch[8] Batch [200]  Speed: 14225.56 samples/sec     accuracy=0.196500
INFO:root:Epoch[8] Batch [300]  Speed: 13916.36 samples/sec     accuracy=0.204200
INFO:root:Epoch[8] Batch [400]  Speed: 14548.87 samples/sec     accuracy=0.201800
INFO:root:Epoch[8] Batch [500]  Speed: 14225.58 samples/sec     accuracy=0.197700
INFO:root:Epoch[8] Train-accuracy=0.194949
INFO:root:Epoch[8] Time cost=4.202
INFO:root:Epoch[8] Validation-accuracy=0.208800
INFO:root:Epoch[9] Batch [100]  Speed: 14548.89 samples/sec     accuracy=0.202376
INFO:root:Epoch[9] Batch [200]  Speed: 14225.59 samples/sec     accuracy=0.201400
INFO:root:Epoch[9] Batch [300]  Speed: 14548.89 samples/sec     accuracy=0.205800
INFO:root:Epoch[9] Batch [400]  Speed: 14225.57 samples/sec     accuracy=0.203800
INFO:root:Epoch[9] Batch [500]  Speed: 14548.89 samples/sec     accuracy=0.200600
INFO:root:Epoch[9] Train-accuracy=0.193939
INFO:root:Epoch[9] Time cost=4.171
INFO:root:Epoch[9] Validation-accuracy=0.206100

I first noticed this several days ago, but I put it down to an incorrect learning rate: in that example, simply deleting the explicit learning rate (falling back to the default) made things better.
But now the problem has come back.
With batch_size=10000, ctx=mx.gpu(), optimizer='nadam', using the default learning rate:

......
INFO:root:Epoch[25] Validation-accuracy=0.988000
INFO:root:Epoch[26] Train-accuracy=0.991017
INFO:root:Epoch[26] Time cost=0.984
INFO:root:Epoch[26] Validation-accuracy=0.986100
INFO:root:Epoch[27] Train-accuracy=0.987683
INFO:root:Epoch[27] Time cost=0.969
INFO:root:Epoch[27] Validation-accuracy=0.976800
INFO:root:Epoch[28] Train-accuracy=0.786283
INFO:root:Epoch[28] Time cost=0.984
INFO:root:Epoch[28] Validation-accuracy=0.106700
INFO:root:Epoch[29] Train-accuracy=0.103833
INFO:root:Epoch[29] Time cost=0.984
INFO:root:Epoch[29] Validation-accuracy=0.089200
INFO:root:Epoch[30] Train-accuracy=0.097083
INFO:root:Epoch[30] Time cost=0.984
INFO:root:Epoch[30] Validation-accuracy=0.100900
INFO:root:Epoch[31] Train-accuracy=0.102583
INFO:root:Epoch[31] Time cost=0.984
......

#2

@Neutron can you share your entire training code so I can try to reproduce it and provide you with tips to solve your issue?

It does look like your learning rate could be too high and the training is diverging.
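
For example, you could pass a smaller learning rate through optimizer_params when fitting (a quick sketch reusing the Module API from this thread; 1e-4 is just an arbitrary value below nadam's 0.001 default):

# Sketch: same fit call as before, but with an explicitly lowered learning rate.
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',
              optimizer_params={'learning_rate': 1e-4},  # below the 0.001 default
              eval_metric='acc',
              num_epoch=10)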


#3

I'm sorry for replying so late. Perhaps because I changed from mxnet-cu91 to mxnet-cu92 (with the newest CUDA 9.2 installed), I can no longer reproduce that result.
However, the code below easily shows that the nadam optimizer is not stable.

import mxnet as mx
mnist = mx.test_utils.get_mnist()

# Fix the seed
mx.random.seed(1)

# Set the compute context: GPU if available, otherwise CPU
ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

batch_size = 10000
train_iter = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
val_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)

data = mx.sym.var('data')
# Flatten the data from 4-D shape into 2-D (batch_size, num_channel*width*height)
data = mx.sym.flatten(data=data)

# The first fully-connected layer and the corresponding activation function
fc1  = mx.sym.FullyConnected(data=data, num_hidden=128)
act1 = mx.sym.Activation(data=fc1, act_type="relu")

# The second fully-connected layer and the corresponding activation function
fc2  = mx.sym.FullyConnected(data=act1, num_hidden=64)
act2 = mx.sym.Activation(data=fc2, act_type="relu")

# MNIST has 10 classes
fc3  = mx.sym.FullyConnected(data=act2, num_hidden=10)
# Softmax with cross entropy loss
mlp  = mx.sym.SoftmaxOutput(data=fc3, name='softmax')

import logging
logging.getLogger().setLevel(logging.DEBUG)  # logging to stdout
# create a trainable module on compute context
mlp_model = mx.mod.Module(symbol=mlp, context=ctx)
mlp_model.fit(train_iter,  # train data
              eval_data=val_iter,  # validation data
              optimizer='nadam',  # use nadam to train
              eval_metric='acc',  # report accuracy during training
              batch_end_callback=mx.callback.Speedometer(batch_size, 2),  # log progress every 2 batches (20,000 samples)
              num_epoch=1000)  # train for at most 1000 dataset passes

test_iter = mx.io.NDArrayIter(mnist['test_data'], None, batch_size)
prob = mlp_model.predict(test_iter)
assert prob.shape == (10000, 10)

test_iter = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)
# predict accuracy of mlp
acc = mx.metric.Accuracy()
mlp_model.score(test_iter, acc)
print(acc)
assert acc.get()[1] > 0.96, "Achieved accuracy (%f) is lower than expected (0.96)" % acc.get()[1]

It produces output like this:

...
INFO:root:Epoch[24] Validation-accuracy=0.959500
INFO:root:Epoch[25] Batch [2]   Speed: 1280312.58 samples/sec   accuracy=0.964833
INFO:root:Epoch[25] Batch [4]   Speed: 640161.17 samples/sec    accuracy=0.964200
INFO:root:Epoch[25] Train-accuracy=0.960800
INFO:root:Epoch[25] Time cost=0.094
INFO:root:Epoch[25] Validation-accuracy=0.932800
INFO:root:Epoch[26] Batch [2]   Speed: 640151.40 samples/sec    accuracy=0.880433
INFO:root:Epoch[26] Batch [4]   Speed: 640151.40 samples/sec    accuracy=0.605650
INFO:root:Epoch[26] Train-accuracy=0.626700
INFO:root:Epoch[26] Time cost=0.078
INFO:root:Epoch[26] Validation-accuracy=0.699800
INFO:root:Epoch[27] Batch [2]   Speed: 639741.32 samples/sec    accuracy=0.795633
INFO:root:Epoch[27] Batch [4]   Speed: 640146.52 samples/sec    accuracy=0.940700
...

Here we can see that accuracy drops significantly at epoch 26.
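
To pin down what happens around the drop, one option (a sketch, not part of the run above) is to checkpoint after every epoch, so the parameters just before and after epoch 26 can be reloaded and compared:

# Sketch: save symbol + parameters at the end of each epoch; a saved
# epoch can later be reloaded with mx.mod.Module.load('mlp-nadam', epoch).
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',
              eval_metric='acc',
              epoch_end_callback=mx.callback.do_checkpoint('mlp-nadam'),
              num_epoch=30)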


#4

The nadam optimizer comes with some parameters that might need tweaking for better results:

    beta1 : float, optional
        Exponential decay rate for the first moment estimates.
    beta2 : float, optional
        Exponential decay rate for the second moment estimates.
    epsilon : float, optional
        Small value to avoid division by 0.
    schedule_decay : float, optional
        Exponential decay rate for the momentum schedule

Default values:

learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, schedule_decay=0.004

The learning rate seems to be ignored in the update step function, as it should be, so you can ignore it too.
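
All of these can be overridden through optimizer_params (a sketch; the larger epsilon is just one common thing to try when Adam-family optimizers behave unstably, not a verified fix for this case):

# Sketch: overriding nadam's hyperparameters explicitly.
mlp_model.fit(train_iter,
              eval_data=val_iter,
              optimizer='nadam',
              optimizer_params={'beta1': 0.9,
                                'beta2': 0.999,
                                'epsilon': 1e-4,   # raised from the 1e-8 default
                                'schedule_decay': 0.004},
              eval_metric='acc',
              num_epoch=30)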