Lower accuracy on Cifar10 with multi-gpu implementation


#1

Does anyone know what's wrong with the following code? Could it have anything to do with recreating gluon.Trainer on every batch?

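import datetime
import sys

import mxnet as mx
from mxnet import autograd, gluon, nd
from mxnet.gluon import utils as gutils

sys.path.append('..')
import utils  # local helper module providing accuracy() and evaluate_accuracy()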

def multi_gpu_train_batch(net,data,label,batch_size,lr,momentum,weight_decay,contexts): 

    trainer=gluon.Trainer(net.collect_params(),'sgd',{'learning_rate':lr,'momentum':momentum,'wd':weight_decay})
    data_list=gutils.split_and_load(data,contexts)
    label_list=gutils.split_and_load(label,contexts)
    ls=[]

    with autograd.record():
        i=0
        for each_data,each_label in zip(data_list,label_list):
            output=net(each_data)
            loss=criterion(output,each_label)
            ls.append(loss)
            if i==0:
                outputs=output
                losses=loss
            else:
                outputs=nd.concat(outputs,output.as_in_context(outputs.context),dim=0)
                losses=nd.concat(losses,loss.as_in_context(losses.context),dim=0)
            i=i+1

    for l in ls:
        l.backward()
    trainer.step(batch_size)

    return outputs,losses

def multi_gpu_train(net,train_data,valid_data,num_epochs,batch_size,lr,momentum,weight_decay,contexts,lr_period,lr_decay):

    prev_time=datetime.datetime.now()
    i=0  # index of the next learning-rate milestone in lr_period
    length=len(lr_period)
    for epoch in range(num_epochs):
        train_loss=0.0
        train_acc=0.0
        # decay the learning rate when the epoch reaches the next milestone
        if epoch>0 and i<length and epoch==lr_period[i]:
            lr=lr*lr_decay
            i=i+1
        for data,label in train_data:
            output,loss=multi_gpu_train_batch(net,data,label,batch_size,lr,momentum,weight_decay,contexts)
            nd.waitall()
            train_loss += nd.mean(loss).asscalar()
            train_acc += utils.accuracy(output.as_in_context(mx.cpu(0)),label)
        cur_time=datetime.datetime.now()
        h,remainder = divmod((cur_time-prev_time).seconds,3600)
        m,s=divmod(remainder,60)
        time_str=" Time : %02dhour %02dmin %02dsec" % (h,m,s)

        if valid_data is not None:
            valid_acc = utils.evaluate_accuracy(valid_data,net,contexts)
            epoch_str= ("epoch:%d ,loss:%f ,train_acc:%f , valid_acc:%f" %
                (epoch,train_loss/len(train_data),train_acc/len(train_data),valid_acc))
        else:
            epoch_str= ("epoch:%d ,loss:%f ,train_acc:%f" %
                (epoch,train_loss/len(train_data),train_acc/len(train_data)))

        prev_time=cur_time
        print(time_str+epoch_str+' learning_rate:'+str(lr))

I get around a 6% boost in accuracy on Cifar10 by using a very simple train function instead:

def train(net, train_data, valid_data, num_epochs, lr, wd, ctx, lr_period,
      lr_decay):
     trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': lr, 'momentum': 0.9, 'wd': wd})
     prev_time = datetime.datetime.now()
     for epoch in range(num_epochs):
         if epoch > 0 and epoch % lr_period == 0:
             trainer.set_learning_rate(trainer.learning_rate * lr_decay)
         for X, y in train_data:
             y = y.astype('float32').as_in_context(ctx)
             with autograd.record():
                 y_hat = net(X.as_in_context(ctx))
                 l = loss(y_hat, y)
             l.backward()
             trainer.step(batch_size)

#2

Hi @JWarlock,

Could you reformat your code? It's a little tricky to work through without indentation. I can't tell what's going on with i=i+1, for example. Use ``` before and after your code block.

Could you also provide the hyperparameters you've been using for these tests, and highlight any differences between the two runs (e.g. batch size and learning rate)?

Cheers,

Thom


#3

Sorry, I'm not very familiar with the formatting here. I've reformatted the code.
The two tests shared the same hyperparameters. Simply changing the training function results in the accuracy boost. That's why I'm confused.

I've listed the hyperparameters below:

batch_size = 128
net = DenseNet(growthRate=12, depth=100, reduction=0.5, bottleneck=True, nClasses=10)
loss_f = gluon.loss.SoftmaxCrossEntropyLoss()
num_epochs = 255
learning_rate = 0.1
momentum=0.9
weight_decay = 1e-4
lr_period = [150, 225]
lr_decay=0.1

#4

I'm sorry for my absence during the last few days.
Also, if you have any other questions, just ask.


#5

Eyeballing your code: if you recreate the trainer on every batch, you negate the use of momentum in your optimization, because the momentum is accumulated between batches and that accumulated state lives in the trainer.
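
To make the momentum point concrete, here is a minimal pure-Python sketch. It is not the Gluon implementation: the one-dimensional objective 0.5*w**2 and the helper names (`sgd_momentum_step`, `run`, `run_vanilla`) are made up for illustration. It uses the common update rule velocity = momentum*velocity - lr*grad and shows that zeroing the velocity before every step, which is effectively what a freshly created trainer does, gives the same weights as plain SGD.

```python
def sgd_momentum_step(w, grad, velocity, lr, momentum):
    """One SGD-with-momentum update; returns (new_w, new_velocity)."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


def run(steps, lr=0.1, momentum=0.9, reset_state_each_step=False):
    """Minimise 0.5 * w**2 starting from w = 1.0 (toy objective, assumed)."""
    w, velocity = 1.0, 0.0
    for _ in range(steps):
        if reset_state_each_step:
            # Mimics building a new optimizer every batch:
            # the accumulated velocity is thrown away.
            velocity = 0.0
        grad = w  # gradient of 0.5 * w**2
        w, velocity = sgd_momentum_step(w, grad, velocity, lr, momentum)
    return w


def run_vanilla(steps, lr=0.1):
    """Plain SGD on the same toy objective, for comparison."""
    w = 1.0
    for _ in range(steps):
        w -= lr * w
    return w


if __name__ == "__main__":
    print("momentum kept :", run(5))
    print("state reset   :", run(5, reset_state_each_step=True))
    print("vanilla SGD   :", run_vanilla(5))
```

The "state reset" and "vanilla SGD" trajectories coincide, while the run that keeps its velocity follows a different path.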

Also, looking at your code, you might be interested in the built-in learning rate scheduler.
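
One thing to watch for if you go that route: as far as I know, MXNet's schedulers (e.g. `mx.lr_scheduler.MultiFactorScheduler`, which takes `step` and `factor`) count optimizer updates rather than epochs, so the epoch milestones from this thread would need converting first. The sketch below is plain Python, not MXNet code. The helper names are hypothetical, the 50000 training samples are an assumption (the full CIFAR-10 training set; use your own count if you hold out a validation split), and `lr_at_update` only imitates a multi-step decay for checking the numbers.

```python
def epochs_to_update_steps(epoch_milestones, num_samples, batch_size):
    """Convert epoch milestones into update-count milestones."""
    batches_per_epoch = -(-num_samples // batch_size)  # ceiling division
    return [epoch * batches_per_epoch for epoch in epoch_milestones]


def lr_at_update(num_update, base_lr, steps, factor):
    """Learning rate after `num_update` updates under a multi-step decay.

    The rate is multiplied by `factor` once for every milestone
    that has already been passed.
    """
    lr = base_lr
    for step in steps:
        if num_update > step:
            lr *= factor
    return lr


if __name__ == "__main__":
    # lr_period = [150, 225], batch_size = 128 as posted above;
    # num_samples = 50000 is an assumption.
    steps = epochs_to_update_steps([150, 225], num_samples=50000, batch_size=128)
    print(steps)
    for n in (1, steps[0] + 1, steps[1] + 1):
        print(n, lr_at_update(n, base_lr=0.1, steps=steps, factor=0.1))
```

The resulting list is what you would pass as the scheduler's step milestones, with the scheduler itself handed to the trainer once through its optimizer parameters.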


#6

Aha!
If I recreate the trainer on every batch, I'm actually using vanilla SGD rather than SGD with momentum.
So I just need to create the trainer once in the multi_gpu_train function and pass it into the multi_gpu_train_batch function.
Thank you! That really helps a lot.