SageMaker CPU Training: Gradient of Parameter `lstnet0_conv0_weight` on context cpu(1) has not been updated by backward since last `step`

#1

Hey All,

I keep getting this error
“UserWarning: Gradient of Parameter lstnet0_conv0_weight on context cpu(1) has not been updated by backward since last step. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient”

This only happens when I use SageMaker jobs with 1 or 2 CPU instances, using the Gluon Trainer API and calling it via the SageMaker MXNet estimator. It never happens when I use a GPU instance (I have only tried one GPU instance) with the same code, or when training locally on the SageMaker notebook instance.

The error occurs on the trainer.step line, but I have no idea why it happens when both local and GPU training work perfectly. Is there a bug in the code, and how can I debug this error?

Some additional info:
MXNet version: 1.2

Code:

    trainer = gluon.Trainer(net.collect_params(),
        kvstore=store,
        optimizer='adam',
        optimizer_params={'learning_rate': hyperparameters['learning_rate'], 'clip_gradient': hyperparameters['clip_gradient']})

    batch_size = hyperparameters['batch_size']
    train_data_loader = gluon.data.DataLoader(
        ts_data_train.train, batch_size=batch_size, shuffle=True, num_workers=2, last_batch='discard')
    test_data_loader = gluon.data.DataLoader(
        ts_data_test.train, batch_size=batch_size, shuffle=True, num_workers=2, last_batch='discard')

    epochs = hyperparameters['epochs']
    print("Training Start")
    metric = mx.metric.RMSE()
    tic = time.time()
    l1 = gluon.loss.L1Loss()  # create the loss once, outside the loop
    for e in range(epochs):
        metric.reset()
        epoch_start_time = time.time()
        for data, label in train_data_loader:
            data = data.as_in_context(ctx[0])
            label = label.as_in_context(ctx[0])
            with autograd.record():
                z = net(data)
                loss = l1(z, label)
            loss.backward()  # backward does not need to be inside record()
            trainer.step(batch_size)
            #trainer.step(batch_size, ignore_stale_grad=True)
            #trainer.allreduce_grads()
            #trainer.update(False)
            metric.update(label, z)

#2

Hi @Nell,

Can you provide the code where the context (ctx) is set?

I see that your context is a list because you're using ctx[0]. But for CPU I'd expect ctx = mx.cpu(). Make sure you don't have a context list with one entry per core (e.g. [mx.cpu(0), mx.cpu(1)]), but instead only mx.cpu(). To use multiple CPU cores, install the MKL build of MXNet (pip install mxnet-mkl); it parallelizes across cores even when the context is simply mx.cpu().

#3

Hey Thom,

Here is the code that runs right before the code above:

    ctx = [mx.cpu(i) for i in range(num_cpus)]
    if num_gpus > 0:
        ctx = [mx.gpu(i) for i in range(num_gpus)]
    print('Running on {}'.format(ctx))
    print('Hosts {}'.format(hosts))
    print('Current Host {}'.format(current_host))

    net = LSTNet(
        num_series=ts_data_train.num_series,
        conv_hid=hyperparameters['conv_hid'],
        gru_hid=hyperparameters['gru_hid'],
        skip_gru_hid=hyperparameters['skip_gru_hid'],
        skip=hyperparameters['skip'],
        ar_window=hyperparameters['ar_window'])

    net.initialize(init=mx.init.Xavier(factor_type="in", magnitude=2.34), ctx=ctx)

    kvstore = 'local'
    if len(hosts) == 1:
        kvstore = 'device' if num_gpus > 0 else 'local'
    else:
        kvstore = 'dist_device_sync' if num_gpus > 0 else 'dist_sync'

    print('kvstore {}'.format(kvstore))
    store = kv.create(kvstore)
    print("Total number of workers: %d" % store.num_workers)
    print("This worker's rank: %d" % store.rank)

#4

So I think this is where your issue is. You don't need num_cpus; just set ctx = mx.cpu() to use the CPU. But even better than that, try this:

    ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()

You didn't have this issue on GPU instances because you set the context accordingly for that case. The line above does two things in one: it uses the GPU if available, and otherwise falls back to the CPU (all cores, if you have mxnet-mkl). The reason for the warning is that with ctx = [mx.cpu(0), mx.cpu(1)], initialize() keeps a copy of each parameter on both contexts, but your training loop only runs forward/backward on ctx[0], so the gradients on cpu(1) are never refreshed; that is exactly what the stale-gradient warning reports.

#5

That worked!!!

Thanks Thom! you are awesome!