Hey All,
I keep getting this error:

"UserWarning: Gradient of Parameter lstnet0_conv0_weight on context cpu(1) has not been updated by backward since last step. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient"
This only happens when I run SageMaker training jobs on 1 or 2 CPU instances, using the Gluon Trainer API and launching with the SageMaker MXNet estimator. It never happens on a GPU instance (I have only used a single GPU instance) with the same code, or when training locally on the SageMaker notebook instance.
The error occurs on the trainer.step line, but I have no idea why it happens there when both local and GPU training work perfectly. Is there a bug in my code, and how can I debug this error?
Some additional info:
MXNet version: 1.2
Code:
trainer = gluon.Trainer(net.collect_params(),
                        kvstore=store,
                        optimizer='adam',
                        optimizer_params={'learning_rate': hyperparameters['learning_rate'],
                                          'clip_gradient': hyperparameters['clip_gradient']})

batch_size = hyperparameters['batch_size']
train_data_loader = gluon.data.DataLoader(
    ts_data_train.train, batch_size=batch_size, shuffle=True,
    num_workers=2, last_batch='discard')
test_data_loader = gluon.data.DataLoader(
    ts_data_test.train, batch_size=batch_size, shuffle=True,
    num_workers=2, last_batch='discard')

epochs = hyperparameters['epochs']
print("Training Start")
metric = mx.metric.RMSE()
tic = time.time()

for e in range(epochs):
    metric.reset()
    epoch_start_time = time.time()
    for data, label in train_data_loader:
        l1 = gluon.loss.L1Loss()
        data = data.as_in_context(ctx[0])
        label = label.as_in_context(ctx[0])
        with autograd.record():
            z = net(data)
            loss = l1(z, label)
        loss.backward()
        trainer.step(batch_size)
        # trainer.step(batch_size, ignore_stale_grad=True)
        # trainer.allreduce_grads()
        # trainer.update(False)
        metric.update(label, z)