Hey guys, I’m wondering if this is the right way to clip gradients across multiple GPUs (using data parallelism)?
grads = [p.grad(ctx) for ctx in ctxs for p in model.collect_params().values()]
gluon.utils.clip_global_norm(grads, args.clipping_theta * seq_len * batch_size)
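For context, this runs right after the backward pass, in a step roughly like the one below (just a sketch; loss_fn, data_shards, label_shards, and trainer stand in for things defined elsewhere in my script):

from mxnet import autograd

# Rough sketch of the surrounding training step; the names are placeholders.
with autograd.record():
    losses = [loss_fn(model(X), y)
              for X, y in zip(data_shards, label_shards)]
for l in losses:
    l.backward()
# ...the clipping lines above go here, then:
trainer.step(batch_size)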
Hi @ShootingSpace,
You should be able to configure gradient clipping through the Optimizer given to the Trainer object. Check out the clip_gradient
argument of Optimizer; you can pass it through Trainer's optimizer_params,
as follows:
import mxnet

trainer = mxnet.gluon.Trainer(net.collect_params(), optimizer='sgd',
                              optimizer_params={'learning_rate': 0.1, 'clip_gradient': 5},
                              kvstore='device')  # 'device' aggregates gradients on the GPUs
That should work fine across multiple GPUs. One thing to note: clip_gradient clips element-wise, so in this example each gradient value is clamped to the range [-5, 5]; it does not rescale by the global norm the way clip_global_norm does.
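If you specifically want norm-based clipping as in your snippet, the pattern I've usually seen is to clip per context rather than pooling every device's gradients into one list. A rough sketch, assuming model, ctxs, trainer, and your clipping_theta * seq_len * batch_size scaling carry over from your code:

from mxnet import gluon

# Sketch only: model, ctxs, trainer, clipping_theta, seq_len, and batch_size
# are assumed to exist as in the snippet above, and loss.backward() has
# already been called on every context.
for ctx in ctxs:
    # Take this device's copy of every parameter gradient...
    grads = [p.grad(ctx) for p in model.collect_params().values()]
    # ...and rescale them in place if their combined L2 norm exceeds the limit.
    gluon.utils.clip_global_norm(grads, clipping_theta * seq_len * batch_size)
trainer.step(batch_size)

Each device holds the gradients for its own slice of the batch, so clipping per device keeps each norm computation local instead of mixing several copies of the same parameter into one norm.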