Gradient clipping for RNNs on multiple GPUs

Hey guys, I’m wondering if this is the right way to clip gradients over multiple GPUs (using data parallelism)?

grads = [p.grad(ctx) for ctx in ctxs for p in model.collect_params().values()]
gluon.utils.clip_global_norm(grads, args.clipping_theta * seq_len * batch_size)
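
For context, here is roughly where those two lines sit in my training step (loss_fn, data, label and trainer stand in for my actual objects):

from mxnet import autograd, gluon

data_parts = gluon.utils.split_and_load(data, ctxs)
label_parts = gluon.utils.split_and_load(label, ctxs)
with autograd.record():
    # one forward pass per GPU in ctxs
    losses = [loss_fn(model(X), y) for X, y in zip(data_parts, label_parts)]
for l in losses:
    l.backward()
# gather every context's gradient copy and clip the global norm in place
grads = [p.grad(ctx) for ctx in ctxs for p in model.collect_params().values()]
gluon.utils.clip_global_norm(grads, args.clipping_theta * seq_len * batch_size)
trainer.step(batch_size)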

Hi @ShootingSpace,

You should be able to set up gradient clipping through the Optimizer given to the Trainer object. Check out the clip_gradient argument of Optimizer; Trainer passes it through via optimizer_params as follows:

import mxnet

trainer = mxnet.gluon.Trainer(net.collect_params(), optimizer='sgd',
                              optimizer_params={'learning_rate': 0.1, 'clip_gradient': 5},
                              kvstore='device')  # kvstore='device' aggregates gradients on GPU

All should be fine across multiple GPUs. Note that clip_gradient clips each gradient element to the range [-5, 5] in this example (element-wise value clipping), which is a different scheme from the global-norm clipping in your snippet.
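
To make that concrete, here is a rough sketch of a data-parallel step with that trainer (net, loss_fn, ctxs, train_iter and batch_size stand in for your own objects); there is no manual clipping call:

from mxnet import autograd, gluon

for data, label in train_iter:
    data_parts = gluon.utils.split_and_load(data, ctxs)
    label_parts = gluon.utils.split_and_load(label, ctxs)
    with autograd.record():
        losses = [loss_fn(net(X), y) for X, y in zip(data_parts, label_parts)]
    for l in losses:
        l.backward()
    # the optimizer clips each gradient element to [-5, 5] during this update
    trainer.step(batch_size)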