As long as you follow the steps in the multi-GPU tutorial, the extra steps are exactly the same: set `grad_req` to `'add'`, run forward/backward multiple times per `trainer.step()`, and then call `zero_grad()`.
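To make the difference between the two `grad_req` modes concrete, here is a minimal plain-Python sketch. It is not MXNet code: `GradBuffer` and `grad_of_squared_error` are hypothetical stand-ins that only mimic how a gradient buffer behaves under `'write'` versus `'add'`.

```python
# Plain-Python stand-in for a parameter's gradient buffer (hypothetical, not MXNet).
class GradBuffer:
    def __init__(self, grad_req="write"):
        self.grad_req = grad_req
        self.grad = 0.0

    def backward(self, new_grad):
        if self.grad_req == "add":
            self.grad += new_grad  # accumulate across backward passes
        else:
            self.grad = new_grad   # 'write': each backward pass overwrites

    def zero_grad(self):
        self.grad = 0.0


def grad_of_squared_error(w, x, y):
    """Derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


w = 0.5
batches = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]

written = GradBuffer("write")
added = GradBuffer("add")
for x, y in batches:
    g = grad_of_squared_error(w, x, y)
    written.backward(g)
    added.backward(g)

print(written.grad)  # -21.0, only the last batch survives
print(added.grad)    # -32.0, the sum over all three batches
```

With `'add'` nothing ever resets the buffer for you, which is why the explicit `zero_grad()` after each update is needed.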

# Aggregate gradients manually over n batches

Thanks @safrooze, I tried it and it’s working. This is great help. I went from batch size 32 --> 256, makes a big difference in stability of training. Again, thanks!

Hi, I am using a self-defined RNN class with NDArray parameters and a forward function, relying on Gluon's autograd.

Is it possible to aggregate gradients in this case? My model is not a Gluon model, so `grad_req` cannot be set on it directly.

Hi @ShootingSpace, could you please post your model code? It will help. From what I see, NDArray has the method `attach_grad`, which takes a `grad_req` option (default: `write`, but there is an `add` option as well). So in principle, you should be able to do the same thing.

Thanks for your suggestion! The problem is that the loss becomes `nan`. Do you know if we need to add the losses together over n batches before calling `backward()`?

My GRU class, borrowed from the Gluon tutorial:

```
import mxnet as mx
from mxnet import nd

# Note: `softmax` is assumed to be the temperature-aware helper from the tutorial (not shown here).
class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)
        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden
        ########################
        # Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        ########################
        # Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        ########################
        # Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                       self.bz, self.br, self.bh, self.Why, self.by]

    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g
            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)
```

```
def cross_entropy(yhat, y):
    return - nd.mean(nd.sum(y * nd.log(yhat), axis=0, exclude=True))

def average_ce_loss(outputs, labels):
    '''Averaging the loss over the sequence'''
    assert(len(outputs) == len(labels))
    # `context` is assumed to be defined elsewhere in the script
    total_loss = nd.array([0.], ctx=context)
    for (output, label) in zip(outputs,labels):
        total_loss = total_loss + cross_entropy(output, label)
    return total_loss / len(outputs)
```

Hi,

I am not sure that what I suggest will work 100%, but it’s easy to give it a try. If I were you, I’d change my model to use Gluon directly, from here.

As for your example, I think you are missing a line in your model definition, this one:

```
for param in self.params:
    param.attach_grad(grad_req='add')
```

I would give it a try with the following modifications (based on the tutorial you followed).

Modification in your model:

```
class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)
        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden
        ########################
        # Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        ########################
        # Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        ########################
        # Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01
        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                       self.bz, self.br, self.bh, self.Why, self.by]
        # @@@@@@@@@@@ MODIFICATION HERE @@@@@@@@@@@
        for param in self.params:
            param.attach_grad(grad_req='add') # This tells mxnet to add the gradients
        # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g
            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)
```

Now, in the example, the SGD update runs in every iteration, but we need to manually add a `delay_rate` so that the update happens only every N iterations, once enough gradients have been aggregated. So I am modifying `SGD` in the example like this:

```
delay_rate = 4 # This says to aggregate over 4 batch iterations before updating

# modified SGD that takes the average of the gradients
def SGD(params, lr, _delay_rate):
    for param in params:
        param[:] = param - (lr / _delay_rate) * param.grad
```

In the initial example, `delay_rate = 1` is effectively the default behaviour, but if you aggregate gradients over, say, 4 iterations, you need to divide their magnitude by 4.
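The reason for the division can be checked with a small plain-Python sketch (no MXNet involved). It assumes a mean-reduced loss and equally sized batches; the toy model, data, and helper names are made up for illustration. Under those assumptions, summing the gradients of 4 small batches and scaling the learning rate by `1 / delay_rate` gives the same update as one batch that is 4 times larger.

```python
def grad(w, x, y):
    """Derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def mean_grad(w, batch):
    """Gradient of the mean loss over one batch."""
    return sum(grad(w, x, y) for x, y in batch) / len(batch)


lr = 0.1
delay_rate = 4
w0 = 0.5
small_batches = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 5.0), (4.0, 7.0)],
    [(5.0, 9.0), (6.0, 11.0)],
    [(7.0, 13.0), (8.0, 15.0)],
]

# Accumulate gradients over delay_rate small batches, as grad_req='add' would.
accumulated = 0.0
for batch in small_batches:
    accumulated += mean_grad(w0, batch)
w_accumulated = w0 - (lr / delay_rate) * accumulated

# Reference: a single update on one batch that is delay_rate times larger.
big_batch = [sample for batch in small_batches for sample in batch]
w_big_batch = w0 - lr * mean_grad(w0, big_batch)

print(abs(w_accumulated - w_big_batch) < 1e-12)
```

If the batches differ in size, the two updates no longer match exactly, because the mean over the large batch weights every sample equally while the accumulated version weights every batch equally.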

Then in the training loop I would replace the line:

```
SGD(params, learning_rate)
```

with the following (assuming you’ve defined an instance of the class GRU with the name `net` somewhere):

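```
# Assumes `i` is the zero-based batch index of the training loop
if (i + 1) % delay_rate == 0: # update every delay_rate iterations
    SGD(net.params, learning_rate, delay_rate)
    # Now manually zero the grads (in place, since the params are plain NDArrays)
    for param in net.params:
        param.grad[:] = 0
```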
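As a quick sanity check on the update schedule, here is a small plain-Python sketch. The helper names are hypothetical, and it assumes a zero-based batch index. A modulo test fires every `delay_rate` batches, whereas a division test such as `i / delay_rate == 0` is only true for the very first batch.

```python
def update_indices(num_batches, delay_rate):
    """Zero-based batch indices after which a parameter update would run."""
    return [i for i in range(num_batches) if (i + 1) % delay_rate == 0]


def leftover_batches(num_batches, delay_rate):
    """Batches at the end of an epoch whose gradients never reach an update."""
    return num_batches % delay_rate


# For comparison: a division test is only true for the very first batch.
division_hits = [i for i in range(10) if i / 4 == 0]

print(update_indices(10, 4))    # [3, 7]
print(leftover_batches(10, 4))  # 2
print(division_hits)            # [0]
```

With 10 batches and `delay_rate = 4`, the last 2 batches accumulate gradients that are never applied unless you add a final update at the end of the epoch or carry them over into the next one.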

hope this helps. By the way I am a newbie in RNNs, just started learning, so I don’t know if what I say needs modifications for your model.

Cheers

Thanks for suggesting an implementation of the delayed SGD.

You are right, I had better switch to Gluon layers.

Hi, I tried this gradient aggregation trick and everything worked well.

But in a case where I wrote a new custom block that uses `softmax()` to squash the learnable weights into the range 0~1, I got an error after calling `trainer.step()`, which read:

/softmax-inl.h:267: Check failed: req[0] != kAddTo (3 vs. 3)

Removing the `softmax()` line from the code fixes the error, but that isn’t what I want. Is anything wrong in my code?

```
def forward(self,x):
    p = nd.softmax(self._rank_p.data(), axis=0)
    .......
    x = nd.dot(x, p)
    return x

def train():
    net = net()
    for p in net.collect_params().values():
        p.grad_req = 'add'
    .......
    iter_time = 1
    for data, label in train_data:
        with autograd.record():
            output = net(data)
            loss = loss_fun(output, label)
        loss.backward()
        if iter_time % BATCH_UPDATE_PERIOD == 0:
            trainer.step(BATCH_SIZE * BATCH_UPDATE_PERIOD)
            for p in net.collect_params().values():
                p.zero_grad()
        elif iter_time == len(train_data):
            trainer.step(BATCH_SIZE * (len(train_data) % BATCH_UPDATE_PERIOD))
            for param in net.collect_params().values():
                param.zero_grad()
        train_loss += nd.mean(loss).asscalar()
        iter_time += 1
    ......
```

Hi @Ghostish - You are correct that `nd.softmax()` doesn’t support `grad_req='add'`. However, I just tested `nd.SoftmaxActivation`, which basically does the same thing (but is supposed to be deprecated!), and it does support `grad_req='add'`. So for now, you can use `nd.SoftmaxActivation()`. I’ve reported the issue here.

Thanks a lot for the solution

Is there any way to do this trick using the `Module API` instead of Gluon? I find that setting the gradient to zero is not that straightforward when using the `Module API`.

Once you create a module, you can use `mod._exec_group.grad_arrays` to access the gradients. Each element in `grad_arrays` is itself a list of NDArrays. To set an NDArray to zero, just do `x[:] = 0` or `x *= 0`.
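The important detail is that the zeroing happens in place, so whatever else holds the gradient buffers keeps seeing the same objects. Below is a rough plain-Python sketch of that idea. Nested lists stand in for `grad_arrays`, the layout of one inner list per parameter is an assumption, and `zero_in_place` is a hypothetical helper. Plain lists need `g[:] = [...]` rather than the `x[:] = 0` that NDArrays accept.

```python
# Stand-in for grad_arrays: an outer list whose elements are lists of buffers.
grad_arrays = [[[0.5, -1.0]], [[2.0]]]

# Something else keeps a reference to the same buffer, as an executor would.
held = grad_arrays[0][0]


def zero_in_place(arrays):
    """Overwrite every buffer's contents without replacing the buffer objects."""
    for per_param in arrays:
        for g in per_param:
            g[:] = [0.0] * len(g)


zero_in_place(grad_arrays)

print(held)                       # [0.0, 0.0]
print(held is grad_arrays[0][0])  # True, still the same object
```

Rebinding a name instead (for example `g = [0.0]` inside the loop) would leave the original buffers untouched, which is why slice assignment is used.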

Thanks for the solution. I never thought I could do it like this. By the way, could you tell me how you find these “private APIs”? I searched the official documentation but found nothing relevant.

Module API is less flexible than Gluon, so you have to read the Python code of MXNet to get a better understanding. I recommend cloning the MXNet repo and using PyCharm to traverse the code if you choose to do so. In this particular case, this is what I did:

- Looked at `BaseModule.fit()` to see how a training loop updates parameters.
- Looked at `Module.update()` to see the details of the parameter update. Noticed that gradients are accessed through `self._exec_group.grad_arrays`.
- I already knew how you can set the contents of a gradient to zero. Alternatively, you can also check out the implementation of `Parameter.zero_grad()` for inspiration, which I just did, and realized that it uses yet another way to set gradient NDArrays to zero: `ndarray.zeros_like(i, out=i)`

I really appreciate the suggestion and the help. It’s so good to have people like you in the community.

Hi! I think it is a bit more complicated than this in practice, right?

For example, when I do what you propose on GluonCV SSD I get this: “UserWarning: Gradient of Parameter `ssd0_expand_trans_bn0_moving_mean` on context gpu(0) has not been updated by backward since last `step`. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient”

How should I handle this?

This fix seems to work! See: About stale gradient

```
for p in net.collect_params().values():
if p.grad_req != 'null':
p.grad_req = 'add'
```

Actually, it is not training anything. See: How to make gradient accumulation work in MXNet? If someone can help, that would be appreciated!

Hi, this error should not be related to setting `grad_req` to `'add'`; I’ve encountered it many times when I wasn’t calculating the loss properly. Please try the same code without the `grad_req='add'` trick to see if you get the same error.