Aggregate gradients manually over n batches

As long as you follow the steps in the multi-GPU tutorial, only a few extra steps are needed: set grad_req to 'add', run forward/backward multiple times per trainer.step(), and then call zero_grad().

Thanks @safrooze, I tried it and it’s working. This is a great help. Going from batch size 32 to 256 makes a big difference in training stability. Again, thanks!

Hi, I am using a self-defined RNN class with NDArray parameters and a forward function, relying on Gluon autograd.

Is it possible to aggregate gradients in this case? My model is not a Gluon model, so grad_req cannot be applied directly.

Hi @ShootingSpace, could you please post your model code? It will help. From what I see, NDArray has the method attach_grad, which takes a grad_req option (default: 'write', but 'add' is available as well). So in principle, you should be able to do the same thing.
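To make the difference between the two modes concrete, here is a minimal framework-free sketch. ToyParam is a hypothetical stand-in written for this illustration, not MXNet code; it only mimics the idea that 'write' overwrites the gradient buffer on each backward pass while 'add' accumulates into it.

```python
class ToyParam:
    """Toy stand-in for an array with an attached gradient buffer (not MXNet code)."""

    def __init__(self, grad_req="write"):
        if grad_req not in ("write", "add"):
            raise ValueError("grad_req must be 'write' or 'add'")
        self.grad_req = grad_req
        self.grad = 0.0

    def backward(self, new_grad):
        # 'write' overwrites the buffer; 'add' accumulates into it.
        if self.grad_req == "write":
            self.grad = new_grad
        else:
            self.grad += new_grad

    def zero_grad(self):
        # With 'add', the buffer has to be cleared by hand after each update.
        self.grad = 0.0


batch_grads = [0.5, 1.5, -1.0]  # made-up gradients from three batches

w_write = ToyParam("write")
w_add = ToyParam("add")
for g in batch_grads:
    w_write.backward(g)
    w_add.backward(g)

print(w_write.grad)  # only the last batch's gradient survives
print(w_add.grad)    # sum of all three batch gradients
```

With 'add', nothing resets the buffer automatically, which is why the gradients have to be zeroed manually after every parameter update.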

Thanks for your suggestion! The problem is that the loss becomes NaN. Do you know if we need to add the losses together over n batches before calling backward()?
My GRU class, borrowed from the Gluon tutorial:

class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)

        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden

        ########################
        #  Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

        ########################
        #  Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

        ########################
        #  Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                     self.bz, self.br, self.bh, self.Why, self.by]




    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g

            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)

def cross_entropy(yhat, y):
    return - nd.mean(nd.sum(y * nd.log(yhat), axis=0, exclude=True))

def average_ce_loss(outputs, labels):
    '''Averaging the loss over the sequence'''
    assert(len(outputs) == len(labels))
    total_loss = nd.array([0.], ctx=context)
    for (output, label) in zip(outputs,labels):
        total_loss = total_loss + cross_entropy(output, label)
    return total_loss / len(outputs)

Hi,

I am not sure that what I suggest will work 100%, but it’s easy to give it a try. If I were you, I’d change my model to use Gluon directly, from here.

In your example, I think you are missing a line in your model definition, this line:

for param in self.params:
    param.attach_grad(grad_req='add')

I would give it a try with the following modifications (based on the tutorial you followed).

modification in your model:

class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)

        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden

        ########################
        #  Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

        ########################
        #  Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

        ########################
        #  Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                     self.bz, self.br, self.bh, self.Why, self.by]

        # @@@@@@@@@@@ MODIFICATION HERE @@@@@@@@@@@  
        for param in self.params:
            param.attach_grad(grad_req='add') # This tells mxnet to add the gradients
        # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g

            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)

In the example, the SGD update runs on every iteration. We need to manually add a “delay_rate” so that the update runs only every N iterations, once enough gradients have been aggregated. So I am modifying SGD in the example like this:

delay_rate = 4 # This says to aggregate over 4 batch iterations before updating

# modified SGD that takes the average of the gradients
def SGD(params, lr, _delay_rate):
    for param in params:
        param[:] = param - (lr / _delay_rate) * param.grad

In the initial example, a delay_rate of 1 is the default behaviour, but if you aggregate gradients over, say, 4 iterations, you need to divide their magnitude by 4.
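As a quick sanity check of the division by delay_rate, here is a small framework-free sketch. It assumes a one-parameter model y = w * x with a squared-error loss and equally sized micro-batches; the data values are made up for illustration. Under those assumptions, the accumulated gradient divided by delay_rate matches the gradient of the mean loss over the combined batch.

```python
import math


def mean_grad(w, xs, ys):
    """Gradient of the mean of 0.5 * (w*x - y)**2 with respect to w."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.3
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]

delay_rate = 4                  # number of micro-batches to aggregate
micro = len(xs) // delay_rate   # samples per micro-batch

accumulated = 0.0
for k in range(delay_rate):
    lo, hi = k * micro, (k + 1) * micro
    # Same effect as grad_req='add': each backward pass adds into the buffer.
    accumulated += mean_grad(w, xs[lo:hi], ys[lo:hi])

averaged = accumulated / delay_rate   # what (lr / delay_rate) * grad uses
full_batch = mean_grad(w, xs, ys)     # gradient of one big batch of 8

print(math.isclose(averaged, full_batch))
```

If the micro-batches have different sizes, a plain division by delay_rate is no longer exact and each contribution would need to be weighted by its sample count.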

Then in the training loop I would replace the line:

SGD(params, learning_rate)

with the following (assuming you’ve created an instance of the class GRU with the name net somewhere):

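# i is assumed to be a zero-based batch counter
if (i + 1) % delay_rate == 0:  # update every delay_rate iterations
    SGD(params, learning_rate, delay_rate)
    # Now manually zero the grads in place
    # (a plain NDArray has no zero_grad(); that method belongs to Gluon Parameters)
    for param in net.params:
        param.grad[:] = 0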
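The intended schedule can be checked without any framework. The sketch below assumes a zero-based batch counter and a modulo test; update_steps is a hypothetical helper written for this illustration only.

```python
def update_steps(num_batches, delay_rate):
    """Return the zero-based batch indices after which an update would run."""
    return [i for i in range(num_batches) if (i + 1) % delay_rate == 0]


# With 10 batches and delay_rate = 4, updates run after the 4th and 8th batch.
print(update_steps(10, 4))

# A division test such as i / delay_rate == 0 is only true for i == 0.
print([i for i in range(10) if i / 4 == 0])
```

Note that with 10 batches the last two are accumulated but never applied, so the tail of an epoch needs either an extra update or a decision to carry the gradients over.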

Hope this helps. By the way, I am a newbie with RNNs and have just started learning, so I don’t know whether what I say needs modification for your model.

Cheers

Thanks for your suggested implementation of the delayed SGD.

You are right, I had better switch to Gluon layers.

Hi, I tried this gradient aggregation trick and everything worked well.
But in a case where I wrote a new custom block that uses softmax() to squash the learnable weights into the range 0~1, I got an error after calling trainer.step(), which read:

/softmax-inl.h:267: Check failed: req[0] != kAddTo (3 vs. 3)

Removing the softmax() line in the code can fix the error but it isn’t what I want. Is anything wrong in my code?

def forward(self, x):
    p = nd.softmax(self._rank_p.data(), axis=0)
    .......
    x = nd.dot(x, p)
    return x

def train():
    net = net()
    for p in net.collect_params().values():
        p.grad_req = 'add'
    .......
    iter_time = 1
    for data, label in train_data:
        with autograd.record():
            output = net(data)
            loss = loss_fun(output, label)
        loss.backward()
        if iter_time % BATCH_UPDATE_PERIOD == 0:
            trainer.step(BATCH_SIZE * BATCH_UPDATE_PERIOD)
            for p in net.collect_params().values():
                p.zero_grad()
        elif iter_time == len(train_data):
            trainer.step(BATCH_SIZE * (len(train_data) % BATCH_UPDATE_PERIOD))
            for param in net.collect_params().values():
                param.zero_grad()
        train_loss += nd.mean(loss).asscalar()
        iter_time += 1
......
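The batch_size passed to trainer.step() in the loop above is used to rescale the accumulated gradient, so it should equal the number of samples accumulated since the last update. Here is a framework-free sketch of that bookkeeping; step_sizes is a hypothetical helper, and it assumes every batch holds exactly batch_size samples.

```python
def step_sizes(num_batches, batch_size, update_period):
    """Return the batch_size argument for each optimizer step in one epoch."""
    sizes = []
    pending = 0  # batches accumulated since the last step
    for it in range(1, num_batches + 1):
        pending += 1
        # Step after a full period, or at the end of the epoch for leftovers.
        if pending == update_period or it == num_batches:
            sizes.append(batch_size * pending)
            pending = 0
    return sizes


# 10 batches of 32 samples, updating every 4 batches:
# two full steps of 128 samples and one leftover step of 64.
print(step_sizes(num_batches=10, batch_size=32, update_period=4))
```

When the number of batches is an exact multiple of the update period, no leftover step is produced, which matches the if/elif ordering in the posted training loop.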

Hi @Ghostish - You are correct that nd.softmax() doesn’t support grad_req='add'. However I just tested nd.SoftmaxActivation, which basically does the same thing (but is supposed to be deprecated!) and it does support grad_req='add'. So for now, you can use nd.SoftmaxActivation(). I’ve reported the issue here.

Thanks a lot for the solution :grinning:

Is there any way to do this trick using Module API instead of Gluon? I find that setting gradient to zero is not that straightforward when using Module API.

Once you create a module, you can use mod._exec_group.grad_arrays to access the gradients. Each element in grad_arrays is itself a list of NDArrays. To set the NDArray to zero, just do x[:] = 0 or x*=0.
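The reason for writing x[:] = 0 rather than x = 0 is that the buffer has to be cleared in place, since other code keeps holding a reference to the same array. A small framework-free sketch of that distinction, with nested Python lists standing in for the list of lists of NDArrays (the values and the name held_elsewhere are made up for illustration):

```python
# Stand-in for grad_arrays: one inner list per parameter,
# each holding one gradient buffer per device.
grad_arrays = [[[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]]]
held_elsewhere = grad_arrays[0][0]  # a second reference to the same buffer

for per_param in grad_arrays:
    for x in per_param:
        x[:] = [0.0] * len(x)  # in-place: every holder of the buffer sees zeros

print(held_elsewhere)

# By contrast, rebinding a name leaves the original buffer untouched.
buf = [1.0, 2.0]
alias = buf
buf = [0.0, 0.0]
print(alias)
```

With real NDArrays the same in-place effect comes from x[:] = 0 or x *= 0, as described above.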

Thanks for the solution. I never thought I could do it like this. By the way, could you tell me how you found these “private APIs”? I searched the official documentation but found nothing relevant.

The Module API is less flexible than Gluon, so you have to read the Python code of MXNet to get a better understanding. If you choose to do so, I recommend cloning the MXNet repo and using PyCharm to navigate the code. In this particular case, this is what I did:

  1. Looked at BaseModule.fit() to see how a training loop updates parameters
  2. Looked at Module.update() to see the details of the parameter update. Noticed that gradients are accessed through self._exec_group.grad_arrays
  3. I already knew how to set the contents of a gradient to zero. Alternatively, you can check out the implementation of Parameter.zero_grad() for inspiration, which I just did, and realized that it uses yet another way to set gradient NDArrays to zero:
    ndarray.zeros_like(i, out=i)

I really appreciate the suggestion and the help. It’s so good to have people like you in the community. :grinning:

Hi, bro, would you take a look at this issue: “About stale gradient”?

Thank you bro

Hi! I think it is a bit more complicated than this in practice, right?
For example, when I do what you propose on the GluonCV SSD model, I get this: “UserWarning: Gradient of Parameter ssd0_expand_trans_bn0_moving_mean on context gpu(0) has not been updated by backward since last step. This could mean a bug in your model that made it only use a subset of the Parameters (Blocks) for this iteration. If you are intentionally only using a subset, call step with ignore_stale_grad=True to suppress this warning and skip updating of Parameters with stale gradient”

How should I handle this?

This fix, from the thread “About stale gradient”, seems to work:

for p in net.collect_params().values():
    if p.grad_req != 'null':
        p.grad_req = 'add'

Actually, it is not training anything; see “How to make gradient accumulation work in MXNet?”. If someone can help, that would be appreciated!

Hi, this error should not have anything to do with setting grad_req to 'add'. I’ve encountered it many times when I wasn’t calculating the loss properly. Please try the same code without the grad_req='add' trick to see if you get the same error.