Aggregate gradients manually over n batches

Dear all,

I have memory limitations, so I cannot use a large batch size. As a consequence, my training is unstable. Is there a way to manually aggregate gradients over n batches of forward/backward steps and then manually update my parameters? I'm pretty sure there is, so any pointers to code examples would be extremely appreciated!

Thanks!

PS Any recommendation for making training more stable with a small batch size is also most welcome.

Are you using Symbolic API or Gluon/autograd?

Thanks, I'm using Gluon.

This is very straightforward to do with Gluon. You need to set grad_req on your network's Parameter instances to 'add' and manually reset the gradients to zero using zero_grad() after each Trainer.step() (see here). To set grad_req to 'add':

for p in net.collect_params().values():
    p.grad_req = 'add'

And similarly, call zero_grad() on each parameter after calling Trainer.step(). Remember to adjust the batch_size argument of trainer.step() accordingly (i.e. pass the effective batch size, batch_size * n, when you aggregate over n batches).
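
Putting it together, a minimal sketch of the whole loop could look like this (net, train_data, loss_fn and the sizes are placeholders, not taken from any particular tutorial):

import mxnet as mx
from mxnet import autograd, gluon

accumulate = 4      # number of batches to aggregate before each update
batch_size = 32

for p in net.collect_params().values():
    p.grad_req = 'add'              # gradients are summed instead of overwritten

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

for i, (data, label) in enumerate(train_data, start=1):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()                 # gradients keep accumulating in param.grad
    if i % accumulate == 0:
        trainer.step(batch_size * accumulate)   # normalize by the effective batch size
        for p in net.collect_params().values():
            p.zero_grad()           # reset the accumulated gradients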

5 Likes

thank you very much!!

Hi @safrooze, does this work in the case of data parallelism (multiple copies of the same model on different GPUs, each fed different data)? That is, do I just need to set grad_req of all parameters (in all contexts) to 'add' and then call trainer.step(augmented_batch) after more than one iteration?

Cheers

As long as you follow the steps in the multi-GPU tutorial, the extra steps are exactly the same: set grad_req to 'add', run forward/backward multiple times per trainer.step(), and then call zero_grad().
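
Roughly, a sketch of combining the two (ctx_list, net, loss_fn, trainer, accumulate and batch_size are assumed names; the splitting follows the pattern from the multi-GPU tutorial):

import mxnet as mx
from mxnet import autograd, gluon

ctx_list = [mx.gpu(0), mx.gpu(1)]

for i, (data, label) in enumerate(train_data, start=1):
    # split the batch across GPUs, as in the multi-GPU tutorial
    data_parts = gluon.utils.split_and_load(data, ctx_list)
    label_parts = gluon.utils.split_and_load(label, ctx_list)
    with autograd.record():
        losses = [loss_fn(net(x), y) for x, y in zip(data_parts, label_parts)]
    for l in losses:
        l.backward()                # gradients from all GPUs accumulate
    if i % accumulate == 0:
        trainer.step(batch_size * accumulate)
        for p in net.collect_params().values():
            p.zero_grad()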

Thanks @safrooze, I tried it and it's working. This is a great help. I went from batch size 32 to an effective batch size of 256, which makes a big difference in the stability of training. Again, thanks!

Hi, I was using a self-defined RNN class with NDArray parameters and a forward function, with the help of Gluon autograd.

Is it possible to apply gradient aggregation in this case? Since my model is not a Gluon model, grad_req cannot be set directly.

Hi @ShootingSpace, could you please post your model code? It will help. From what I see, NDArray has the method attach_grad, which takes a grad_req option (default: 'write', but there is an 'add' option as well). So in principle, you should be able to do the same thing.
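
A tiny illustration of what I mean (the array values are arbitrary, just to show that the gradients accumulate):

from mxnet import nd, autograd

w = nd.array([1.0, 2.0])
w.attach_grad(grad_req='add')   # gradients add up across backward() calls

for _ in range(2):
    with autograd.record():
        y = (w * w).sum()
    y.backward()

print(w.grad)                   # 2*w accumulated twice -> [4. 8.]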

Thanks for your suggestion! The problem is that the loss becomes NaN. Do you know if we need to add the losses together over n batches before calling backward()?
My GRU class is borrowed from the gluon tutorial:

class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)

        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden

        ########################
        #  Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

        ########################
        #  Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

        ########################
        #  Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                     self.bz, self.br, self.bh, self.Why, self.by]




    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g

            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)

def cross_entropy(yhat, y):
    return - nd.mean(nd.sum(y * nd.log(yhat), axis=0, exclude=True))

def average_ce_loss(outputs, labels):
    '''Averaging the loss over the sequence'''
    assert(len(outputs) == len(labels))
    total_loss = nd.array([0.], ctx=context)
    for (output, label) in zip(outputs,labels):
        total_loss = total_loss + cross_entropy(output, label)
    return total_loss / len(outputs)

1 Like

Hi,

I'm not sure if what I'm about to suggest will work 100%, but it's easy to give it a try. If I were you, I'd change my model to use Gluon directly, following the example here.

As for your example, I think you are missing a line in your model definition, namely:

for param in self.params:
    param.attach_grad(grad_req='add')

I would give it a try with the following modifications (based on the tutorial you followed).

modification in your model:

class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)

        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden

        ########################
        #  Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

        ########################
        #  Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

        ########################
        #  Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                     self.bz, self.br, self.bh, self.Why, self.by]

        # @@@@@@@@@@@ MODIFICATION HERE @@@@@@@@@@@  
        for param in self.params:
            param.attach_grad(grad_req='add') # This tells mxnet to add the gradients
        # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g

            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)

Now, in the example the SGD function is applied at every iteration, but we need to manually add a "delay_rate" that performs the update every N iterations, once enough gradients have been aggregated. So I am modifying the SGD function from the example like this:

delay_rate = 4 # This says to aggregate over 4 batch iterations before updating

# modified SGD that takes the average of the gradients
def SGD(params, lr, _delay_rate):
    for param in params:
        param[:] = param - (lr / _delay_rate) * param.grad

In the initial example, a delay_rate of 1 is the implicit behaviour, but if you aggregate gradients over, say, 4 iterations, you need to divide their magnitude by 4.

Then in the training loop I would replace the line:

SGD(params, learning_rate)

with the following (assuming you've instantiated the GRU class under the name net somewhere):

if i % delay_rate == 0: # update every delay_rate iterations
    SGD(params, learning_rate, delay_rate)
    # Now manually zero the accumulated grads in place
    for param in net.params:
        param.grad[:] = 0  # plain NDArrays have no zero_grad() method

Hope this helps. By the way, I am a newbie with RNNs (I just started learning), so I don't know whether what I'm suggesting needs modifications for your model.

Cheers

1 Like

Thanks for your suggested implementation of the delayed SGD.

You are right, I'd better switch to Gluon layers.

Hi, I tried this gradient aggregation trick and everything worked well.
But in a case where I wrote a new custom block that uses softmax() to squash the learnable weights into the range 0~1, I got an error after calling trainer.step() which reads:

/softmax-inl.h:267: Check failed: req[0] != kAddTo (3 vs. 3)

Removing the softmax() line from the code fixes the error, but that isn't what I want. Is anything wrong in my code?

def forward(self, x):
    p = nd.softmax(self._rank_p.data(), axis=0)
    .......
    x = nd.dot(x, p)
    return x

def train():
    net = net()
    for p in net.collect_params().values():
        p.grad_req = 'add'
    .......
    iter_time = 1
    for data, label in train_data:
        with autograd.record():
            output = net(data)
            loss = loss_fun(output, label)
        loss.backward()
        if iter_time % BATCH_UPDATE_PERIOD == 0:
            trainer.step(BATCH_SIZE * BATCH_UPDATE_PERIOD)
            for p in net.collect_params().values():
                p.zero_grad()
        elif iter_time == len(train_data):
            trainer.step(BATCH_SIZE * (len(train_data) % BATCH_UPDATE_PERIOD))
            for param in net.collect_params().values():
                param.zero_grad()
        train_loss += nd.mean(loss).asscalar()
        iter_time += 1
......

Hi @Ghostish - You are correct that nd.softmax() doesn’t support grad_req='add'. However I just tested nd.SoftmaxActivation, which basically does the same thing (but is supposed to be deprecated!) and it does support grad_req='add'. So for now, you can use nd.SoftmaxActivation(). I’ve reported the issue here.
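
A possible sketch of the swap inside your forward() (note: I'm assuming SoftmaxActivation normalizes over the last axis, so a transpose may be needed to reproduce your axis=0 softmax; please verify the output):

def forward(self, x):
    # nd.SoftmaxActivation supports grad_req='add'; transpose so the
    # normalization runs along what was axis 0 in the original code
    p = nd.SoftmaxActivation(self._rank_p.data().T).T
    x = nd.dot(x, p)
    return x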

2 Likes

Thanks a lot for the solution :grinning:

Is there any way to do this trick using the Module API instead of Gluon? I find that setting the gradients to zero is not that straightforward when using the Module API.

Once you create a module, you can use mod._exec_group.grad_arrays to access the gradients. Each element in grad_arrays is itself a list of NDArrays. To set an NDArray to zero, just do x[:] = 0 or x *= 0.
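
A rough sketch of how that could look in a Module training loop (mod, train_iter and the counts are placeholders; the module also needs to be bound with grad_req='add' so that backward() accumulates):

accumulate = 4

# bind with grad_req='add' so backward() sums into the gradient arrays
mod.bind(data_shapes=train_iter.provide_data,
         label_shapes=train_iter.provide_label,
         grad_req='add')
mod.init_params()
mod.init_optimizer(optimizer='sgd', optimizer_params={'learning_rate': 0.01})

for i, batch in enumerate(train_iter, start=1):
    mod.forward(batch, is_train=True)
    mod.backward()
    if i % accumulate == 0:
        mod.update()                              # apply the aggregated gradients
        for grads in mod._exec_group.grad_arrays:
            for g in grads:
                g[:] = 0                          # reset for the next window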

Thanks for the solution. I never thought I could do it like this. By the way, could you tell me how you found these "private APIs"? I searched the official documentation but got nothing relevant.

The Module API is less flexible than Gluon, so you have to read the Python code of MXNet to get a better understanding. I recommend cloning the MXNet repo and using PyCharm to traverse the code if you choose to do so. In this particular case, this is what I did:

  1. Looked at BaseModule.fit() to see how a training loop updates parameters
  2. Looked at Module.update() to see the details of the parameter update. Noticed that gradients are accessed through self._exec_group.grad_arrays
  3. I already knew how to set the contents of a gradient to zero. Alternatively, you can also check out the implementation of Parameter.zero_grad() for inspiration, which I just did and realized that it uses yet another way to set gradient NDArrays to zero:
    ndarray.zeros_like(i, out=i)
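
So, in short, any of these in-place operations should work on an element g of grad_arrays:

g[:] = 0                    # slice assignment
g *= 0                      # in-place multiply
nd.zeros_like(g, out=g)     # the approach used inside Parameter.zero_grad()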