Aggregate gradients manually over n batches

Dear all,

I have memory limitations, so I cannot use a large batch size. As a consequence, my training is unstable. Is there a way to manually aggregate gradients over n batches of forward/backward steps and then manually update my parameters? I'm pretty sure there is, so any pointers to code examples would be extremely appreciated!

Thanks!

PS Any recommendation for making training more stable with a small batch size is also most welcome.

Are you using Symbolic API or Gluon/autograd?

Thanks, I'm using Gluon.

This is very straightforward to do with Gluon. You need to set grad_req on your network's Parameter instances to 'add' and manually reset the gradients to zero using zero_grad() after each Trainer.step() (see here). To set grad_req to 'add':

for p in net.collect_params().values():
    p.grad_req = 'add'

And similarly, call zero_grad() on each parameter after calling Trainer.step(). Remember to adjust the batch_size argument of trainer.step() accordingly (i.e. pass the effective batch size, batch_size * n, when you aggregate over n batches).
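
Putting it together, a minimal sketch of the whole loop could look like this (net, train_data, loss_fn and the sizes are placeholders, not taken from any particular tutorial):

import mxnet as mx
from mxnet import autograd, gluon

accumulate = 4      # number of batches to aggregate before each update
batch_size = 32

for p in net.collect_params().values():
    p.grad_req = 'add'              # gradients are summed instead of overwritten

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})

for i, (data, label) in enumerate(train_data, start=1):
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()                 # gradients keep accumulating in param.grad
    if i % accumulate == 0:
        trainer.step(batch_size * accumulate)   # normalize by the effective batch size
        for p in net.collect_params().values():
            p.zero_grad()           # reset the accumulated gradients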

5 Likes

thank you very much!!

Hi @safrooze, does this work in the case of data parallelism (multiple copies of the same model on different GPUs, each fed different data)? That is, do I just need to set grad_req of all parameters (in all contexts) to 'add' and then call trainer.step(augmented_batch) after more than one iteration?

Cheers

As long as you follow the steps in the multi-GPU tutorial, the extra steps are exactly the same: set grad_req to 'add', run forward/backward multiple times per trainer.step(), and then call zero_grad().
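
Roughly, a sketch of combining the two (ctx_list, net, loss_fn, trainer, accumulate and batch_size are assumed names; the splitting follows the pattern from the multi-GPU tutorial):

import mxnet as mx
from mxnet import autograd, gluon

ctx_list = [mx.gpu(0), mx.gpu(1)]

for i, (data, label) in enumerate(train_data, start=1):
    # split the batch across GPUs, as in the multi-GPU tutorial
    data_parts = gluon.utils.split_and_load(data, ctx_list)
    label_parts = gluon.utils.split_and_load(label, ctx_list)
    with autograd.record():
        losses = [loss_fn(net(x), y) for x, y in zip(data_parts, label_parts)]
    for l in losses:
        l.backward()                # gradients from all GPUs accumulate
    if i % accumulate == 0:
        trainer.step(batch_size * accumulate)
        for p in net.collect_params().values():
            p.zero_grad()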

Thanks @safrooze, I tried it and it's working. This is a great help. I went from batch size 32 to an effective batch size of 256, which makes a big difference in the stability of training. Again, thanks!

Hi, I was using a self-defined RNN class with NDArray parameters and a forward function, with the help of Gluon autograd.

Is it possible to apply gradient aggregation in this case? Since my model is not a Gluon model, grad_req cannot be set directly.

Hi @ShootingSpace, could you please post your model code? It will help. From what I see, NDArray has the method attach_grad, which takes a grad_req option (default: 'write', but there is an 'add' option as well). So in principle, you should be able to do the same thing.
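
A tiny illustration of what I mean (the array values are arbitrary, just to show that the gradients accumulate):

from mxnet import nd, autograd

w = nd.array([1.0, 2.0])
w.attach_grad(grad_req='add')   # gradients add up across backward() calls

for _ in range(2):
    with autograd.record():
        y = (w * w).sum()
    y.backward()

print(w.grad)                   # 2*w accumulated twice -> [4. 8.]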

Thanks for your suggestion! The problem is that the loss becomes NaN. Do you know if we need to add the losses together over n batches before calling backward()?
My GRU class is borrowed from the gluon tutorial:

class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)

        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden

        ########################
        #  Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

        ########################
        #  Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

        ########################
        #  Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                     self.bz, self.br, self.bh, self.Why, self.by]




    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g

            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)

def cross_entropy(yhat, y):
    return - nd.mean(nd.sum(y * nd.log(yhat), axis=0, exclude=True))

def average_ce_loss(outputs, labels):
    '''Averaging the loss over the sequence'''
    assert(len(outputs) == len(labels))
    total_loss = nd.array([0.], ctx=context)
    for (output, label) in zip(outputs,labels):
        total_loss = total_loss + cross_entropy(output, label)
    return total_loss / len(outputs)

1 Like

Hi,

I'm not sure if what I'm about to suggest will work 100%, but it's easy to give it a try. If I were you, I'd change my model to use Gluon directly, following the example here.

As for your example, I think you are missing a line in your model definition, namely:

for param in self.params:
    param.attach_grad(grad_req='add')

I would give it a try with the following modifications (based on the tutorial you followed).

modification in your model:

class GRU():
    def __init__(self, vocab_size, num_hidden, seed, ctx=mx.cpu(0)):
        if seed:
            mx.random.seed(2018)

        num_inputs = vocab_size
        num_outputs = vocab_size
        num_hidden = num_hidden

        ########################
        #  Weights connecting the inputs to the hidden layer
        ########################
        self.Wxz = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxr = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01
        self.Wxh = nd.random_normal(shape=(num_inputs,num_hidden), ctx=ctx) * .01

        ########################
        #  Recurrent weights connecting the hidden layer across time steps
        ########################
        self.Whz = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whr = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01
        self.Whh = nd.random_normal(shape=(num_hidden,num_hidden), ctx=ctx)* .01

        ########################
        #  Bias vector for hidden layer
        ########################
        self.bz = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.br = nd.random_normal(shape=num_hidden, ctx=ctx) * .01
        self.bh = nd.random_normal(shape=num_hidden, ctx=ctx) * .01

        ########################
        # Weights to the output nodes
        ########################
        self.Why = nd.random_normal(shape=(num_hidden,num_outputs), ctx=ctx) * .01
        self.by = nd.random_normal(shape=num_outputs, ctx=ctx) * .01

        self.params = [self.Wxz, self.Wxr, self.Wxh, self.Whz, self.Whr, self.Whh,
                     self.bz, self.br, self.bh, self.Why, self.by]

        # @@@@@@@@@@@ MODIFICATION HERE @@@@@@@@@@@  
        for param in self.params:
            param.attach_grad(grad_req='add') # This tells mxnet to add the gradients
        # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

    def forward(self, inputs, h, temperature=1.0):
        outputs = []
        for X in inputs:
            z = nd.sigmoid(nd.dot(X, self.Wxz) + nd.dot(h, self.Whz) + self.bz)
            r = nd.sigmoid(nd.dot(X, self.Wxr) + nd.dot(h, self.Whr) + self.br)
            g = nd.tanh(nd.dot(X, self.Wxh) + nd.dot(r * h, self.Whh) + self.bh)
            h = z * h + (1 - z) * g

            yhat_linear = nd.dot(h, self.Why) + self.by
            yhat = softmax(yhat_linear, temperature=temperature)
            outputs.append(yhat)
        return (outputs, h)

Now, in the example the SGD function is applied at every iteration, but we need to manually add a "delay_rate" that performs the update every N iterations, once enough gradients have been aggregated. So I am modifying the SGD function from the example like this:

delay_rate = 4 # This says to aggregate over 4 batch iterations before updating

# modified SGD that takes the average of the gradients
def SGD(params, lr, _delay_rate):
    for param in params:
        param[:] = param - (lr / _delay_rate) * param.grad

In the initial example, a delay_rate of 1 is the implicit behaviour, but if you aggregate gradients over, say, 4 iterations, you need to divide their magnitude by 4.

Then in the training loop I would replace the line:

SGD(params, learning_rate)

with the following (assuming you've instantiated the GRU class under the name net somewhere):

if i % delay_rate == 0: # update every delay_rate iterations
    SGD(params, learning_rate, delay_rate)
    # Now manually zero the accumulated grads in place
    for param in net.params:
        param.grad[:] = 0  # plain NDArrays have no zero_grad() method

Hope this helps. By the way, I am a newbie with RNNs (I just started learning), so I don't know whether what I'm suggesting needs modifications for your model.

Cheers

1 Like

Thanks for your suggested implementation of the delayed SGD.

You are right, I'd better switch to Gluon layers.

Hi, I tried this gradient aggregation trick and everything worked well.
But in a case where I wrote a new custom block that uses softmax() to squash the learnable weights into the range 0~1, I got an error after calling trainer.step() which reads:

/softmax-inl.h:267: Check failed: req[0] != kAddTo (3 vs. 3)

Removing the softmax() line from the code fixes the error, but that isn't what I want. Is anything wrong in my code?

def forward(self, x):
    p = nd.softmax(self._rank_p.data(), axis=0)
    .......
    x = nd.dot(x, p)
    return x

def train():
    net = net()
    for p in net.collect_params().values():
        p.grad_req = 'add'
    .......
    iter_time = 1
    for data, label in train_data:
        with autograd.record():
            output = net(data)
            loss = loss_fun(output, label)
        loss.backward()
        if iter_time % BATCH_UPDATE_PERIOD == 0:
            trainer.step(BATCH_SIZE * BATCH_UPDATE_PERIOD)
            for p in net.collect_params().values():
                p.zero_grad()
        elif iter_time == len(train_data):
            trainer.step(BATCH_SIZE * (len(train_data) % BATCH_UPDATE_PERIOD))
            for param in net.collect_params().values():
                param.zero_grad()
        train_loss += nd.mean(loss).asscalar()
        iter_time += 1
......

Hi @Ghostish - You are correct that nd.softmax() doesn’t support grad_req='add'. However I just tested nd.SoftmaxActivation, which basically does the same thing (but is supposed to be deprecated!) and it does support grad_req='add'. So for now, you can use nd.SoftmaxActivation(). I’ve reported the issue here.
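
A possible sketch of the swap inside your forward() (note: I'm assuming SoftmaxActivation normalizes over the last axis, so a transpose may be needed to reproduce your axis=0 softmax; please verify the output):

def forward(self, x):
    # nd.SoftmaxActivation supports grad_req='add'; transpose so the
    # normalization runs along what was axis 0 in the original code
    p = nd.SoftmaxActivation(self._rank_p.data().T).T
    x = nd.dot(x, p)
    return x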

2 Likes

Thanks a lot for the solution :grinning:

Is there any way to do this trick using the Module API instead of Gluon? I find that setting the gradients to zero is not that straightforward when using the Module API.

Once you create a module, you can use mod._exec_group.grad_arrays to access the gradients. Each element in grad_arrays is itself a list of NDArrays. To set an NDArray to zero, just do x[:] = 0 or x *= 0.
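
A rough sketch of how that could look in a Module training loop (mod, train_iter and the counts are placeholders; the module also needs to be bound with grad_req='add' so that backward() accumulates):

accumulate = 4

# bind with grad_req='add' so backward() sums into the gradient arrays
mod.bind(data_shapes=train_iter.provide_data,
         label_shapes=train_iter.provide_label,
         grad_req='add')
mod.init_params()
mod.init_optimizer(optimizer='sgd', optimizer_params={'learning_rate': 0.01})

for i, batch in enumerate(train_iter, start=1):
    mod.forward(batch, is_train=True)
    mod.backward()
    if i % accumulate == 0:
        mod.update()                              # apply the aggregated gradients
        for grads in mod._exec_group.grad_arrays:
            for g in grads:
                g[:] = 0                          # reset for the next window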

Thanks for the solution. I never thought I could do it like this. By the way, could you tell me how you found these "private APIs"? I searched the official documentation but got nothing relevant.

The Module API is less flexible than Gluon, so you have to read the Python code of MXNet to get a better understanding. I recommend cloning the MXNet repo and using PyCharm to traverse the code if you choose to do so. In this particular case, this is what I did:

  1. Looked at BaseModule.fit() to see how a training loop updates parameters
  2. Looked at Module.update() to see the details of the parameter update. Noticed that gradients are accessed through self._exec_group.grad_arrays
  3. I already knew how to set the contents of a gradient to zero. Alternatively, you can also check out the implementation of Parameter.zero_grad() for inspiration, which I just did and realized that it uses yet another way to set gradient NDArrays to zero:
    ndarray.zeros_like(i, out=i)
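
So, in short, any of these in-place operations should work on an element g of grad_arrays:

g[:] = 0                    # slice assignment
g *= 0                      # in-place multiply
nd.zeros_like(g, out=g)     # the approach used inside Parameter.zero_grad()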