Does autograd (and training) still work if a block is called more than once by its parent?

gluon
how-to
python
#1

Is it safe to call the same child block more than once in a HybridBlock? Specifically, will gradients still be saved correctly for each use of the child, and will those gradients be used during the update?

I’m training a network by triplet loss (to learn an embedding) and so I have the same underlying network which is called three times on three different images while training on a single example instance (a triplet). The three calls to the underlying network should all use the same parameters, so those need to be shared. Can I just call the same block more than once like I’ve been doing or do I need to build separate blocks which share parameters?

Here is a simplified skeleton of what I have done so far. (In reality the underlying network is a pretrained CNN.)

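from mxnet import gluon
from mxnet.gluon import HybridBlock
from mxnet.gluon.nn import Dense

# HSeq, L2Norm and data.embedding_size are defined elsewhere (not shown).
class MyCNNEncoder(HybridBlock):
    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)
        with self.name_scope():
            self.loss = gluon.loss.TripletLoss()
            # One shared sub-network; its parameters are reused by every call below.
            self.underlying = HSeq(
                Dense(data.embedding_size),
                L2Norm(),
            )

    def hybrid_forward(self, F, data):
        # Split the triplet along axis 1 and embed each image with the same block.
        embeddings = [self.underlying(img) for img in data.split(3, axis=1, squeeze_axis=True)]
        return (self.loss(*embeddings), F.stack(*embeddings, axis=1))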
#2

You can use the same block multiple times.

The behavior of gradients is controlled by the grad_req parameter - https://mxnet.incubator.apache.org/api/python/gluon/gluon.html?highlight=grad_req#mxnet.gluon.Parameter.grad_req - which is 'write' by default. So, as long as you call backward() once per iteration, it should calculate gradients aggregated over every use of the block (the per-use contributions are summed rather than averaged).

#3

The documentation of grad_req = 'write' says “‘write’ means everytime gradient is written to grad NDArray.” To me this implies that the gradient will be overwritten each time the parameter is used. Further, the docs for grad_req = 'add' say “You need to manually call zero_grad() to clear the gradient buffer before each iteration when using this option.” So how does ‘write’ mode aggregate the gradients from each use if it performs an overwrite?
Do you mean I should always use ‘add’ if I call a single block more than once?

#4

As I understand it, when you process a single batch (i.e. call backward() once), your gradients will be aggregated over all runs of the block on the samples. The next call of backward() will overwrite the gradients with new values. But if you set grad_req='add', then gradients will keep accumulating across subsequent calls of backward(), and you will have to call zero_grad() to reset them.
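To make the difference concrete, here is a toy model in plain Python. It is not MXNet code: ToyParam and toy_backward are hypothetical names made up for this illustration, and the "block" is just f(x) = w * x with a shared scalar w. It only mimics the semantics described above, under the assumption that every use of w inside one backward pass contributes to the same gradient.

```python
class ToyParam:
    """Hypothetical stand-in for a parameter with a gradient buffer."""

    def __init__(self, value, grad_req="write"):
        self.value = value
        self.grad_req = grad_req
        self.grad = 0.0

    def zero_grad(self):
        self.grad = 0.0


def toy_backward(param, inputs):
    """One backward pass for loss = sum(param.value * x for x in inputs)."""
    # Within a single pass, d(loss)/dw sums the contribution of every use of w.
    total = float(sum(inputs))
    if param.grad_req == "write":
        param.grad = total        # replaces whatever the previous pass left
    elif param.grad_req == "add":
        param.grad += total       # keeps accumulating across passes
    else:
        raise ValueError("unsupported grad_req: %r" % (param.grad_req,))
    return param.grad


write_param = ToyParam(0.5, grad_req="write")
first_write = toy_backward(write_param, [1.0, 2.0, 3.0])  # three uses, one pass
second_write = toy_backward(write_param, [10.0])          # next pass overwrites

add_param = ToyParam(0.5, grad_req="add")
first_add = toy_backward(add_param, [1.0, 2.0, 3.0])
second_add = toy_backward(add_param, [10.0])              # adds to the old value
add_param.zero_grad()
after_reset = toy_backward(add_param, [10.0])

print(first_write, second_write)           # 6.0 10.0
print(first_add, second_add, after_reset)  # 6.0 16.0 10.0
```

In both modes the three uses inside the first pass are combined; the modes only differ in what happens to the buffer on the next pass.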

If you already have working code, you can set up a simple experiment:

  1. Put data1 through your block and note the values of the gradients.
  2. Put data2 through your block and note the values of the gradients.
  3. Put data1 and data2 through your block and note the values of the gradients. They should differ from the first two cases.
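The same three steps can be tried without any framework. The sketch below is an assumption-laden stand-in rather than the real experiment: the "block" is a scalar function w * x, the loss is the sum of squared outputs, and the gradient is estimated by central differences instead of autograd. The names block, loss and numeric_grad are made up for this example.

```python
def block(w, x):
    """Stand-in for a block with one shared parameter w."""
    return w * x


def loss(w, inputs):
    """Sum of squared block outputs; w is reused for every input."""
    return sum(block(w, x) ** 2 for x in inputs)


def numeric_grad(w, inputs, eps=1e-6):
    """Central-difference estimate of d(loss)/dw."""
    return (loss(w + eps, inputs) - loss(w - eps, inputs)) / (2 * eps)


w = 0.5
data1, data2 = 2.0, 3.0

grad_1 = numeric_grad(w, [data1])            # step 1: data1 only
grad_2 = numeric_grad(w, [data2])            # step 2: data2 only
grad_both = numeric_grad(w, [data1, data2])  # step 3: both through the same block

# Analytically d(loss)/dw = 2 * w * sum(x ** 2), so the values are close to 4, 9 and 13.
print(grad_1, grad_2, grad_both)
print(abs(grad_both - (grad_1 + grad_2)) < 1e-6)
```

The third gradient differs from the first two and matches their sum, which is what you would expect to see in the real experiment if each use of the shared block contributes to the gradient.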
#5

You are 100% correct. I clearly had an incorrect understanding of how recording and the grad{,_req} parameters work. I wrote up the test you explained to help myself understand. It is posted at: https://gist.github.com/arthurp/b2feffc5d809d06bad514cbab219a215
