Does autograd (and training) still work if a block is called more than once by its parent?

Is it safe to call the same child block more than once in a HybridBlock? Specifically, will gradients still be saved correctly for each use of the child, and will those gradients be used during the update?

I’m training a network with a triplet loss (to learn an embedding), so the same underlying network is called three times, on three different images, for each training instance (a triplet). The three calls to the underlying network should all use the same parameters, so those need to be shared. Can I just call the same block more than once, as I’ve been doing, or do I need to build separate blocks which share parameters?

Here is a simplified skeleton of what I have done so far. (In reality the underlying network is a pretrained CNN.)

from mxnet import gluon
from mxnet.gluon import HybridBlock
from mxnet.gluon.nn import Dense

class MyCNNEncoder(HybridBlock):
    def __init__(self, *args, **kwds):
        super().__init__(*args, **kwds)
        with self.name_scope():
            self.loss = gluon.loss.TripletLoss()
            # HSeq and L2Norm are custom helpers; the real underlying
            # network is a pretrained CNN.
            self.underlying = HSeq(
                Dense(data.embedding_size),
                L2Norm(),
            )

    def hybrid_forward(self, F, data):
        # The same child block is applied to each of the three images in the triplet.
        embeddings = [self.underlying(img) for img in data.split(3, axis=1, squeeze_axis=True)]
        return (self.loss(*embeddings), F.stack(*embeddings, axis=1))

You can use the same block multiple times.

The behavior of gradients is controlled by the grad_req parameter (https://mxnet.incubator.apache.org/api/python/gluon/gluon.html?highlight=grad_req#mxnet.gluon.Parameter.grad_req), which is 'write' by default. So, as long as you call backward() only once per forward pass, the gradients from every use of the block are accumulated correctly.
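For example, here is a minimal sketch (a toy one-layer block, not your network) of a single block being called three times inside one record()/backward() pass; with the default grad_req = 'write', its weight gradient ends up holding the contribution from all three calls:

import mxnet as mx
from mxnet import autograd, gluon

shared = gluon.nn.Dense(1, use_bias=False)  # one block, reused three times
shared.initialize()

a = mx.nd.array([[1.0, 0.0]])
b = mx.nd.array([[0.0, 1.0]])
c = mx.nd.array([[1.0, 1.0]])

with autograd.record():
    # The same parameters are used for all three "images".
    out = shared(a).sum() + shared(b).sum() + shared(c).sum()
out.backward()

# d(out)/dW = a + b + c, i.e. every use of the block contributes to the gradient.
print(shared.weight.grad())  # -> [[2. 2.]]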

The documentation for grad_req = 'write' says “‘write’ means everytime gradient is written to grad NDArray.” To me this implies that the gradient will be overwritten each time the parameter is used. Further, the docs for grad_req = 'add' say “You need to manually call zero_grad() to clear the gradient buffer before each iteration when using this option.” So how does ‘write’ mode produce accumulated gradients if it performs an overwrite?
Do you mean I should always use ‘add’ if I call a single block more than once?

As I understand it, when you do a single batch (i.e. call backward() once), the gradients are aggregated over all runs of the block on the samples in that pass. The next call of backward() will overwrite those gradients with new values. But if you set grad_req='add', gradients keep accumulating across subsequent calls of backward() as well, and you have to call zero_grad() yourself to clear them.
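A quick sketch of that difference, with a made-up one-layer block and inputs:

import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(1, use_bias=False)
net.initialize()
x1 = mx.nd.array([[1.0, 1.0]])
x2 = mx.nd.array([[2.0, 3.0]])

def one_pass(x):
    with autograd.record():
        y = net(x).sum()
    y.backward()

one_pass(x1)
print(net.weight.grad())  # [[1. 1.]]
one_pass(x2)
print(net.weight.grad())  # [[2. 3.]] -- 'write' replaced the previous gradient

net.collect_params().setattr('grad_req', 'add')
net.collect_params().zero_grad()  # start from a clean gradient buffer
one_pass(x1)
one_pass(x2)
print(net.weight.grad())  # [[3. 4.]] -- gradients from both backward() calls summed
net.collect_params().zero_grad()  # must be cleared manually before the next iteration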

If you already have working code, you can set up a simple experiment (a sketch follows the list):

  1. Put data1 through your block and note the values of the gradients.
  2. Put data2 through your block and note the values of the gradients.
  3. Put data1 and data2 through your block in the same pass and note the values of the gradients. They should be different from the first two cases.
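For instance, a self-contained toy version of that experiment (made-up block and data, not your CNN) could look like this:

import mxnet as mx
from mxnet import autograd, gluon

block = gluon.nn.Dense(1, use_bias=False)
block.initialize()

data1 = mx.nd.array([[1.0, 2.0]])
data2 = mx.nd.array([[3.0, 5.0]])

def grad_after(*inputs):
    # One record()/backward() pass over the given inputs; returns a copy of the gradient.
    with autograd.record():
        out = block(inputs[0]).sum()
        for x in inputs[1:]:
            out = out + block(x).sum()
    out.backward()
    return block.weight.grad().copy()

print(grad_after(data1))         # step 1: gradient for data1 alone    -> [[1. 2.]]
print(grad_after(data2))         # step 2: gradient for data2 alone    -> [[3. 5.]]
print(grad_after(data1, data2))  # step 3: both in one pass, different -> [[4. 7.]]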

You are 100% correct. I clearly had an incorrect understanding of how recording and the grad / grad_req parameters work. I wrote up the test you described to help myself understand. It is posted at: https://gist.github.com/arthurp/b2feffc5d809d06bad514cbab219a215
