Gradients for Embedding layers in Gluon


#1

I am working with Gluon models from users (i.e., I'm not defining the architecture myself) that potentially contain word embeddings. I need to compute the gradient of the loss function w.r.t. the inputs. Now, I know that embedding lookups are discrete, thus non-differentiable. Is it possible to compute the gradients w.r.t. the embedding vectors themselves (i.e., the outputs of the Embedding layer)? Any help is much appreciated.


#2

Here is an example:

import mxnet as mx
from mxnet import nd, autograd, gluon

net = gluon.nn.Embedding(1000, 50)
net.initialize()

# 100 random integer indices in [0, 999]
x = nd.cast(nd.clip(nd.random_uniform(0, 1000, shape=(100,)), 0, 999), 'int32')
with autograd.record():
    emb = net(x)  # emb.shape = (100, 50)
    out = nd.mean(emb)  # replace nd.mean with some loss calculation
out.backward()
emb_grad = net.weight.grad()  # shape (1000, 50): gradient w.r.t. the embedding weight matrix

#3

@safrooze, thank you for your example! It works well! I do have a follow-up question. With your solution, the shape of the obtained gradient matches that of the embedding weight matrix (i.e., (vocab_size, embedding_dims)), but the batch size of the input disappears (I suppose gradients are summed over inputs in this case). Is there a way to get these same gradients w.r.t. each input sample, resulting in something of shape (batch_size, vocab_size, embedding_dims)? Thanks again!


#4

The memory required for collecting gradients of each parameter is allocated during network initialization, and its size is equal to the size of the parameter. If you want separate gradients for different outputs, you'd have to do multiple backward calls and copy the gradients out each time. Here is the updated example:

net = gluon.nn.Embedding(1000, 50)
net.initialize()

x = nd.cast(nd.clip(nd.random_uniform(0, 1000, shape=(100,)), 0, 999), 'int32')
with autograd.record():
    emb = net(x)  # emb.shape = (100, 50)
    out = nd.split(emb, 100, axis=0)  # one output per input sample
grads = list()
for o in out:
    o.backward(retain_graph=True)
    # grad() returns the same NDArray every time, so copy it before the
    # next backward() call overwrites it
    grads.append(net.weight.grad().copy())
grads = nd.stack(*grads)  # shape (100, 1000, 50): one weight gradient per sample

Please note that with the above code, each backward() call replaces the gradients of the previous backward() call. If you intend to accumulate gradients in net.weight's gradient NDArray for use with an optimizer, you'd need to set net.weight.grad_req = 'add' and keep in mind that each backward() call then sums into the gradient buffer, so you'd have to subtract the previous gradient value from the current one to recover the gradient of a single backward() pass.