Confusion over implementation of Embedding: dense or row_sparse weights?


Hi all,

I’m hoping the MXNet hive mind can clarify something for me. I was reading up on MXNet’s support for sparse gradients; specifically, with embedding models in mind. The documentation on RowSparseNDArray says:

In MXNet, sparse gradient updates are applied when weight, state and gradient are all in row_sparse storage.

This statement implies that sparse gradient updates only occur when all three variables are row_sparse.

In embedding models, the gradient of the loss is typically sparse, because only a few embeddings are involved in each minibatch. However, the weight matrix, which contains all of the embeddings, is typically dense. The following (pseudo)code is a simplification of an embedding network:

# One-hot encoded mini-batch inputs (sparse):
X = mx.nd.sparse.zeros('row_sparse', shape=(batch_size, n_items))
# Weights, i.e., item embeddings (dense):
W = mx.nd.zeros(shape=(n_items, d_embed))
# Embedding of mini-batch (product of inputs and weights):
mb_embed = mx.nd.dot(X, W)

Note that X is sparse, but W and mb_embed are dense. Now, if we took the gradient of mb_embed w.r.t. W, we would have a row_sparse matrix; in fact, the gradient is simply X. But we shouldn’t need W to be sparse.
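To make the sparsity argument concrete, here is a small NumPy sketch (plain NumPy, not MXNet). The sizes, the item ids, and the choice of loss (the sum of the mini-batch embeddings) are assumptions made up for illustration. It shows that W stays dense while the gradient w.r.t. W has non-zero rows only for the items that appear in the mini-batch:

```python
import numpy as np

# Hypothetical sizes and item ids, chosen only for illustration.
batch_size, n_items, d_embed = 3, 8, 4
item_ids = np.array([1, 5, 5])

# One-hot encoded mini-batch inputs.
X = np.zeros((batch_size, n_items))
X[np.arange(batch_size), item_ids] = 1.0

# Dense weights, i.e., item embeddings.
W = np.random.default_rng(0).normal(size=(n_items, d_embed))

# Embedding of the mini-batch: equivalent to looking up rows of W.
mb_embed = X @ W

# Assumed loss: mb_embed.sum(), so the upstream gradient is all ones.
grad_out = np.ones_like(mb_embed)

# Gradient of the loss w.r.t. W via the chain rule.
grad_W = X.T @ grad_out

# Only the rows of items seen in the batch are non-zero.
nonzero_rows = np.flatnonzero(np.abs(grad_W).sum(axis=1))
print(nonzero_rows)
```

Every other row of grad_W is exactly zero, which is why a row_sparse gradient can describe the update even though W itself is stored densely.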

The documentation on Embedding layers is vague. The Symbol API doc says the following about mxnet.symbol.Embedding:

The storage type of weight can be either row_sparse or default.

There is also a constructor argument, sparse_grad:

If sparse_grad is set to True, the storage type of gradient w.r.t weights will be “row_sparse”.

Does this behavior depend on weight also being row_sparse? It shouldn’t, but I can’t find the implementation, so I can’t check for myself. Gluon’s API doc does not explicitly discuss the sparsity of weight, but the description, “Turns non-negative integers (indexes/tokens) into dense vectors” implies that weight is dense. It also has the sparse_grad argument.

Is the doc for RowSparseNDArray simply out of date? Are sparse gradients now supported with dense weights? If not – that is, if the RowSparseNDArray doc is correct, and sparse gradients require sparse weights – then why?

Thanks in advance for your help!

== Ben


Thanks for pointing this out. Sparse gradients can be applied to both dense and row_sparse weights. I am updating the outdated row_sparse ndarray tutorial in