I’m hoping the MXNet hive mind can clarify something for me. I was reading up on MXNet’s support for sparse gradients; specifically, with embedding models in mind. The documentation on RowSparseNDArray says:
> In MXNet, sparse gradient updates are applied when weight, state and gradient are all in row_sparse storage.

This statement implies that sparse gradient updates only occur when all three variables are row_sparse.
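For context, a "sparse gradient update" in the optimizer means touching only the rows present in the gradient. Here is a minimal pure-Python sketch of a row-sparse SGD-with-momentum step (illustrative names and shapes, not MXNet's implementation):

```python
def sparse_sgd_momentum_step(weight, momentum, grad_rows, lr=0.1, mu=0.9):
    # `grad_rows` maps row index -> gradient row, i.e. a row_sparse
    # gradient. Only those rows of `weight` and the optimizer state
    # `momentum` are updated; all other rows are skipped entirely.
    for i, g in grad_rows.items():
        momentum[i] = [mu * m + gj for m, gj in zip(momentum[i], g)]
        weight[i] = [w - lr * m for w, m in zip(weight[i], momentum[i])]

# Three embeddings of size 2; only item 1 received gradient.
W = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
M = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
sparse_sgd_momentum_step(W, M, {1: [1.0, 1.0]})
# Only row 1 changed: W[1] == [1.9, 1.9]; rows 0 and 2 untouched.
```

Note that nothing in this update requires `weight` or `momentum` themselves to be stored sparsely; only the gradient's row structure is needed to know which rows to touch.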
In embedding models, the gradient of the loss is typically sparse, because only a few embeddings are involved in each minibatch. However, the weight matrix, which contains all of the embeddings, is typically dense. The following (pseudo)code is a simplification of an embedding network:
```python
import mxnet as mx
from mxnet import nd

# One-hot encoded mini-batch inputs (row_sparse storage):
X = nd.sparse.zeros('row_sparse', (batch_size, n_items))
# Weights, i.e., item embeddings (dense):
W = nd.zeros((n_items, d_embed))
# Embedding of the mini-batch:
mb_embed = X.dot(W)
```
`X` is sparse, but `W` and `mb_embed` are dense. Now, if we take the gradient of the loss with respect to `W`, we get a row_sparse matrix; in fact, the gradient is `X`-transpose times the upstream gradient, so its only nonzero rows are the items that appear in the mini-batch. But we shouldn't need `W` itself to be sparse for that to hold.
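To make that concrete: if `G` is the gradient of the loss with respect to `mb_embed`, then the gradient with respect to `W` is `X`-transpose times `G`, and its nonzero rows are exactly the items active in `X`. A small pure-Python check (illustrative shapes, lists-of-lists standing in for NDArrays):

```python
def matmul_t(X, G):
    # Compute X^T . G; the result has shape (n_items, d_embed).
    n_items, d = len(X[0]), len(G[0])
    out = [[0.0] * d for _ in range(n_items)]
    for b in range(len(X)):
        for i in range(n_items):
            if X[b][i] != 0:            # skip items not in this sample
                for j in range(d):
                    out[i][j] += X[b][i] * G[b][j]
    return out

# Batch of 2, 4 items, embeddings of size 3; each input is one-hot.
X = [[0, 1, 0, 0],
     [0, 0, 0, 1]]          # items 1 and 3 appear in the mini-batch
G = [[1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0]]       # upstream gradient d(loss)/d(mb_embed)
dW = matmul_t(X, G)
nonzero_rows = [i for i, row in enumerate(dW) if any(row)]
# nonzero_rows == [1, 3]: only embeddings used in the batch get gradient.
```

The weight matrix never enters this computation, which is why its storage type shouldn't matter for the gradient's sparsity.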
The documentation on Embedding layers is vague. The Symbol API doc says the following about `mxnet.symbol.Embedding`:

> The storage type of `weight` can be either row_sparse or default.
There is also a constructor argument, `sparse_grad`; the doc says that when `sparse_grad` is set to `True`, "the storage type of gradient w.r.t weights will be row_sparse".
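That a row_sparse gradient falls out naturally can be seen from what an embedding lookup does: the forward pass gathers rows selected by the input indices, and the backward pass scatter-adds output gradients back into those same rows. A pure-Python sketch (function names are illustrative, not MXNet's implementation):

```python
def embedding_forward(weight, indices):
    # Forward: gather the rows of `weight` selected by `indices`.
    return [weight[i][:] for i in indices]

def embedding_backward(grad_output, indices):
    # Backward: scatter-add each output gradient row into the weight
    # gradient at the corresponding index. Only rows that appear in
    # `indices` are touched, so the gradient is naturally row-sparse,
    # regardless of how `weight` itself is stored.
    grad_rows = {}  # row index -> gradient row (a "row_sparse" dict)
    for g, i in zip(grad_output, indices):
        if i not in grad_rows:
            grad_rows[i] = [0.0] * len(g)
        grad_rows[i] = [a + b for a, b in zip(grad_rows[i], g)]
    return grad_rows

# Tiny example: 4 embeddings of size 2; the batch looks up rows 1 and 3.
W = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
out = embedding_forward(W, [1, 3])   # [[1.0, 1.0], [3.0, 3.0]]
grad = embedding_backward([[0.5, 0.5], [1.0, 1.0]], [1, 3])
# Only rows 1 and 3 carry gradient; rows 0 and 2 are implicitly zero.
```

Nothing here depends on the storage type of `weight`, which is what makes the coupling in the RowSparseNDArray doc surprising.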
Does this behavior depend on `weight` also being row_sparse? It shouldn't, but I can't find the implementation, so I can't check for myself. Gluon's API doc does not explicitly discuss the sparsity of `weight`, but the description, "Turns non-negative integers (indexes/tokens) into dense vectors", implies that `weight` is dense. It also has the `sparse_grad` constructor argument.
Is the doc for RowSparseNDArray simply out of date? Are sparse gradients now supported with dense weights? If not – that is, if the RowSparseNDArray doc is correct, and sparse gradients require sparse weights – then why?
Thanks in advance for your help!