Hi all,

I’m hoping the MXNet hive mind can clarify something for me. I was reading up on MXNet’s support for sparse gradients; specifically, with embedding models in mind. The documentation on RowSparseNDArray says:

> In MXNet, sparse gradient updates are applied when weight, state and gradient are all in `row_sparse` storage.

This statement implies that sparse gradient updates *only* occur when *all three* variables are row_sparse.

In embedding models, the gradient of the loss is typically sparse, because only a few embeddings are involved in each minibatch. However, the weight matrix, which contains all of the embeddings, is typically *dense*. The following (pseudo)code is a simplification of an embedding network:

```
import mxnet as mx
from mxnet import nd

# (Schematic -- actual sparse dot support depends on storage types.)
# One-hot encoded mini-batch inputs (sparse):
X = nd.sparse.zeros('row_sparse', shape=(batch_size, n_items))
# Weights, i.e., item embeddings (dense):
W = nd.zeros((n_items, d_embed))
# Embedding of mini-batch:
mb_embed = nd.dot(X, W)
```

Note that `X` is sparse, but `W` and `mb_embed` are dense. Now, by the chain rule, the gradient of the loss w.r.t. `W` is `X.T` times the upstream gradient w.r.t. `mb_embed`, so its only nonzero rows are those of items that actually appear in the mini-batch; it is naturally row_sparse. But we shouldn’t *need* `W` to be sparse for that.
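To make that concrete, here is a small NumPy sketch (all sizes and variable names are made up for illustration) showing that the gradient `X.T @ G` has nonzero rows only for items present in the mini-batch:

```python
import numpy as np

# Illustrative sizes (hypothetical)
batch_size, n_items, d_embed = 4, 10, 3

rng = np.random.default_rng(0)

# One-hot mini-batch: each row selects a single item
item_ids = rng.integers(0, n_items, size=batch_size)
X = np.zeros((batch_size, n_items))
X[np.arange(batch_size), item_ids] = 1.0

# Upstream gradient dL/d(mb_embed), dense
G = rng.standard_normal((batch_size, d_embed))

# Chain rule for mb_embed = X @ W:  dL/dW = X.T @ G
G_W = X.T @ G

# Only the rows of items that appear in the mini-batch are nonzero
nonzero_rows = np.flatnonzero(np.abs(G_W).sum(axis=1))
```

So with, say, 4 examples over 10 items, at most 4 of the 10 gradient rows are nonzero, regardless of how large `n_items` grows.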

The documentation on Embedding layers is vague. The Symbol API doc says the following about mxnet.symbol.Embedding:

> The storage type of `weight` can be either row_sparse or default.

There is also a constructor argument, `sparse_grad`:

> If `sparse_grad` is set to `True`, the storage type of gradient w.r.t weights will be “row_sparse”.

Does this behavior depend on `weight` also being row_sparse? It shouldn’t, but I can’t find the implementation, so I can’t check for myself. Gluon’s API doc does not explicitly discuss the sparsity of `weight`, but the description, “Turns non-negative integers (indexes/tokens) into dense vectors”, implies that `weight` is dense. It also has the `sparse_grad` argument.

**Is the doc for RowSparseNDArray simply out of date? Are sparse gradients now supported with dense weights? If not – that is, if the RowSparseNDArray doc is correct, and sparse gradients require sparse weights – then why?**

Thanks in advance for your help!

== Ben