This issue has been bothering me all week, and much more than is remotely reasonable. So I thought I’d just try to write it down and see if I’m the only one worrying about this or not. At least someone with the name @piiswrong might understand my unholy obsession with a bad definition.

The short version: MXNet doesn’t support arrays of rank 0 (arrays where the shape is an empty tuple). In all places where one would usually use them, it uses vectors of length 1 (arrays with shape `(1,)`). This is wrong.

Let’s start with a bit of background. Multidimensional array libraries are usually in some way inspired by tensors, mathematical objects from linear algebra that generalize vectors and matrices. They ignore parts (like covariance and contravariance), but mostly use similar nomenclature and semantics. Each tensor has a `rank` (also sometimes `ndim` or the length of the shape), the number of indices that are needed to end up with a scalar. A scalar has rank 0 (no indexing is needed to get a scalar), a vector has rank 1 (we need one index) and a matrix has rank 2. When we provide `m` indices for a rank `n` tensor, we get a rank `n - m` tensor. When we stack tensors of rank `n`, we get a tensor of rank `n + 1`. When we reduce `m` axes from a rank `n` tensor, we end up with a rank `n - m` tensor. If we reduce all `n` axes we get a rank `n - n = 0` tensor, i.e. a scalar. Reductions and indexing decrease the rank, stacking increases it.
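
To make the rank rules concrete, here is a small sketch in NumPy (used only as a reference for the semantics described above, not as a statement about MXNet). The array `a` and its shape are arbitrary choices for illustration.

```python
import numpy as np

a = np.zeros((3, 4, 5))  # rank 3

# Indexing with m indices lowers the rank by m.
rank_after_one_index = a[0].ndim             # 3 - 1
rank_after_two_indices = a[0, 1].ndim        # 3 - 2
rank_after_three_indices = a[0, 1, 2].ndim   # 3 - 3: a scalar

# Stacking rank-n arrays gives a rank n + 1 array.
stacked = np.stack([a, a])

# Reducing m axes lowers the rank by m; reducing all axes gives rank 0.
reduced_one = a.sum(axis=0)
reduced_two = a.sum(axis=(0, 1))
reduced_all = a.sum()

assert (rank_after_one_index, rank_after_two_indices, rank_after_three_indices) == (2, 1, 0)
assert stacked.ndim == a.ndim + 1
assert (reduced_one.ndim, reduced_two.ndim, np.ndim(reduced_all)) == (2, 1, 0)
```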

MXNet conflates two different things: a rank 1 array with shape `(1,)` and a rank 0 array with shape `()`. Both have size `prod(shape) = 1` (empty products are defined as 1), but they have different shapes.
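
For comparison, this is how the two objects look in NumPy, which does keep them apart. This is a minimal sketch; the variable names are mine.

```python
import math

import numpy as np

scalar = np.zeros(())    # rank 0, shape ()
vector = np.zeros((1,))  # rank 1, shape (1,)

# Both hold exactly one element ...
assert scalar.size == 1 and vector.size == 1
# ... because the size is the product of the shape, and the empty product is 1.
assert math.prod(scalar.shape) == 1
assert math.prod(vector.shape) == 1

# But they are different objects: different shape, different rank.
assert scalar.shape == () and scalar.ndim == 0
assert vector.shape == (1,) and vector.ndim == 1
```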

So why should we care?

- This breaks all the nice properties of indexing and ranks. As a consequence it produces strange corner cases. For example, in numpy it is always equivalent (although maybe slower) to do indexing one at a time: `a[i, j]` is always the same as `a[i][j]`. This is not true in MXNet. If `a = mx.nd.zeros((2,))`, then `a[0, 0]` raises an error, but `a[0][0]` doesn’t. In fact, one can continue indexing as long as one wants: `a[0][0][0][0]` is still valid. Or in numpy we can be sure that `a.sum(-1)` has fewer dimensions than `a`. Not so in MXNet: `a.sum(-1).shape == (1,)` and `a.shape == (2,)`.
- Each dimension in datasets usually has a meaning. In many formats for scientific datasets this is even formalized, e.g. in netCDF each dimension has associated coordinates, just like the index of a pandas or R dataframe. If we think of scalars as rank 1 arrays, we would have to be able to answer the question: what coordinates does that one dimension have? It is in the nature of scalar values that this question doesn’t have an answer. Let’s say for example that we have a dataset of how many people died in different years and different cities from cholera. We can store that in a rank 2 array. The first dimension is the city, the second the year. The values in the array are the number of people who died. If we sum over the first axis (`a.sum(0)`), we get a rank 1 array, where the remaining coordinate is the year. If we sum over the second axis (`a.sum(1)`) the remaining coordinate is the city. If we sum over both, then what coordinates should the axis that remains in MXNet have? It is just the total number of people who died from cholera, there are no coordinates in sight anywhere.
- If we want to stack arrays of shape `(1,)` in MXNet, we can’t know what the shape of the result should be. Let’s say we get a bunch of `(1,)` arrays. They could be the trace of a parameter during optimisation, or samples from the posterior of a parameter. If we stack them together into one array, should the result have shape `(n,)` or shape `(n, 1)`? The first would be what we want if our individual arrays are scalars, the second if they are vectors that happen to have length 1.
- It might introduce bugs. I’m guessing that most people using mxnet are familiar with numpy, so if mxnet has subtly different indexing behaviour than numpy, that might trip up a lot of people. But it also seems to lead to trouble within mxnet itself and its libraries, see for example issue #8239, or the fact that tvm segfaults if it tries to generate code that reduces all axes of an array. And those are just shape related bugs that I ran into before I realised mxnet doesn’t use scalars, I haven’t been actively looking for them.
- Interoperability. Most other frameworks get this right by now. Older languages sometimes didn’t (R and matlab come to mind), but all other deep learning frameworks I know do. Especially since nnvm tries to provide a common format for sharing graphs, this seems very relevant. It is impossible to convert between different learning frameworks seamlessly if the indexing semantics differ.
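
The NumPy side of the points above can be sketched as follows. This only shows the reference behaviour I’d expect, it does not reproduce the MXNet results. The `deaths` numbers are made up for illustration, and `raises_index_error` is a small helper of my own.

```python
import numpy as np


def raises_index_error(fn):
    """Return True if calling fn() raises IndexError."""
    try:
        fn()
    except IndexError:
        return True
    return False


# Indexing: a rank 1 array accepts exactly one index, however it is written.
v = np.zeros((2,))
assert raises_index_error(lambda: v[0, 0])
assert raises_index_error(lambda: v[0][0])  # v[0] is a scalar, nothing left to index

# Reducing the only axis of a rank 1 array leaves rank 0.
assert np.ndim(v.sum(-1)) == 0

# Coordinates: rows are cities, columns are years (placeholder values).
deaths = np.array([[12, 7, 3],
                   [5, 9, 1]])
per_year = deaths.sum(0)  # shape (3,): one value per year
per_city = deaths.sum(1)  # shape (2,): one value per city
total = deaths.sum()      # shape (): no axis left, so no coordinates needed

# Stacking: scalars and length-1 vectors give different, unambiguous results.
n = 4
stacked_scalars = np.stack([np.zeros(()) for _ in range(n)])
stacked_vectors = np.stack([np.zeros((1,)) for _ in range(n)])
assert stacked_scalars.shape == (n,)
assert stacked_vectors.shape == (n, 1)
```

Because the inputs carry their rank with them, the stacking question from the list above never comes up: the shape of the result follows from the shape of the pieces.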

I don’t think this can be fixed without *any* incompatible changes, but hopefully it wouldn’t impact all that much code. Most deep learning applications don’t seem to deal with scalars that much. Any chance this might happen at some point?