Rank 0 arrays in MXNet aka pi is wrong


#1

This issue has been bothering me all week, and much more than is remotely reasonable. So I though I’d just try to write it down and see if I’m the only one worrying about this or not. At least someone with the name @piiswrong might understand my unholy obsession with a bad definition.

The short version: MXNet doesn’t support arrays of rank 0 (arrays where the shape is an empty tuple). In all places where one would usually use them, it uses vectors of length 1 (arrays with shape (1,)). This is wrong.

Let’s start with a bit of background. Multidimansional array libraries are usually in some way inspired by tensors – mathematical objects from linear algebra that generalize vectors and matrices. They ignore parts (like covariance and contravariance), but mosly use similar nomenclature and semantics. Each tensor has a rank (also sometimes ndim or the length of the shape), the number of indices that are needed to end up with a scalar. A scalar has rank 0 (no indexing is needed until we get a scalar), a vector has rank 1 (we need one index) and a matrix has rank 2. When we provide m indices for a rank n tensor, we get a rank m - n tensor. When we stack tensors of rank n, we get a tensor of rank n + 1. When we reduce m axes from a rank n tensor, we end up with a rank n - m tensor. If we reduce all n axis we get a rank n - n = 0 tensor – a scalar. Reductions and indexing decrease the rank, stacking increases it.

MXNet conflates two different things: A rank 1 array with shape (1,) and a rank 0 array with shape (). Both have size prod(shape) = 1. (Empty products are defined as 1), but they have different shapes.

So why should we care?

  • This beaks all the nice properties of indexing and ranks. As a consequence it produces strange corner cases. For example, in numpy it is always equivalent (although maybe slower) to do indexing one at a time: a[i, j] is always the same as a[i][j]. This is not true in MXNet. If a = mx.nd.zeros((2,)), then a[0, 0] raises an error, but a[0][0] doesn’t. In fact, one can continue indexing as long as one wants: a[0][0][0][0] is still valid. Or in numpy we can be sure that a.sum(-1) has fewer dimensions that a. Not so in MXNet: a.sum(-1).shape == (1,) and a.shape == (2,).
  • Each dimension in datasets usually has a meaning. In many formats for scientific datasets this is even formalized, eg in netcdf each dimension has associated coordinates, just like the index of a pandas or R dataframe. If we think of scalars as rank 1 arrays, we would have to be able to answer the questing: What coordinates does that one dimension have. It is in the nature of scalar values that this question doesn’t have an answer. Let’s say for example that we have a dataset of how many people died in different years and different cities from cholera. We can store that in a rank 2 array. The first dimension is the city, the second the year. The values in the array are the number of people who died. If we sum over the first axis (a.sum(0)), we get a rank 1 array, where the remaining coordinate is the year. If we sum over the second axis (a.sum(1)) the remaining coordinate is the city. If we sum over both, then what coordinates should that in MXNet remaining axis have? It is just the total number of people who died from cholera, there are no coordinates in sight anywhere.
  • If we want to stack arrays in MXNet of shape (1,), we can’t know what the shape of the result should be. Let’s say we get a bunch of (1,) arrays. They could be the trace of a parameter during optimisation, or samples from the posterior of a parameter. If we stack them together into one array, should the result have shape (n,) or shape (n, 1)? The first would be what we want if our individual arrays are scalars, the second if they are vectors that happen to have length 1.
  • It might introduce bugs. I’m guessing that most people using mxnet are familiar with numpy, so if mxnet has subtly different indexing behaviour than numpy, that might trip up a lot of people. But it also seems to lead to trouble within mxnet and its libraries itself, see for example issue #8239, or the fact that tvm segfaults if it tries to generate code that reduces all axes of an array. And those are just shape related bugs that I ran into before I realised mxnet doesn’t use scalars, I haven’t been actively looking for them.
  • Interoperability. Most other frameworks get this right by now. Older languages sometimes didn’t (R and matlab come to mind), but all other deep learning frameworks I know do. Especially since nnvm tries to provide a common format for sharing graphs, this seems very relevant. It is impossible to convert between different learning frameworks seamlessly if the indexing semantics differ.

I don’t think this can be fixed without any incompatible changes, but hopefully it wouldn’t impact all that much code. Most deep learning applications don’t seem to deal with scalars that much. Any chance this might happen at some point?


#2

That’s about as cogent a reasoning as I’ve ever heard for this. Wow! Let me circle back with @piiswrong and @zack how much this would break and how painful it would be.


#3

These are all fair arguments but due to legacy reasons MXNet was built without scalar support. Adding it is highly non-trivial and will involve many non-backward-compatible changes.

We may be able to do this when we make gluon a separate package or when we release mxnet 2.0


#4

Because of a recent discussion, I thought I’d write a bit more about this, especially I’ll try to explain a bit better why I think this is important (some thinking in the meantime helped me a lot to figure this out for myself as well). I know of course that backward compatibility is important, and that developer time is very much a finite resource. While I’d really like a change to support scalars (enough to put in some work myself), I fully understand if this isn’t going to happen.

A useful way of looking at this is by considering indexes. As great as numpy is, in statistical applications, we really want to have the features of pandas. When working on real datasets with raw numpy it is incredibly easy to get lost about what number was supposed to mean what. Adding an index to arrays makes this much easier and also helps tremendously with plotting, as we don’t need to figure out labels all the time ourselves (and if you get those wrong it can end up being very embarrassing). I hope I don’t need to argue about how useful that is, the popularity of pandas kind of speaks for itself I think.

Right now, neither sklearn, PyMC3 or Stan really take advantage of indexes (most of R does, and so do statsmodels and rstanarm). Instead, we read in data with an index, forget that index and put raw arrays into our model, do some stuff and try to manually figure out the index for the results. This isn’t nice. For sklearn there is some third party support for working with indexes directly: sklearn-pandas and ibex. This seems fairly popular if you take into account that both of those are just glorified hacks (I’m not trying to criticise the authors, I think they are doing a great job under the circumstances). If and when we move PyMC to MXNet, I’d like to change this and add some support for indexes. It seems to me that the missing support for scalars adds a whole lot of strange user visible corner cases to any system for that.

The concept of an index is quite similar to the concept of a basis, it tells us what those numbers in the array actually mean. Just as we need one basis per dimension of a tensor, we also need one index per dimension of an array. Interestingly, pandas doesn’t add index support for numpy in general, but only for one and two dimensional arrays. For scalars there really isn’t anything to add to numpy. pd.Series is for one dimensional arrays (the index is just pd.Seriers.index) and pd.DataFrame for two dimensional arrays (the indexes are pd.DataFrame.index and pd.DataFrame.columns). It also used to have a Panel for 3d data, but this was removed recently in favor of the external package xarray. For applications where we need support for the whole general n-dimensional case (in PyMC3 we always have at least two dimensions, one for the different samples from the posterior, and one because we sample in several chains), we need to look elsewhere for clues about how to do this properly. Luckily, people have been doing this for quite some time, mostly in the geosciences for storing climate and weather data or similar as far as I can tell. netcdf seems to be the most advanced version of this, and with xarray we have good support for this data model in python, too. The basic idea is that each dimension in an xarray.DataArray has a name, and also an index. This can then be used for plotting, or to select or combine data.

To give you an idea about how this could end up in PyMC3, let’s look at a very small example. Say we have a dataset with survival times of some cancer patients who were treated with different drugs, and we are trying to figure out if some of those drugs worked better than others. The dataset could maybe look a bit like this:

>>> data.head()
 	sex	    survival	treatment
0	female	10.520381	sorafenib
1	male	33.079531	doxorubicin
2	female	17.246785	sorafenib
3	male	6.151865	sorafenib
4	female	2.245799	doxorubicin

For illustration only, a very simple (and somewhat stupid) model could involve the following parameters:

  • intercept: How long do people with this kind of cancer live in general
  • sex: One parameter, that says how much longer or shorter men live, and one parameter how much longer or shorter women live on average
  • treatment: One parameter for each treatment, how does that treatment influence the survival time?
  • sd: How much does the survival vary between persons, given that they share the same sex and treatment.

In a future pymc version, with basic support for labeled indexes, this could look a little like this:

# Which dimensions do we look at in the whole model
coords = {
    # We consider the two treatments: sorafenib and doxorubicin
    'treatment': pd.Index(data.treatment.unique()),
    # Each partient has a unique id
    'patient': data.index,
    # Patients can be male or female
    'sex': pd.Index(data.sex.unique()),
}

with pm.Model(coords=coords) as model:
    # The intercept is just a number, it doesn't have any
    # index associated to it.
    intercept = pm.Flat('intercept', dims=())

    # One value for each of the two treatments
    treatment = pm.Cauchy('treatment_effect', dims=('treatment',))

    # One value for each of the two sexes
    sex = pm.Cauchy('sex_effect', dims=('sex',))

    # An interaction term might have dims=('sex', 'treatment')
    
    treat_idx = coords['treatment'].get_indexer(data.treatment)
    sex_idx = coords['sex'].get_indexer(data.sex)

    mu = (intercept
        + treatment[treat_idx]
        + sex[sex_idx])
    
    # Again, just a number
    sd = pm.HalfCauchy('sd', beta=1, dims=())
    pm.Normal('survival', mu=mu, sd=sd, observed=np.log(data.survival))

After sampling, we get an xarray.DataArray for each parameter, but to each parameter, we add two additional dimensions: 'chain' and 'sample'. So intercept should have dims=('chain', 'sample'), and treatment_effect should have dims=('chain', 'sample', 'treatment').

Without support for scalars, the dims of intercept would be ('chain', 'sample', 1) however. The last dimension has no name, no index and no meaning. It just shouldn’t be there in the first place. The same happens with reductions, they introduces new dimensions, that do not have associated names and indexes.

Upgrade path

Maybe there is a way to get this into mxnet without api breakage. Let’s break the problem into parts a bit:

  • nnvm can not store or work with graphs that involve scalars, as it uses empty tuples as marker for unknown shapes. (In the class TShape)
  • The shape inference functions in nnvm for reductions assume that reductions return at least shape (1,)
  • custom op implementations might depend on the fact that empty tuples are markers for unknown shapes during shape inference.
  • The implementations of reductions in mxnet return at least shape (1,), they are essentially equivalent to the numpy np.atleast1d(array.sum(*args)).
  • infer_shape_partial returns empty tuples as markers for unknown shapes.

Allowing nnvm to work with scalars could be useful for all kinds of unrelated things as well. For example right now it is impossible to represent the quite simple tensorflow graph tf.placeholder().sum() in nnvm. From the architecture of nnvm I can’t really see why it couldn’t support both. A way to deal with the first three issues might be to add explicit support for known scalar shapes to TShape, eg add an additional field known_scalar. If a shape inference function returns an empty shape with this field set, treat the result as scalar. If known_scalar defaults to false, this shouldn’t break any existing Ops. All reduction operations could get an additional attribute atleast1d. If this is defined and set to false, the result should be a scalar when appropriate.

So from the point of view of nnvm, this would strictly be additional functionality.

The actual implementations for the reductions in mxnet seem to mostly share a single template implementation. This could be changed to just compute the atleast1d=False version, and if atleast1d=True and the result happens to be a scalar, return atleast1d(result).

I’m not sure if the existing mxnet c-api functions can be adapted to take advantage of this additional functionality. If not, it should be possible to just add functions instead of change the existing ones.

On the python side all reductions could get an additional kwarg atleast1d, that defaults to True. infer_shape_partial could get a kwarg unknown_marker, which can be either None or (), with a default of ().

If at some point in the future someone wanted to make the scalar versions the default, this would be I think far easier than just changing it suddenly from the way it is now. There could be a period where usage that would change in the next version prints a warning.

In general I think it would be better to do a change like this sooner than later (if at all that is of course). The arguments are not going to change, only more and more people might depend on the current behaviour.


#5

Thanks for the thoughtful suggestions. Most of them makes sense.

One other issue is if x=array([1,2,3]) then x[1] returns [2] instead of scalar 2.

It’s probably better to have a global flag that you can set to indicate whether you want legacy mode (no scalar) or new mode (with scalar support).

This is a lot of work though. Will take some time before we can get to it.


#6

Another backward compatible way of fixing the indexing is an accessor like pandas .loc. So maybe something like x[1] returns [2] and x.sc[1] returns 2. But a global flag might be the better idea, it doesn’t clutter the api as much.


#7

I just stumbled across this passage in the pytorch distribution design document:

The size of a sample

	Normal(mu, sigma).sample(sample_shape)

is given by

	sample_shape + batch_shape + event_shape

Modeling the behavior in PyTorch after that in Tensorflow seems like the way to go. However, we need to be careful about edge cases that arise from the fact that PyTorch currently does not support scalar Variables.
As an example, what should the event_shape be for distributions that have scalar samples? There are two choices here

  1. Scalar samples have event_shape = (,). This means that a sample Normal(mu,sigma).sample() has the same size as mu and sigma.
  2. Scalar samples have event_shape = (1,). This means that a sample Normal(mu,sigma).sample() has the size as mu.size() + (1,).

As a second question, what should the batch_shape be when you define Normal(0,1)? Here again there are two possible choices

  1. Normal(0,1) has batch_shape = (,). This means that a sample Normal(0,1).sample() should have size (,), which is promoted to (1,). Normal(0,1).sample(1) will also have size (1,)
  2. Normal(0,1) has batch_shape = (1,). This means that Normal(0,1).sample(1) will have size (1,1)

These question disappear once PyTorch implements support for scalar Variables.