Moving PyMC3 from Theano to MXNet

I am one of the developers of PyMC3, a package for Bayesian statistics. At the moment we use Theano as a backend, but as you might have heard, development of Theano is about to stop. Currently we are looking at TensorFlow, MXNet and PyTorch as possible replacements.

Our use case for MxNet would differ from most deep learning applications in a few ways:

  • We do not build models ourselves, but provide building blocks and inference algorithms for users. Users would use both our library and mxnet in their models. Not all our users come from a computer science background; many come from other fields in science and business.
  • Many models in PyMC3 are much smaller than the typical neural net, but in many cases we require a larger number of function and gradient evaluations. Even a relatively small overhead on the Python side or inside the computational graph can add up quickly.
  • Custom Ops tend to be relatively common, e.g. for calling into external code or for minimising functions.

Do you think MxNet could be a good fit for this use case?

While prototyping we ran into a couple of issues (in no particular order):

  • It is important that users get good error messages for misspecified models that point to the correct source of the problem. If you define symbols with incompatible shapes and add them, you currently don't get an exception until you call bind or infer_shape (i.e. mx.sym.var('a', shape=(2,)) + mx.sym.var('b', shape=(3,)) doesn't raise an exception; a minimal reproduction follows this list).
  • We couldn't find a way to create a modified copy of a symbol. Theano allows us to replace inputs or parts of the graph with other subgraphs (the replace argument of theano.clone). This gives us a lot of flexibility to switch between inference algorithms for a given model.
  • mx.sym doesn't have a concept of a scalar, which can lead to a lot of confusion in PyMC3 (and, judging by the invalid memory access in #8133, for mxnet as well). This fact would be exposed to users, who are usually used to numpy.
  • There doesn't seem to be support for constants. In large parts of our code base each value can be either a parameter or a constant depending on how it is used. We can't easily represent constant values as symbols to make calls like mx.sym.broadcast_add(a, b) possible independent of whether a and b are symbols, Python scalars or numpy arrays. Similarly, we and our users can't add or multiply symbols by numpy arrays. Would it be possible to implement something like mx.sym.constant, which returns a symbol with auxiliary data, such that this auxiliary data has a default value?
  • Some special functions, like erf and the incomplete gamma and beta functions, are missing at the moment. A function broadcast_sum_n would also be very nice.
  • We can’t live without fancy indexing.
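For concreteness, here is a minimal reproduction of the deferred shape error from the first bullet (this just demonstrates the behaviour described above):

import mxnet as mx

a = mx.sym.var('a', shape=(2,))
b = mx.sym.var('b', shape=(3,))
c = a + b        # no exception here, despite the incompatible shapes

c.infer_shape()  # only now does an error about the shapes surface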

Do you think those could be solved or worked around in principle?

5 Likes

First off, thanks for reaching out! Given that we currently don't have a good Bayesian statistics package in MxNet, this would be a welcome addition. Let me give you a few high-level thoughts first on what might be possible before we dive into the details:

  • I think that some things could be done with regard to fast samplers on large domains and for discrete distributions. What I mean by that is using things like the Alias trick, efficient proposals, etc. To get an idea of what I'm talking about, have a look at the canopy sampler, which gives you a super fast sampler for large amounts of data, large numbers of categories and high dimensionality. Likewise, check out the fast LDA samplers. (A minimal sketch of the Alias trick follows this list.)
  • Stochastic variational inference (à la Matt Hoffman and co.) is probably also a good direction to proceed in. We recently added good sparse support to MxNet, both in terms of matrices and vectors and in terms of gradients (that's where it matters when you're sampling).
  • More generally, the notion of XRPs (exchangeable random processes) that Vikash Mansinghka and co. proposed would also be quite relevant.
  • Some of the concerns you're mentioning are very much related to a not-so-efficient programming model (at least, one that isn't very efficient if you want to take real advantage of a modern deep learning toolkit, regardless of MxNet/PyTorch/TF). That is, scalars are evil. They kill performance, in particular if you want to do things on GPUs. If you have a tiny model that's not a problem, but for a full industrial-scale problem this matters.
  • Custom operators are something that can be done efficiently. For more information check out the NNVM and TVM runtime that we just released on Friday. Adding ONNX support for it was a few hundred lines of code. So yes, you could probably compile such operators and use them. My personal hunch, though, is that if you're using this feature a lot, you might want to think about changing your programming model a bit (see above) by trying to offload the bulk of the computation onto a small number of highly optimized compute kernels.
  • For very large problems you probably need an efficient way of updating parameters in a distributed setting. That’s where the parameter server comes in handy (look at Mu Li’s paper for more details). FWIW - this was the precursor to the distributed parts of MxNet.
  • Lastly, some of the questions about the programming model are pretty much MxNet.symbol focused. There’s also an imperative API - Gluon. It’s quite flexible and by now most of the relevant bits have been pushed back into C/C++ (some brilliant engineer even ported it to Perl …).
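As an aside, the Alias trick mentioned in the first bullet fits in a few lines of NumPy (this is Vose's variant; build_alias and sample_alias are just illustrative names):

import numpy as np

def build_alias(probs):
    # Vose's alias method: O(n) setup for O(1) draws from a discrete distribution.
    n = len(probs)
    prob = np.asarray(probs, dtype=float) * n
    alias = np.zeros(n, dtype=int)
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                    # s's leftover mass is covered by l
        prob[l] += prob[s] - 1.0
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:             # numerical leftovers get probability 1
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias, rng=np.random):
    # Pick a column uniformly, then flip its biased coin.
    i = rng.randint(len(prob))
    return i if rng.rand() < prob[i] else alias[i]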

In summary, I would suggest that this is an opportunity to add functionality to PyMC rather than just porting it over to a new framework. After all, one of the big reasons deep learning has taken over big time is that Bayesian statistics at scale is hard: you get to write a PhD thesis by implementing 3 algorithms well, whereas in deep learning you might get that done in a month, thanks to good tooling. Now for the specific questions. Replies are inline (and apologies for maybe not having grokked PyMC in sufficient detail - I only went through the tutorials and overview). I removed the ones that are implicitly answered above.

We do not build models ourselves, but provide building blocks and inference algorithms for users. Users would use both our library and mxnet in their models. Not all our users come from a computer science background; many come from other fields in science and business.

That's actually not so different from how MxNet works. As for loop constructs, this is something we should talk about. Right now this works through Python, but it would be reasonable to add them explicitly.

  • Q It is important that users get good error messages for misspecified models that point to the correct source of the problem. If you define symbols with incompatible shapes and add them, you currently don't get an exception until you call bind or infer_shape (i.e. mx.sym.var('a', shape=(2,)) + mx.sym.var('b', shape=(3,)) doesn't raise an exception).
  • A This is actually a feature - basically we want to make sure that it's really hard for our users to screw up. That is, if you have a 200-layer deep network (they exist) with convolutions and lots of other stuff, you really don't want to have to modify all parameters when you decide to edit things on layer 42. That would be a very rich source of errors. What you might consider is whether you couldn't actually borrow the notion of automatic shape inference for PyMC. This makes it easier on the users. For instance, if you have y ~ Logistic(Mx) and you know the number of classes in y, then you won't really need to bind M until you instantiate the data.
  • Q We couldn't find a way to create a modified copy of a symbol. Theano allows us to replace inputs or parts of the graph with other subgraphs (the replace argument of theano.clone). This gives us a lot of flexibility to switch between inference algorithms for a given model.
  • A We should discuss this in more detail. It’s a good suggestion. That said, if you use the imperative setting and you only have a small number of such pieces, you could simply use different functions for each of the components. This would at least be a principled workaround. As in f(g(x)) and h(g(x)) would allow you to do both.
  • Q There doesn't seem to be support for constants. In large parts of our code base each value can be either a parameter or a constant depending on how it is used. We can't easily represent constant values as symbols to make calls like mx.sym.broadcast_add(a, b) possible independent of whether a and b are symbols, Python scalars or numpy arrays. Similarly, we and our users can't add or multiply symbols by numpy arrays. Would it be possible to implement something like mx.sym.constant, which returns a symbol with auxiliary data, such that this auxiliary data has a default value?
  • A I'm a bit confused here. It's quite straightforward to determine which variables you want to compute gradients with regard to (I take it that this is pretty much why you care). Basically you only attach gradients to the variables you care about, and you get the entire compute graph. Furthermore, you can always detach parts of the compute graph, thus making the relevant variable a constant (e.g. look at the LSTM example in the gluon tutorial - for BPTT you detach the ancestors). A minimal sketch follows this list.
  • Q Some special functions like erf or the incomplete gamma and beta functions are missing at the moment.
  • A Yes we need them. Adding them would be most welcome (we didn’t add them since we didn’t have a lot of people asking - bit of a chicken & egg problem).
  • Q We can’t live without fancy indexing.
  • A Understood. Several have asked. We need to add it.
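To illustrate the attach/detach approach described above, here is a minimal NDArray/autograd sketch (toy values, not PyMC code):

import mxnet as mx
from mxnet import nd, autograd

x = nd.array([1.0, 2.0])
x.attach_grad()               # we want gradients with regard to x
c = nd.array([0.5, 0.5])      # no attach_grad: effectively a constant

with autograd.record():
    frozen = (x * x).detach() # detached: treated as a constant downstream
    y = (c * x + frozen).sum()

y.backward()
print(x.grad)                 # only the non-detached path contributes, so this equals c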
3 Likes

I think for Bayesian statistics it would make more sense to use Gluon instead of mxnet/module.

With gluon and HybridBlock you get easy development and debugging as well as low overhead.

Fancy indexing exists in an incomplete form for now (although you can hack it with nd.arange and nd.gather_nd; a sketch of that hack is below). We are working on full support now; it should be there in a week or two.
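The gather_nd hack for one-dimensional fancy indexing looks roughly like this:

import mxnet as mx
from mxnet import nd

data = nd.array([10, 20, 30, 40, 50])

# emulate the fancy indexing data[[0, 2, 4]]:
idx = nd.array([[0, 2, 4]])     # the leading axis of idx indexes into data's axes
print(nd.gather_nd(data, idx))  # -> [10. 30. 50.]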

We can have a call if you want.

3 Likes

@smolix Thank you for the detailed response!
@piiswrong A call would be great. I am at GMT+1; most of the PyMC team is scattered across the USA and Europe. Feel free to propose a time and date that's convenient for you :slight_smile:

About the constants

I guess I wasn't particularly clear here. There isn't an issue preventing us from doing what we want; it is more a matter of convenience, performance and beauty. Consider a simple model:

val1 = Beta(alpha=1, beta=np.array(0.5))  # Python scalar and numpy array as parameters
beta = val1 + np.array([0.5, 0.6])        # symbol plus numpy array
val2 = Beta(alpha=val1, beta=beta, observed=np.array([0.9, 0.8]))  # symbols plus observed data

Somewhere, we need to implement the log density of the beta distribution. Each parameter, and also the value, could be a symbol, a Python scalar or a numpy array. For a distribution with n parameters that gives us 3**(n+1) possible combinations of input types, each with a different API. In Theano, we can simply cast each parameter to a Theano variable: if it is a numpy array or a Python scalar, it is wrapped in a constant, which behaves like any other Theano variable. Additionally, Theano performs constant propagation during graph optimisation, so that computations done purely on constants only happen once, during compilation.
The way I'd do that in mxnet right now (using my best current understanding of the library, which I still don't know particularly well) would be to create a new parameter/symbol for each input, block the gradient if it is a constant, and manually keep track of which of those inputs are actual parameters that we want the posterior for and which are just constants (see the sketch below). This should work, but it would clutter the inputs of the model, and also make any future constant-related optimisations in the graph impossible. MxNet wouldn't even know that those values are constant.
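A rough sketch of that workaround (as_symbol and beta_logp are made-up names here; BlockGrad is used to stop gradients at the constants):

import numpy as np
import mxnet as mx

def as_symbol(value, name, constants):
    # Real parameters pass through; scalars and arrays become
    # gradient-blocked symbols whose values are recorded for bind time.
    if isinstance(value, mx.sym.Symbol):
        return value
    arr = mx.nd.array(np.atleast_1d(value).astype('float32'))
    constants[name] = arr                # remember the value for bind()
    var = mx.sym.var(name, shape=arr.shape)
    return mx.sym.BlockGrad(var)         # no gradient flows into constants

def beta_logp(value, alpha, beta, constants):
    # Unnormalised Beta log density that accepts any mix of input types.
    value = as_symbol(value, 'value', constants)
    alpha = as_symbol(alpha, 'alpha', constants)
    beta = as_symbol(beta, 'beta', constants)
    return mx.sym.broadcast_add(
        mx.sym.broadcast_mul(alpha - 1, mx.sym.log(value)),
        mx.sym.broadcast_mul(beta - 1, mx.sym.log(1 - value)))

At bind time the recorded constants would then be supplied alongside the real parameters, which is exactly the clutter described above.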
Conceptually, I think it makes a lot of sense to treat a constant as an Op that doesn't have any inputs. The actual value can be stored as auxiliary data. A design like that would prevent the cluttering of the inputs, keep the raw graph (exported json!) clean and, at least in principle, allow constant-related optimizations (read-only aux data that is known when bind is called).

Performance, big data, GPUs…

I indeed hope that switching to a larger deep learning library can allow us to scale to larger models and datasets. However, this isn't necessarily our main focus. PyMC and Stan shine with complicated models for small or medium-sized datasets, say up to ~50,000 parameters (not that the number of parameters is actually a good measure of how hard the problem is…). A huge number of real-world problems fall into this category, often with computational and algorithmic difficulties. I realise that many things we do don't work well on GPUs (scalars!), but this isn't so much a question of the programming model as a question of which problems we are trying to solve. It is my impression that scaling Bayesian methods to larger models isn't so much about computational efficiency or hardware utilisation: we (speaking for the whole of humanity here :wink:) simply don't know any algorithms that allow us to do so reliably. With NUTS we have a tool that works well for a very wide range of models without a lot of human intervention, scales reasonably with the dimensionality, approximates the actual posterior arbitrarily well, and has useful diagnostics that tell us when it doesn't work. We don't have anything close to that for very large models.
All that being said, I completely agree that this could be an opportunity to extend pymc, and your list of methods is interesting, thanks.

Shape inference

I haven't thought about it much yet, but I don't think automatic shape inference will work well for the average PyMC model. We do this a bit as it is now, and it is a constant source of trouble. Some shape information is definitely needed; sometimes the shape just can't be inferred automatically (mostly due to broadcasting):

# both of these are valid, but have different meanings
mu = Normal(mu=0, sd=1, shape=(2, 3))  # a (2, 3) array of independent normals
mu = Normal(mu=0, sd=1, shape=())      # a single scalar normal

Normal(mu=mu, sd=1, observed=data)     # broadcasts either way, so mu's shape can't be inferred

Gluon vs mx.symbol

The reason I've mostly been looking into mx.symbol is that it seemed to give us more control. For example, when implementing Hamiltonian sampling, we'd like to view all parameters together as one vector. With the symbol API we can do that easily:

import numpy as np
import mxnet as mx

ctx = mx.cpu()

a = mx.sym.var('a', shape=(1,))
b = mx.sym.var('b', shape=(2,))

# one flat vector backing all parameters
params = mx.nd.zeros(3)
params[:] = np.array([0.1, 0.2, 0.3])

# slices along the first axis share memory with params
a_ = params[:1]
b_ = params[1:]
# broadcast_add, since the shapes differ
func = mx.sym.broadcast_add(a, b).bind(ctx, {'a': a_, 'b': b_})

params[:] = np.array([1.0, 2.0, 3.0])  # set all parameters together (same possible for grad)
func.forward()

From my preliminary benchmarks this seems to save a lot of time when we have many small parameters.

But looking at the gluon API a bit, I guess this might still be possible with it, since we can get access to the symbols regardless.

1 Like

Some details I just wanted to add to @aseyboldt's comment: as the name suggests, PyMC3 has mostly focused on MCMC in the past (we have some great new development in variational inference as well). The general (classical) framework is that we generate a log-likelihood function from some model container, and for modern MCMC methods (e.g., NUTS) we also want to take the derivative of the logp function (higher-order derivatives would be great too). So instead of caring about drawing random samples fast, we care more about the speed of evaluating the logp function and its derivative (a toy sketch of this pattern follows).
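To make that concrete, the basic pattern looks like this (a toy unnormalised normal log density, using MXNet's autograd):

import mxnet as mx
from mxnet import nd, autograd

x = nd.array([0.5, -1.0])
x.attach_grad()

with autograd.record():
    logp = -0.5 * (x ** 2).sum()  # unnormalised standard-normal log density

logp.backward()
print(logp.asscalar(), x.grad)    # logp and dlogp/dx, which is what NUTS consumes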

Imperative (gluon) is a strict superset of symbol. So anything you can do in symbol can be done in gluon with NDArray. And you can make part of the graph symbolic with hybridize if it doesn't use NDArray-specific features (see the sketch below).
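A tiny sketch of what hybridize means in practice (F resolves to mx.nd when run imperatively and to mx.sym once hybridized):

import mxnet as mx
from mxnet import gluon, nd

class Square(gluon.HybridBlock):
    def hybrid_forward(self, F, x):    # F is mx.nd or mx.sym
        return F.square(x)

net = Square()                         # no parameters here, so no initialize() needed
net.hybridize()                        # later calls build and cache a symbolic graph
print(net(nd.array([1.0, 2.0, 3.0]))) # -> [1. 4. 9.]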

2 Likes

In other words, if it’s in mxnet.symbol or in PyTorch or Chainer, it’s probably also in Gluon …

2 Likes

Thanks for the discussion!

Agree with a lot of this, especially @smolix – I just implemented variational inference for a set of deep generative models (deep exponential families), and it was a lot easier with gluon.

The sparse support didn't make it faster for the problem I was considering. And I failed at trying to hybridize it - I think I would need to learn about the inner workings of gluon.HybridBlock to get that to work properly (it seemed too complicated for now).

And per-sample gradients are a necessity but not supported (https://github.com/apache/incubator-mxnet/issues/7987)

Excited to see gluon evolve + happy to discuss further :slight_smile:

2 Likes

Hi, thanks for taking the time for that meeting, by the way; it was fun :slight_smile:
One thing I forgot to mention during the meeting is support for constant values. I wrote about this here before, but I guess it got lost a bit. Just wondering if you have any thoughts about that (see also the separate thread about a possible design).

Since I wrote this, I also noticed that you seem to have run into this issue yourselves while implementing random draws from different distributions: the functions seem to assume that all parameters are of the same type. This is exactly the issue I am trying to work around.

Yeah, that's possible. We don't even need an aux. It's basically an operator that outputs the same array every time.

For now you can pretty easily make one with mx.operator.CustomOp (rough sketch below). Will let you know when we can make this built in.
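For reference, an untested sketch of such a constant operator (assuming custom ops without inputs are accepted; the 'const' registration name is arbitrary):

import numpy as np
import mxnet as mx

class Const(mx.operator.CustomOp):
    def __init__(self, value):
        super(Const, self).__init__()
        self.value = value

    def forward(self, is_train, req, in_data, out_data, aux):
        self.assign(out_data[0], req[0], mx.nd.array(self.value))

    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        pass                             # no inputs, so nothing to propagate

@mx.operator.register('const')
class ConstProp(mx.operator.CustomOpProp):
    def __init__(self, value):
        super(ConstProp, self).__init__(need_top_grad=False)
        self.value = np.array(eval(value))  # custom-op attributes arrive as strings

    def list_arguments(self):
        return []                           # a constant has no inputs

    def infer_shape(self, in_shape):
        return [], [self.value.shape], []

    def create_operator(self, ctx, shapes, dtypes):
        return Const(self.value)

# usage: const = mx.sym.Custom(op_type='const', value='[1.0, 2.0]')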
