Model Parallelism with Hybrid Blocks

I have a model that leverages model parallelism in the Module/Symbol API that I am porting to Gluon. My question relates to best practices with HybridBlocks and model parallelism. The only example of model parallelism with Gluon I have found is here, which is referenced in various StackOverflow discussions of Gluon and model parallelism. In the example notebook, Gluon Trainer objects are initialized and updated separately for each independent Block object.

The tutorial specifically does not leverage the speed-ups available from HybridBlocks and hybridization. In a similarly structured model-parallel design with HybridBlocks, I would end up with multiple cached sub-graphs, rather than one large graph representing the model-parallel pieces. This is in contrast to the Module API, where I could use mx.AttrScope(ctx_group=...) with the Symbol API to get a single model-parallel graph with context assignments.
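For context, this is roughly the Symbol/Module pattern I am referring to (layer names, sizes, and device assignments are just placeholders):

```python
import mxnet as mx

# Build one graph, tagging each section with a ctx_group attribute.
data = mx.sym.Variable('data')
with mx.AttrScope(ctx_group='dev1'):
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
    act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
with mx.AttrScope(ctx_group='dev2'):
    fc2 = mx.sym.FullyConnected(data=act1, num_hidden=10, name='fc2')
    net = mx.sym.SoftmaxOutput(data=fc2, name='softmax')

# A single Module binds the whole graph, mapping each ctx_group to its device.
mod = mx.mod.Module(symbol=net,
                    context=mx.cpu(),  # default context for unassigned ops
                    group2ctxs={'dev1': mx.gpu(0), 'dev2': mx.gpu(1)})
```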

What is the best practice for model parallelism in Gluon with HybridBlocks, in terms of Trainer objects and minimizing the number of distinct cached sub-graphs? Is it possible to use a single Trainer object and call .as_in_context() within HybridBlocks to move layer outputs across GPUs while retaining the path dependency of the network for backprop?

For the multi-layer LSTM case, it can easily be changed to use a HybridBlock for each LSTM layer. This way, we have a cached sub-graph on each of the GPUs.
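Something along these lines (a sketch only, assuming two GPUs and an MXNet version where gluon.rnn.LSTM is hybridizable; shapes, hyper-parameters, and the L2 loss are placeholders):

```python
import mxnet as mx
from mxnet import gluon, autograd

ctx1, ctx2 = mx.gpu(0), mx.gpu(1)
batch_size, seq_len, num_hidden = 32, 35, 256

# One LSTM layer per GPU; each hybridized layer caches its own sub-graph.
lstm1 = gluon.rnn.LSTM(num_hidden, layout='NTC')
lstm2 = gluon.rnn.LSTM(num_hidden, layout='NTC')
lstm1.initialize(ctx=ctx1)
lstm2.initialize(ctx=ctx2)
lstm1.hybridize()
lstm2.hybridize()

# One Trainer per block, as in the model-parallel example notebook.
trainer1 = gluon.Trainer(lstm1.collect_params(), 'sgd', {'learning_rate': 0.1})
trainer2 = gluon.Trainer(lstm2.collect_params(), 'sgd', {'learning_rate': 0.1})

x = mx.nd.random.uniform(shape=(batch_size, seq_len, num_hidden), ctx=ctx1)
y = mx.nd.random.uniform(shape=(batch_size, seq_len, num_hidden), ctx=ctx2)
l2 = gluon.loss.L2Loss()

with autograd.record():
    h1 = lstm1(x)                        # computed on GPU 0
    h2 = lstm2(h1.as_in_context(ctx2))   # copied to GPU 1, computed there
    loss = l2(h2, y)
loss.backward()                          # gradients flow back across the copy
trainer1.step(batch_size)
trainer2.step(batch_size)
```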

HybridBlock is usually most useful when you have a complicated network, since hybridization can perform a lot of memory optimizations. Memory optimization across different GPUs, however, is not easy to do, so having a separate HybridBlock per GPU looks fine to me.

However, the optimization for model parallelism heavily depends on the network structure itself and on how you parallelize the computation. What does your network look like at a high level?

Thanks for the response, Haibin!

My original inclination was to create a single HybridBlock containing the entire network across multiple GPUs, using .as_in_context() to synchronize across GPUs; this is not currently supported under the Symbol API. The goal was to have one Trainer object for the entire model, rather than a Trainer object per GPU, so that training behaves identically to training on a single device. In a multi-LSTM model with two GPUs and two LSTM layers, the idea would be to have a single HybridBlock with child blocks LSTM1 and LSTM2. LSTM1 would be attached to context1 (and the input data would live on context1). The return value from the child LSTM1 hybrid_forward would be something like lstm1_output.as_in_context(context2), which in the cached graph is a Symbol object. LSTM2 would then be applied on context2.
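To make the idea concrete, here is roughly what I have in mind (placeholder shapes and names; it runs imperatively, where F is mx.nd, but breaks once hybridized because mx.sym has no as_in_context):

```python
import mxnet as mx
from mxnet import gluon

class TwoGPULSTM(gluon.HybridBlock):
    """Rough sketch of the single-block design described above."""
    def __init__(self, num_hidden, ctx2, **kwargs):
        super(TwoGPULSTM, self).__init__(**kwargs)
        self.ctx2 = ctx2
        with self.name_scope():
            self.lstm1 = gluon.rnn.LSTM(num_hidden, layout='NTC')
            self.lstm2 = gluon.rnn.LSTM(num_hidden, layout='NTC')

    def hybrid_forward(self, F, x):
        h1 = self.lstm1(x)                # computed on context1
        h1 = h1.as_in_context(self.ctx2)  # NDArray-only; Symbol has no equivalent
        return self.lstm2(h1)             # computed on context2

net = TwoGPULSTM(256, mx.gpu(1))
net.lstm1.initialize(ctx=mx.gpu(0))
net.lstm2.initialize(ctx=mx.gpu(1))
# One Trainer for the whole model, covering parameters on both GPUs.
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
# net.hybridize()  # this is where it breaks: Symbol has no as_in_context()
```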

For HybridBlocks, after calling .hybridize(), this will not currently work because .as_in_context() is not implemented in the Symbol API. Would this theoretically work if a Symbol.as_in_context() method were implemented? Would it cause unnecessary memory overhead (i.e., caching the full multi-GPU graph on each GPU)?

At a high level, I am interested in a network where each individual layer is split across multiple GPUs with synchronization after each layer.

Hi @swe, thanks for filling in the details. You’re right that as_in_context is not supported in Symbol, due to its cross-context nature, and I don’t see this function being implemented in the near future, before MXNet 2.0.

With regard to Trainer, you’re right that multiple Trainers are required in that example. Did you find any problem using one Trainer per layer even when there is a single context? For now I don’t see any problem with using the same code for both single-GPU and multi-GPU training. In fact, if you have one HybridBlock per GPU, the cached graph is only the sub-graph that belongs to that GPU.

For now I’d suggest sticking to the strategy adopted in the example code…