I have a model that leverages model parallelism in the Module/Symbol API, and I am porting it to Gluon. My question concerns best practices for model parallelism with HybridBlocks. The only example of model parallelism with Gluon I have found is here; it is referenced in several StackOverflow answers about Gluon and model parallelism. In the example notebook, separate Gluon Trainer objects are initialized and stepped for each independent Block.
Notably, the tutorial does not use HybridBlocks and hybridize() for the speedups they provide. In a similarly structured model-parallel design built from HybridBlocks, I would end up with multiple cached sub-graphs rather than one large graph spanning the model-parallel pieces. This is in contrast to the Module API, where I could use mx.AttrScope(ctx_group) with the Symbol API to get a single model-parallel graph with context assignments.
What is the best practice for model parallelism in Gluon with HybridBlocks, in terms of Trainer objects and minimizing the number of distinct cached sub-graphs? Is it possible to use a single Trainer object and call .as_in_context() within HybridBlocks to move layer outputs across GPUs while retaining the path dependency of the network for backprop?