Hi,
The group context can be used to implement model parallelism. A common scenario is to split a weight matrix into sub-matrices and distribute them across different GPUs. However, the resulting model is then also represented by sub-matrices distributed across different GPUs, so prediction requires multiple GPUs as well. Is there a way to consolidate the model by merging the split matrices, so that prediction can be done on a single GPU?
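
To make the question concrete, this is roughly what I mean by merging, written as a plain NumPy sketch rather than with the framework's own API. It assumes the weight matrix is split column-wise (along the output dimension) into equal parts, and `merge_column_shards` is a hypothetical helper name, not an existing function.

```python
import numpy as np


def merge_column_shards(shards):
    """Concatenate column-wise shards back into one weight matrix (hypothetical helper)."""
    return np.concatenate(shards, axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))       # batch of 4 inputs with 8 features
w_full = rng.standard_normal((8, 6))  # full weight matrix: 8 inputs, 6 outputs

# Model-parallel layout: each "GPU" holds a slice of the output columns.
shards = np.split(w_full, 2, axis=1)
y_parallel = np.concatenate([x @ w for w in shards], axis=1)

# Consolidated layout: merge the shards once, then predict on one device.
w_merged = merge_column_shards(shards)
y_single = x @ w_merged

# The two layouts should give the same predictions.
print(np.allclose(y_parallel, y_single))
```

If the split were along the input dimension instead, the partial outputs would be summed rather than concatenated, and the shards would be merged with `axis=0`.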