Does anybody really need model parallelism?


I’m wondering if anybody really uses model parallelism on modern GPUs. It looks like you can almost always fit a model on a single GPU and use data parallelism to increase the batch size if needed.

Also, there are these techniques to reduce the memory consumption to fit the model in one GPU:

  • Use FP16.
  • Reduce the number of parameters/layers in the model.
  • Reduce the size of the input, e.g. the number of input features.
  • Reduce the batch size and use data parallelism.
  • Decompose the problem into submodels that can be trained separately.
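As a back-of-the-envelope illustration of the FP16 bullet above, here is a rough sketch of the weight-memory savings. The parameter count is hypothetical, and real training needs much more memory (gradients, optimizer state, activations), so this only bounds the weights themselves:

```python
# Rough parameter-memory estimate for a hypothetical 1.5B-parameter model.
# FP32 stores each parameter in 4 bytes, FP16 in 2 bytes, so switching
# to FP16 halves the memory needed for the weights.

def param_memory_gib(num_params, bytes_per_param):
    """Memory needed just to store the weights, in GiB."""
    return num_params * bytes_per_param / 2**30

NUM_PARAMS = 1_500_000_000  # hypothetical model size

fp32_gib = param_memory_gib(NUM_PARAMS, 4)
fp16_gib = param_memory_gib(NUM_PARAMS, 2)

print(f"FP32 weights: {fp32_gib:.1f} GiB")  # ~5.6 GiB
print(f"FP16 weights: {fp16_gib:.1f} GiB")  # ~2.8 GiB
```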

What are the real use cases (if any) for model parallelism? If there are none, should we remove this FAQ on model parallelism so that we don’t confuse users (especially beginners) about whether to use data or model parallelism? The answer seems to be: always use data parallelism.


I’m not convinced that your compression/training strategies let us eliminate model parallelism. As the FAQ notes, large speech-style applications still need model parallelism, and HD 3D images/videos are becoming more common nowadays. So it wouldn’t be a strategic move to forget about model parallelism.
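To make concrete what model parallelism buys you when the weights alone won’t fit on one device, here is a minimal toy sketch (plain Python, hypothetical sizes, no real devices) of splitting one linear layer’s weight matrix column-wise across two workers, tensor-parallel style, so each worker only ever holds half the parameters:

```python
# Minimal sketch of tensor (model) parallelism: a linear layer y = x @ W
# whose weight matrix W is split column-wise across two workers. Each
# worker computes its output shard independently; concatenating the
# shards reproduces the unsharded layer's output exactly.

def matmul(x, W):
    """x: length-n vector, W: n x m matrix (list of rows) -> length-m vector."""
    return [sum(x[i] * W[i][j] for i in range(len(x)))
            for j in range(len(W[0]))]

def split_columns(W, parts):
    """Split W column-wise into `parts` equal shards."""
    step = len(W[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

# Toy weight matrix (4 inputs, 4 outputs) -- pretend it is too big for
# one device's memory.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1.0, 0.5, -1.0, 2.0]

shards = split_columns(W, 2)
partials = [matmul(x, shard) for shard in shards]  # one per worker
y_parallel = partials[0] + partials[1]             # concatenate outputs

assert y_parallel == matmul(x, W)  # identical to the unsharded layer
print(y_parallel)
```

In a real framework the two shards would live on different GPUs and the concatenation would be a collective communication step; the point here is only that the math decomposes cleanly while per-worker memory halves.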

Regarding batch size, there are still open questions. For example, compare the seemingly contradictory results on small vs. large batches. Also note that if you increase data parallelism (4 GPUs, for example), you might need to train for more epochs (than with 2 GPUs) to reach the accuracy of a less data-parallel setup (i.e. there are practical time-accuracy tradeoffs).
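To put a number on that tradeoff, here is a quick sketch (dataset size and per-GPU batch size are hypothetical) of how the effective batch size grows and the optimizer steps per epoch shrink as you add GPUs:

```python
# With data parallelism, the effective batch size scales with GPU count,
# so each epoch takes fewer optimizer steps -- one reason more epochs can
# be needed to reach the same accuracy. Numbers are hypothetical.

DATASET_SIZE = 100_000
PER_GPU_BATCH = 32

def steps_per_epoch(dataset_size, per_gpu_batch, num_gpus):
    """Full optimizer steps per epoch at a given degree of data parallelism."""
    return dataset_size // (per_gpu_batch * num_gpus)

for num_gpus in (2, 4):
    effective_batch = PER_GPU_BATCH * num_gpus
    steps = steps_per_epoch(DATASET_SIZE, PER_GPU_BATCH, num_gpus)
    print(f"{num_gpus} GPUs: effective batch {effective_batch}, "
          f"{steps} steps/epoch")
```

Doubling the GPU count halves the number of gradient updates per epoch, which is exactly the sort of tradeoff the reply is pointing at.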