I’m wondering if anybody really uses model parallelism on modern GPUs. It seems you can always fit a model on a single GPU, and use data parallelism to increase the effective batch size if required.
Also, there are several techniques to reduce memory consumption so that the model fits on one GPU:
- Use FP16 (half precision).
- Reduce the number of parameters/layers in the model.
- Reduce the size of the input, e.g. the number of input features.
- Reduce batch size and do data parallelism.
- Decompose the problem into submodels that can be trained separately.
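To make the FP16 point concrete, here is a back-of-the-envelope estimate of how much memory a model's weights take at different precisions. The parameter count and GPU size below are illustrative assumptions, not from any specific model:

```python
# Rough memory estimate: do a model's weights fit on one GPU?
# All sizes here are hypothetical, for illustration only.

def params_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory for the parameters alone. Note that gradients,
    optimizer state, and activations add a large multiple on top."""
    return n_params * bytes_per_param / 1024**3

n_params = 7_000_000_000                 # e.g. a 7B-parameter model
fp32 = params_memory_gb(n_params, 4)     # ~26 GB: over a 24 GB card
fp16 = params_memory_gb(n_params, 2)     # ~13 GB: weights fit, but
                                         # training state may still not
print(f"FP32 weights: {fp32:.1f} GB, FP16 weights: {fp16:.1f} GB")
```

This also hints at the limit of the list above: once weights alone exceed the largest available GPU even in FP16, the remaining options are sharding the model (model parallelism) or redesigning it.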
What are the real use cases (if any) for model parallelism? If there are none, should we remove this FAQ on model parallelism so that we don’t confuse users (especially beginners) about whether to use data or model parallelism? The answer seems to be: always use data parallelism.