Does anybody really need model parallelism?

#1

I’m wondering if anybody really uses model parallelism on modern GPUs. It looks like you can always fit a model on one GPU and use data parallelism to increase the batch size if required.

There are also several techniques for reducing memory consumption so that the model fits on one GPU:

  • Use FP16.
  • Reduce the number of parameters/layers in the model.
  • Reduce the size of the input, e.g. the number of input features.
  • Reduce batch size and do data parallelism.
  • Decompose the problem into submodels that can be trained separately.
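
To put a rough number on the FP16 item above, here is a minimal back-of-the-envelope sketch of weight storage at different precisions. The parameter count is a made-up example, and the helper name `param_memory_mib` is hypothetical (it is not part of any library).

```python
# Rough estimate of the memory needed to store model weights only.
# NUM_PARAMS is a hypothetical figure chosen for illustration.

def param_memory_mib(num_params, bytes_per_param):
    """Return the size of the weights in MiB for a given precision."""
    return num_params * bytes_per_param / (1024 ** 2)

NUM_PARAMS = 100_000_000  # assumed model size, not taken from any real network

fp32_mib = param_memory_mib(NUM_PARAMS, 4)  # FP32 uses 4 bytes per value
fp16_mib = param_memory_mib(NUM_PARAMS, 2)  # FP16 uses 2 bytes per value

print(f"FP32 weights: {fp32_mib:.1f} MiB")
print(f"FP16 weights: {fp16_mib:.1f} MiB")
```

This counts the weights only. Gradients, optimizer state and activations come on top of it, and activation memory grows with batch size and input size, so halving the weight precision does not necessarily halve the total footprint.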

What are the real use cases (if any) for model parallelism? If there are none, should we remove this FAQ on model parallelism so that we don’t confuse users (especially beginners) about whether to use data or model parallelism? The answer seems to be: always data parallelism.

#2

I’m not convinced that your compression/training strategies can eliminate the need for model parallelism. According to the FAQ, large speech applications still need model parallelism, and HD 3D images/videos are becoming more common nowadays. So it wouldn’t be a strategic move to forget about model parallelism.

Regarding batch size, there is still a lot that isn’t known! For example, compare the seemingly contradictory results on small vs. big batch sizes. Also note that if you increase data parallelism (4 GPUs, for example), you might need to train for more epochs (than with 2 GPUs) to reach the accuracy of the less data-parallel setup (i.e. there are practical time-accuracy tradeoffs).

#3

Hi @indu,

I am currently exploring this option. The reason is that I have a medium-size network (~60M params) for semantic segmentation that needs both 2D and 3D convolutions (spatio-temporal data). The 3D part is the problem: it has a very large memory footprint (even though I’ve decomposed the 3D convolutions into 1 time + 2 space, in the spirit of MobileNet convolutions). As a result, if I want to use 256x256 chips, I cannot fit more than 2 data points per GPU. Unfortunately, for BatchNormalization to work properly I need a large batch size per GPU, at least 32 in my experiments, to have stable training (this is because batch norm is not synchronous across GPUs, and the sync version provided in the contrib package is very slow). So I am applying data parallelism (using 4 GPUs simultaneously) with delayed gradients to increase the batch size. Alternatively, I can reduce the size of the chips to 128x128 to make my model work, but then I am losing context information that helps the segmentation.
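
For readers unfamiliar with delayed gradients (gradient accumulation), here is a minimal sketch of the idea on a toy one-parameter linear model. It uses plain Python rather than any MXNet API, and all names (`grad_mse`, `accumulated_grad`, the data) are hypothetical. It only illustrates that summing gradients over small micro-batches before updating gives the same gradient as one large batch.

```python
# Toy illustration of gradient accumulation ("delayed gradients").
# Model: prediction = w * x, loss = 0.5 * (w * x - y)^2 averaged over the batch.

def grad_mse(w, batch):
    """Gradient of the batch-averaged loss with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Process the data in small micro-batches and accumulate their gradients.

    Each micro-batch gradient is weighted by its size, so the result matches
    the gradient of a single batch containing all samples.
    """
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        total += grad_mse(w, micro_batch) * len(micro_batch)
    return total / len(data)

# 32 samples, processed 2 at a time (the per-GPU limit mentioned above).
data = [(float(i), 2.0 * i + 1.0) for i in range(32)]
w = 0.5

full = grad_mse(w, data)                              # one batch of 32
acc = accumulated_grad(w, data, micro_batch_size=2)   # 16 micro-batches of 2

print(abs(full - acc) < 1e-9)
```

Note that this only enlarges the effective batch for the weight update. Batch-norm statistics are still computed on each micro-batch separately, so accumulation alone does not give batch norm a larger batch to work with.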

From talking with medical imaging colleagues within CSIRO, one of their biggest problems in 3D segmentation (when they process MRI scans etc.) is memory footprint. I think mxnet, especially now that the gluon API keeps gaining attention, can benefit from this by pioneering ease of use in model parallelism.

All the best
