How to do multi-gpu training on public SageMaker gluon example?

olivcruche · November 5, 2018, 9:33am

Hi, I’m training this public gluon example on a p2.16xl notebook https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/gluon_recommender_system

I’m trying to adapt the notebook to run on multiple GPUs. To do this, I made the following changes:

1. Replaced ctx = mx.gpu() with ctx = [mx.gpu(i) for i in range(8)]
2. Replaced
user = user.as_in_context(ctx).reshape((batch_size,))
item = item.as_in_context(ctx).reshape((batch_size,))
label = label.as_in_context(ctx).reshape((batch_size,))
with
user = gluon.utils.split_and_load(user, ctx)
item = gluon.utils.split_and_load(item, ctx)
label = gluon.utils.split_and_load(label, ctx)

This throws the following error: AssertionError: HybridBlock requires the first argument to forward be either Symbol or NDArray, but got <class 'list'>

What am I missing?
Thanks

olivcruche · November 5, 2018, 5:54pm

Just got the answer from a colleague:
"You are correct, the output of split_and_load is a list. You need to iterate over it normally, and the asynchronous MXNet engine will take care of the parallelism in the background. For example:"

data_split = mx.gluon.utils.split_and_load(batch[0], ctx_list=ctx, batch_axis=0, even_split=False)
label_split = mx.gluon.utils.split_and_load(batch[1], ctx_list=ctx, batch_axis=0, even_split=False)
outputs = [(net(X), Y) for X, Y in zip(data_split, label_split)]
# loss = ...

ThomasDelteil · November 14, 2018, 6:20pm

Indeed, it is a list. There is a DataParallel model in gluoncv.utils.parallel that will hopefully make its way into the main gluon codebase and make this a lot simpler for the user.

@Hang_Zhang @zhreshold I can’t find it in the docs on gluon-cv.mxnet.io, but it’s in the code base?
