MXNet on YARN/Spark


Has anyone used MXNet on YARN/Spark for distributed training?
I searched online but only found one Scala example on Spark, and an AWS guide for distributed inference.
I want to use Python/PySpark for distributed training (CPU only is fine). Has anyone tried this?


I have not tried this personally. Conceptually, however, I don’t see why it cannot be done, especially using the Gluon interface (rather than the Module interface shown in the AWS article you referenced). You would basically do one forward/backward pass on each node. You do, however, need to take care of summing the gradients over all nodes. Once the gradients are accumulated, you can do one optimization step and then repeat. Please post here if you successfully implement this.
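To make that loop concrete, here is a minimal sketch in plain Python. It only simulates the idea: each "worker" is just a list of data points, the model is a toy linear fit, and none of this is MXNet or Spark API. The names `local_gradient`, `sum_gradients` and `train` are made up for illustration. In a real setup I would assume the per-shard step becomes a Gluon forward/backward on each node and the summation becomes whatever reduce mechanism the cluster provides.

```python
# Plain-Python simulation of synchronous data-parallel SGD.
# No MXNet or Spark calls are used; each "worker" is just a data shard.

def local_gradient(w, b, shard):
    """One forward/backward pass on one worker's shard for y = w*x + b
    with squared-error loss. Returns summed (not averaged) gradients."""
    grad_w, grad_b = 0.0, 0.0
    for x, y in shard:
        err = (w * x + b) - y      # forward pass and residual
        grad_w += 2.0 * err * x    # d(err^2)/dw
        grad_b += 2.0 * err        # d(err^2)/db
    return grad_w, grad_b


def sum_gradients(grads):
    """Sum the per-worker gradients (the 'gather' step)."""
    total_w = sum(g[0] for g in grads)
    total_b = sum(g[1] for g in grads)
    return total_w, total_b


def train(shards, lr=0.05, steps=200):
    """Repeat: local gradients on every shard, sum them, one optimizer step."""
    w, b = 0.0, 0.0
    n = sum(len(s) for s in shards)
    for _ in range(steps):
        # On a cluster, this list comprehension would run in parallel,
        # one call per node; here it runs sequentially.
        grads = [local_gradient(w, b, s) for s in shards]
        gw, gb = sum_gradients(grads)
        # Single SGD step using the gradient averaged over all samples.
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Toy data for y = 3x + 1, split across three hypothetical workers.
data = [(float(x), 3.0 * x + 1.0) for x in range(-4, 5)]
shards = [data[0:3], data[3:6], data[6:9]]
w, b = train(shards)
print(w, b)
```

The point of the sketch is that summing the per-shard gradients gives the same result as computing the gradient over the full dataset on one machine, so the optimizer step itself does not need to know the data was split.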


@brillwang it can be used for distributed inference (though in my testing, batch GPU inference blows it away for anything nontrivial), but for distributed training you’d need some way to gather the gradients, and Spark is pretty slow for anything that’s not batch data. You’re almost always better off with a couple of huge machines with GPUs than a large cluster with CPUs for this kind of thing anyway.