dmlc#2256 makes MXNet on Spark possible. It works on a stable Spark cluster, but problems arise when it is brought into a more complex environment, e.g., where executors may fail and be retried, or multiple tasks may be configured to run in one executor.
Related to issue dmlc#1637 @tqchen
Here’s a roadmap of the issues that may prevent MXNet on Apache Spark from being used in a production environment.
- KVStore worker (and scheduler/server) failover. I found that when a worker fails and restarts, it is not able to reconnect to the scheduler. Is this expected behavior for ps-lite? @mli
- Timeout for the ps scheduler and servers. The PS scheduler and servers are started in a spawned process, but when something goes wrong, e.g., the workers, servers, or scheduler crash, they have no chance to stop themselves. I think the simplest way to solve this is to set a timeout T: if the scheduler/server does not receive any message within T seconds, it stops itself.
- Support multiple ps-lite threads in one process. Currently ps-lite is a singleton, but Spark clusters can be configured to run multiple tasks in one executor (i.e., in one process).
- Make device resources transparent to the application. Instead of specifying which GPUs to use, users should only need to say how many GPUs they want. This is important for clusters that use Yarn for resource management.
- Upload and distribute MXNet core library to all the worker nodes automatically.
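For the failover item: one plausible client-side mitigation, independent of any ps-lite internals, is to retry the worker's connection to the scheduler with exponential backoff instead of giving up on the first failure. The sketch below is purely illustrative; `connect` stands in for whatever call actually establishes the scheduler connection and is not a real ps-lite API.

```python
import time

def reconnect_with_backoff(connect, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Hypothetical sketch: a restarted KVStore worker retrying its
    connection to the scheduler with exponential backoff. `connect` is a
    placeholder for the real connection call."""
    for attempt in range(max_retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # back off: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

Backoff alone will not fix the reconnect problem if the scheduler itself refuses re-registration, but it at least distinguishes transient network failures from the permanent failure described above.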
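The timeout idea for the scheduler/servers could look roughly like the following: a watchdog thread that is notified on every received message and terminates the process after T idle seconds. All names here (`IdleWatchdog`, `notify_message`) are hypothetical; the real fix would live inside the spawned ps-lite process.

```python
import threading
import time

class IdleWatchdog:
    """Hypothetical sketch: stop a ps-lite scheduler/server that has
    received no message for `timeout` seconds."""

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout  # e.g. sys.exit in the spawned process
        self._last_msg = time.monotonic()
        self._lock = threading.Lock()
        self._stopped = False
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def notify_message(self):
        # Call this whenever the scheduler/server receives any message.
        with self._lock:
            self._last_msg = time.monotonic()

    def _watch(self):
        # Poll at a fraction of the timeout so firing is reasonably prompt.
        while not self._stopped:
            time.sleep(min(self.timeout / 10.0, 1.0))
            with self._lock:
                idle = time.monotonic() - self._last_msg
            if idle >= self.timeout:
                self.on_timeout()
                return

    def stop(self):
        self._stopped = True
```

A heartbeat from the scheduler to the servers would make this more robust (an idle-but-healthy server would not kill itself), but even the plain idle timeout prevents orphaned processes from lingering forever.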
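On the singleton problem: one direction, sketched here with stand-in names only, is to scope the ps-lite state per executor thread rather than per process, so that multiple Spark tasks sharing one executor each get their own handle. `PSLiteHandle` below is a placeholder, not the real ps-lite state.

```python
import threading

class PSLiteHandle:
    """Stand-in for per-task ps-lite state (illustrative only)."""
    def __init__(self, task_id):
        self.task_id = task_id

_local = threading.local()

def get_handle(task_id):
    # One instance per executor thread instead of one per process, so
    # multiple Spark tasks in the same executor do not collide on shared
    # singleton state.
    if not hasattr(_local, "handle"):
        _local.handle = PSLiteHandle(task_id)
    return _local.handle
```

The real change is harder because ps-lite's singleton lives in native code, but the thread-local pattern shows the isolation the Spark side needs.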
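For device transparency, the intended behavior can be sketched as a small allocation step inside the framework: the user asks for a count, and concrete device ids come from whatever the resource manager (e.g. Yarn) reports as free. The function and its arguments are hypothetical.

```python
def assign_gpus(num_wanted, free_gpu_ids):
    """Hypothetical sketch: map a requested GPU count to concrete device
    ids. The user never hard-codes mx.gpu(0), mx.gpu(1), ...; the mapping
    stays inside the framework."""
    if num_wanted > len(free_gpu_ids):
        raise RuntimeError(
            "requested %d GPUs but only %d are free"
            % (num_wanted, len(free_gpu_ids)))
    return list(free_gpu_ids[:num_wanted])
```

With this in place, a task that asks for 2 GPUs on a node where devices 3, 5, and 6 are free would transparently be handed ids 3 and 5.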