dmlc#2256 makes MXNet on Spark possible. It works on a stable Spark cluster, but problems arise when it is brought into a more complex environment, e.g., where executors may fail and be retried, or multiple tasks may be configured to run in one executor.
Related to issue dmlc#1637 @tqchen
Here’s a roadmap of the issues that may prevent MXNet on Apache Spark from being used in a production environment.
- KVStore worker (and scheduler/server) failover. I found that when a worker fails and restarts, it is not able to reconnect to the scheduler. Is this expected behavior for ps-lite? @mli
- Timeout for the ps scheduler and servers. The PS scheduler and servers are started in a spawned process, but when something goes wrong, e.g., the workers, servers, or scheduler crash, they have no chance to stop themselves. I think the simplest way to solve this is to set a timeout T: if the scheduler/server does not receive any message within T seconds, it stops itself.
- Support multiple ps-lite threads in one process. Currently ps-lite is a singleton, but Spark clusters can be configured to run multiple tasks in one executor (i.e., in one process).
- Make device resources transparent to the application. Instead of specifying which GPUs to use, users should only need to say how many GPUs they want. This is important for clusters that use Yarn for resource management.
- Upload and distribute MXNet core library to all the worker nodes automatically.
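For the failover item: one plausible client-side mitigation, independent of any ps-lite internals, is to retry the worker's connection to the scheduler with exponential backoff instead of giving up on the first failure. The sketch below is purely illustrative; `connect` stands in for whatever call actually establishes the scheduler connection and is not a real ps-lite API.

```python
import time

def reconnect_with_backoff(connect, max_retries=5, base_delay=0.1, sleep=time.sleep):
    """Hypothetical sketch: a restarted KVStore worker retrying its
    connection to the scheduler with exponential backoff. `connect` is a
    placeholder for the real connection call."""
    for attempt in range(max_retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            # back off: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

Backoff alone will not fix the reconnect problem if the scheduler itself refuses re-registration, but it at least distinguishes transient network failures from the permanent failure described above.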
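The timeout idea for the scheduler/servers could look roughly like the following: a watchdog thread that is notified on every received message and terminates the process after T idle seconds. All names here (`IdleWatchdog`, `notify_message`) are hypothetical; the real fix would live inside the spawned ps-lite process.

```python
import threading
import time

class IdleWatchdog:
    """Hypothetical sketch: stop a ps-lite scheduler/server that has
    received no message for `timeout` seconds."""

    def __init__(self, timeout, on_timeout):
        self.timeout = timeout
        self.on_timeout = on_timeout  # e.g. sys.exit in the spawned process
        self._last_msg = time.monotonic()
        self._lock = threading.Lock()
        self._stopped = False
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def notify_message(self):
        # Call this whenever the scheduler/server receives any message.
        with self._lock:
            self._last_msg = time.monotonic()

    def _watch(self):
        # Poll at a fraction of the timeout so firing is reasonably prompt.
        while not self._stopped:
            time.sleep(min(self.timeout / 10.0, 1.0))
            with self._lock:
                idle = time.monotonic() - self._last_msg
            if idle >= self.timeout:
                self.on_timeout()
                return

    def stop(self):
        self._stopped = True
```

A heartbeat from the scheduler to the servers would make this more robust (an idle-but-healthy server would not kill itself), but even the plain idle timeout prevents orphaned processes from lingering forever.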
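On the singleton problem: one direction, sketched here with stand-in names only, is to scope the ps-lite state per executor thread rather than per process, so that multiple Spark tasks sharing one executor each get their own handle. `PSLiteHandle` below is a placeholder, not the real ps-lite state.

```python
import threading

class PSLiteHandle:
    """Stand-in for per-task ps-lite state (illustrative only)."""
    def __init__(self, task_id):
        self.task_id = task_id

_local = threading.local()

def get_handle(task_id):
    # One instance per executor thread instead of one per process, so
    # multiple Spark tasks in the same executor do not collide on shared
    # singleton state.
    if not hasattr(_local, "handle"):
        _local.handle = PSLiteHandle(task_id)
    return _local.handle
```

The real change is harder because ps-lite's singleton lives in native code, but the thread-local pattern shows the isolation the Spark side needs.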
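For device transparency, the intended behavior can be sketched as a small allocation step inside the framework: the user asks for a count, and concrete device ids come from whatever the resource manager (e.g. Yarn) reports as free. The function and its arguments are hypothetical.

```python
def assign_gpus(num_wanted, free_gpu_ids):
    """Hypothetical sketch: map a requested GPU count to concrete device
    ids. The user never hard-codes mx.gpu(0), mx.gpu(1), ...; the mapping
    stays inside the framework."""
    if num_wanted > len(free_gpu_ids):
        raise RuntimeError(
            "requested %d GPUs but only %d are free"
            % (num_wanted, len(free_gpu_ids)))
    return list(free_gpu_ids[:num_wanted])
```

With this in place, a task that asks for 2 GPUs on a node where devices 3, 5, and 6 are free would transparently be handed ids 3 and 5.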