Distributed training


#1

I'm having trouble training MNIST in distributed mode with MXNet. I get the following output:

Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/lm/incubator-mxnet/example/image-classification/common/find_mxnet.py", line 24, in <module>
    import mxnet as mx
  File "/home/lm/incubator-mxnet/example/image-classification/common/…/…/…/python/mxnet/__init__.py", line 25, in <module>
    from . import engine
  File "/home/lm/incubator-mxnet/example/image-classification/common/…/…/…/python/mxnet/engine.py", line 23, in <module>
    from .base import _LIB, check_call
  File "/home/lm/incubator-mxnet/example/image-classification/common/…/…/…/python/mxnet/base.py", line 29, in <module>
    import numpy as np
ImportError: No module named numpy
Traceback (most recent call last):
  File "train_mnist.py", line 25, in <module>
    from common import find_mxnet, fit
  File "/home/lm/incubator-mxnet/example/image-classification/common/find_mxnet.py", line 24, in <module>
    import mxnet as mx
  File "/home/lm/incubator-mxnet/example/image-classification/common/…/…/…/python/mxnet/__init__.py", line 25, in <module>
    from . import engine
  File "/home/lm/incubator-mxnet/example/image-classification/common/…/…/…/python/mxnet/engine.py", line 23, in <module>
    from .base import _LIB, check_call
  File "/home/lm/incubator-mxnet/example/image-classification/common/…/…/…/python/mxnet/base.py", line 29, in <module>
    import numpy as np
ImportError: No module named numpy
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/lm/anaconda3/envs/lm2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/home/lm/anaconda3/envs/lm2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/lm/incubator-mxnet/tools/…/dmlc-core/tracker/dmlc_tracker/ssh.py", line 61, in run
    subprocess.check_call(prog, shell = True)
  File "/home/lm/anaconda3/envs/lm2/lib/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no lm@10.10.143.238 -p 22 'export LD_LIBRARY_PATH=/home/lm/PSPNet/build/lib:/usr/local/cuda-8.0/lib64; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9108; export DMLC_PS_ROOT_URI=10.10.143.108; export DMLC_NUM_SERVER=1; export DMLC_NUM_WORKER=1; cd /home/lm/incubator-mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_device_sync'' returned non-zero exit status 1

Training MNIST on a single machine works without problems, and I can import numpy in Python there.
What can I do?


#2

Can you copy paste your entire code please?


#4

Thank you for your reply.
Here is everything I ran. I first cd into the ~/incubator-mxnet/example/image-classification directory, then run
python …/…/tools/launch.py -n 1 --launcher ssh -H hostfile python train_mnist.py --network lenet --kv-store dist_device_sync
or
python …/…/tools/launch.py -n 2 --launcher ssh -H hostfile python train_mnist.py --network lenet --kv-store dist_device_sync
The hostfile contains either one or two IP addresses.
Either way I get the same error.
I debugged the code, and the main problem is that numpy cannot be imported in base.py, even though importing numpy works when I train on a single machine (non-distributed). What did I do wrong?
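
One thing worth checking here: the tracker traceback above runs from the anaconda environment /home/lm/anaconda3/envs/lm2, while the command sent over ssh calls plain `python`. It is possible (an assumption, not something the log confirms) that `python` in the ssh session is a different interpreter that has no numpy. A minimal sketch to see which interpreter is in use and whether it can import numpy; the file name `check_env.py` and the function `describe_environment` are made up for illustration:

```python
# check_env.py (hypothetical name): report the running interpreter and
# whether a given module (numpy by default) can be imported by it.
import importlib
import sys


def describe_environment(module_name="numpy"):
    """Return interpreter details and the import status of module_name."""
    info = {
        "executable": sys.executable,        # path of the interpreter in use
        "version": sys.version.split()[0],   # e.g. "2.7.x" or "3.x.y"
        "module_found": False,
        "module_path": None,
    }
    try:
        module = importlib.import_module(module_name)
        info["module_found"] = True
        info["module_path"] = getattr(module, "__file__", None)
    except ImportError as exc:
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("%s: %s" % (key, value))
```

Running this once in the local shell and once through `ssh <host> python check_env.py` and comparing the `executable` lines would show whether the two sessions use the same Python.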


#5

Can you confirm that each machine has Python, MXNet, numpy, etc. installed?
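
A rough sketch of how that could be checked from the launching machine, assuming passwordless ssh to every host and a hostfile with one address per line as described above. It runs a plain `python -c` on each host over ssh, the same way the failing worker command was started, and reports whether numpy can be imported there. The helper names (`read_hosts`, `build_check_command`, `check_host`) are hypothetical and not part of MXNet or dmlc-core:

```python
# Sketch: check over ssh that "python" on each host can import numpy.
import subprocess


def read_hosts(path):
    """Return the non-empty lines of a hostfile, one host per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def build_check_command(host, module="numpy", python="python"):
    """Build the ssh argument list that tries to import `module` on `host`."""
    remote = "%s -c 'import sys, %s; print(sys.executable)'" % (python, module)
    return ["ssh", "-o", "StrictHostKeyChecking=no", host, remote]


def check_host(host):
    """Run the check; return (succeeded, interpreter path or error text)."""
    proc = subprocess.Popen(build_check_command(host),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    text = (out or err).decode("utf-8", "replace").strip()
    return proc.returncode == 0, text


if __name__ == "__main__":
    for host in read_hosts("hostfile"):
        ok, detail = check_host(host)
        print("%s: %s (%s)" % (host, "OK" if ok else "FAILED", detail))
```

If a host reports FAILED with the same ImportError, numpy is missing for the interpreter that a non-interactive ssh session picks up on that machine, even if an interactive shell there uses a different one.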