Distributed training port bind error

multi-host
linux
docs
#1

Thanks for reading my post. I have come across this problem in both the official tutorial (image classification) and a simple example (https://gluon.mxnet.io/chapter07_distributed-learning/training-with-multiple-machines.html).
##Error message##
Traceback (most recent call last):
  File "store.py", line 3, in <module>
    store = kv.create('dist')
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore.py", line 674, in create
    ctypes.byref(handle)))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed
#######################

##Description##
I am trying distributed training on two Ubuntu servers. Each of them has one GPU, though that is probably not related to the problem.

I installed mxnet-cu90 with pip, and I also git cloned mxnet (https://github.com/apache/incubator-mxnet) to my home directory.

The command is simple: "~/incubator-mxnet/tools/launch.py -H host -n 2 python3 my_prog.py"

The host file contains:
server1
server2
Both of them can be reached over SSH without a password.
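For anyone reproducing this setup, here is a minimal sketch (not part of MXNet or launch.py) for sanity-checking a host file like the one above before launching. The function names read_hosts and resolve_hosts are made up for illustration, and the sketch only checks that each name resolves to an IPv4 address; it does not test SSH access.

```python
import socket


def read_hosts(path):
    """Return the non-empty, whitespace-stripped lines of a host file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def resolve_hosts(hosts):
    """Map each hostname to its IPv4 address, or None if it does not resolve."""
    resolved = {}
    for host in hosts:
        try:
            resolved[host] = socket.gethostbyname(host)
        except OSError:
            # Covers socket.gaierror, which is raised for unknown names.
            resolved[host] = None
    return resolved
```

Calling resolve_hosts(read_hosts("host")) on each machine should show the same address for server1 and for server2; a None entry would suggest a name-resolution problem rather than a port problem.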

##MORE Error Info##
Traceback (most recent call last):
  File "store.py", line 3, in <module>
    store = kv.create('dist')
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore.py", line 674, in create
    ctypes.byref(handle)))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x39008a) [0x7fbbb075408a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x3906c1) [0x7fbbb07546c1]
[bt] (2) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31d1f3a) [0x7fbbb3595f3a]
[bt] (3) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31dbe3a) [0x7fbbb359fe3a]
[bt] (4) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31cd129) [0x7fbbb3591129]
[bt] (5) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2c80258) [0x7fbbb3044258]
[bt] (6) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2c80916) [0x7fbbb3044916]
[bt] (7) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(MXKVStoreCreate+0x20) [0x7fbbb2e3e950]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7fbc2996fe20]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7fbc2996f88b]

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/envy/incubator-mxnet/tools/…/3rdparty/dmlc-core/tracker/dmlc_tracker/ssh.py", line 62, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no pub1 -p 22 'export DMLC_PS_ROOT_URI=192.168.20.217; export DMLC_ROLE=wo
r; export DMLC_PS_ROOT_PORT=9100; export DMLC_NUM_WORKER=2; export DMLC_NODE_HOST=pub1; export DMLC_NUM_SERVER=2; cd /home/envy/dis

t/; python3 store.py’’ returned non-zero exit status 1

Traceback (most recent call last):
  File "store.py", line 3, in <module>
    store = kv.create('dist')
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore.py", line 674, in create
    ctypes.byref(handle)))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x39008a) [0x7f63db54108a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x3906c1) [0x7f63db5416c1]
[bt] (2) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31d1f3a) [0x7f63de382f3a]
[bt] (3) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31dbe3a) [0x7f63de38ce3a]
[bt] (4) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31cd129) [0x7f63de37e129]
[bt] (5) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2c80258) [0x7f63dde31258]
[bt] (6) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2c80916) [0x7f63dde31916]
[bt] (7) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(MXKVStoreCreate+0x20) [0x7f63ddc2b950]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f6454732e20]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f645473288b]

Traceback (most recent call last):
  File "store.py", line 2, in <module>
    from mxnet import kv, nd
  File "/usr/local/lib/python3.5/dist-packages/mxnet/__init__.py", line 57, in <module>
    from . import kvstore_server
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x39008a) [0x7f7de8a9408a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x3906c1) [0x7f7de8a946c1]
[bt] (2) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31d1f3a) [0x7f7deb8d5f3a]
[bt] (3) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31dbe3a) [0x7f7deb8dfe3a]
[bt] (4) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31cd129) [0x7f7deb8d1129]
[bt] (5) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2c8cb53) [0x7f7deb390b53]
[bt] (6) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x88) [0x7f7deb17f7f8]
[bt] (7) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f7e61c85e20]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f7e61c8588b]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7f7e61c8001a]

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/envy/incubator-mxnet/tools/…/3rdparty/dmlc-core/tracker/dmlc_tracker/ssh.py", line 62, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command ‘ssh -o StrictHostKeyChecking=no server1 -p 22 ‘export DMLC_PS_ROOT_URI=192.168.20.217; export DMLC_ROLE
rker; export DMLC_PS_ROOT_PORT=9100; export DMLC_NUM_WORKER=2; export DMLC_NODE_HOST=server1; export DMLC_NUM_SERVER=2; cd /home/env
is_test/; python3 store.py’’ returned non-zero exit status 1

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/envy/incubator-mxnet/tools/…/3rdparty/dmlc-core/tracker/dmlc_tracker/ssh.py", line 62, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command ‘ssh -o StrictHostKeyChecking=no server1 -p 22 ‘export DMLC_PS_ROOT_URI=192.168.20.217; export DMLC_ROLE
rver; export DMLC_PS_ROOT_PORT=9100; export DMLC_NUM_WORKER=2; export DMLC_NODE_HOST=server1; export DMLC_NUM_SERVER=2; cd /home/env
is_test/; python3 store.py’’ returned non-zero exit status 1

Traceback (most recent call last):
  File "store.py", line 2, in <module>
    from mxnet import kv, nd
  File "/usr/local/lib/python3.5/dist-packages/mxnet/__init__.py", line 57, in <module>
    from . import kvstore_server
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore_server.py", line 85, in <module>
    _init_kvstore_server_module()
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore_server.py", line 82, in _init_kvstore_server_module
    server.run()
  File "/usr/local/lib/python3.5/dist-packages/mxnet/kvstore_server.py", line 73, in run
    check_call(_LIB.MXKVStoreRunServer(self.handle, _ctrl_proto(self._controller()), None))
  File "/usr/local/lib/python3.5/dist-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [16:33:33] src/van.cc:291: Check failed: (my_node_.port) != (-1) bind failed

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x39008a) [0x7f518efed08a]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x3906c1) [0x7f518efed6c1]
[bt] (2) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31d1f3a) [0x7f5191e2ef3a]
[bt] (3) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31dbe3a) [0x7f5191e38e3a]
[bt] (4) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x31cd129) [0x7f5191e2a129]
[bt] (5) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2c8cb53) [0x7f51918e9b53]
[bt] (6) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(MXKVStoreRunServer+0x88) [0x7f51916d87f8]
[bt] (7) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f5208208e20]
[bt] (8) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f520820888b]
[bt] (9) /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a) [0x7f520820301a]

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/envy/incubator-mxnet/tools/…/3rdparty/dmlc-core/tracker/dmlc_tracker/ssh.py", line 62, in run
    subprocess.check_call(prog, shell = True)
  File "/usr/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no pub1 -p 22 'export DMLC_PS_ROOT_URI=192.168.20.217; export DMLC_ROLE=se
r; export DMLC_PS_ROOT_PORT=9100; export DMLC_NUM_WORKER=2; export DMLC_NODE_HOST=pub1; export DMLC_NUM_SERVER=2; cd /home/envy/dis

t/; python3 store.py’’ returned non-zero exit status 1

Thanks again for your help.

#2

It is likely a problem with the port. Can you check whether the port is open? You can use the command "netstat -nlp" or "netstat -an | grep $PORTNUMBER".
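If netstat is not available, the same check can be approximated from Python by trying to bind the port directly. This is only a rough sketch using the standard socket module: the helper name can_bind is made up here, and 9100 is simply the DMLC_PS_ROOT_PORT value that appears in the log above.

```python
import socket


def can_bind(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
        return True
```

Running can_bind(9100) on the machine that fails should return False if another process already holds the port, and True if nothing is bound to it at that moment.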

#4

Thanks.
However, the port may not be the problem.
I ran "netstat -apn | grep 9100" and found that the port is used only by the Python program itself. Actually, I found that launch.py automatically finds an unused port to set up the connection.
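To illustrate what "finding an unused port" can look like, here is a rough sketch of one common approach: scan a range and return the first port that can be bound. This is not copied from launch.py or the dmlc tracker; the function name find_free_port and the default range are assumptions chosen for illustration.

```python
import socket


def find_free_port(start=9091, end=9999, host="0.0.0.0"):
    """Return the first port in [start, end) that can be bound, or None."""
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is busy or not permitted; try the next one.
            return port
    return None
```

Note that a port found this way is only known to be free at the moment of the check; another process can still take it before the real server binds it.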