Can't load dmlc_tracker package


#1

Dear all,

I am trying to launch a distributed MXNet (Gluon) program, following the tutorial here. I am working on an HPC cluster that uses the SLURM workload manager. This is my job submission file:

#!/bin/bash -l

#SBATCH --job-name="DSTR"
#SBATCH -t 2-00:01:30
#SBATCH --nodes=10
#SBATCH --cpus-per-task=28
#SBATCH --gres=gpu:4
#SBATCH --mem=128gb


./get_nodes_names.sh > workers_ip.txt  # Writes the node names, one per line, to a plain-text file

srun python /data/dia021/Software/mxnet/tools/launch.py  -n $(wc -l < workers_ip.txt) -s  $(wc -l < workers_ip.txt) -H workers_ip.txt --sync-dst-dir /home/dia021/Projects/isprs_potsdam/distributed  --launcher ssh "python main.py"

I get the following errors:

Can't load dmlc_tracker package.  Perhaps you need to run
    git submodule update --init --recursive
Traceback (most recent call last):
  File "/data/dia021/Software/mxnet/tools/launch.py", line 128, in <module>
    main()
  File "/data/dia021/Software/mxnet/tools/launch.py", line 96, in main
    args = dmlc_opts(args)
  File "/data/dia021/Software/mxnet/tools/launch.py", line 48, in dmlc_opts
    from dmlc_tracker import opts
ModuleNotFoundError: No module named 'dmlc_tracker'

I have verified that I have passwordless SSH access to all nodes. The plain-text file workers_ip.txt contains the names of the nodes that are available for the run, e.g.:

b031
b034
b035
b041
b042
b063
b064
b065
b066
b067
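
Since the launch command takes its -n and -s values from wc -l, a stray blank or repeated line in workers_ip.txt would inflate the counts. Below is a rough sketch of a sanity check for the file contents. The helper name read_hosts is made up for illustration, and the one-host-name-per-line format is an assumption based on the listing above.

```python
def read_hosts(text):
    """Return the non-empty, de-duplicated host names in `text`, in order."""
    hosts = []
    for line in text.splitlines():
        name = line.strip()
        # Skip blank lines and names that were already seen.
        if name and name not in hosts:
            hosts.append(name)
    return hosts


if __name__ == "__main__":
    # Sample content with one blank line and one repeated host.
    sample = "b031\nb034\n\nb035\nb034\n"
    hosts = read_hosts(sample)
    print(len(hosts), hosts)
```

In practice the text would come from open("workers_ip.txt").read(), and len(hosts) could be compared against the number of nodes requested from SLURM.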

I am using a local Python installation that is visible to all machines. MXNet was installed through

pip install mxnet-cu91 --pre 

Do I need to somehow install components from the dmlc-core package? I can’t find instructions on how to do so. I tracked the problem down to mxnet/tools/launch.py, line 48 (the first import of dmlc_tracker).
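
One way to confirm that only the tracker module is missing (and not MXNet itself) is to ask Python whether each module can be located, without actually importing it. This is a minimal sketch: the helper name module_available is hypothetical, and it relies only on the standard library's importlib.util.find_spec.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run this with the same interpreter that executes launch.py.
    for mod in ("mxnet", "dmlc_tracker"):
        status = "found" if module_available(mod) else "MISSING"
        print(mod, status)
```

Running it on a compute node as well as on the login node would show whether both see the same Python environment.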

mxnet tag: 1.3.0b20180621

Thank you very much!
Foivos


#2

Update: I did the following in my local software installation:

cd /home/feevos/Software
git clone https://github.com/dmlc/dmlc-core.git
cd dmlc-core
make 

Then I added the following line to my .bashrc:

export PYTHONPATH=$PYTHONPATH:/home/feevos/Software/dmlc-core/tracker

Now the import on line 48 of launch.py works when I run it in Python:

from dmlc_tracker import opts

so I expect that ~/mxnet/tools/launch.py should work. I will test tomorrow and edit this response.
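
For anyone who cannot edit .bashrc on the cluster, roughly the same effect can be had per process by extending sys.path before the import. This is only a sketch under assumptions: the helper name add_to_sys_path is made up, and the tracker directory is the one from my clone above, so adjust it to your own location.

```python
import os
import sys


def add_to_sys_path(directory):
    """Append `directory` to sys.path if it exists; return True on success."""
    directory = os.path.abspath(os.path.expanduser(directory))
    if not os.path.isdir(directory):
        return False
    if directory not in sys.path:
        sys.path.append(directory)
    return True


if __name__ == "__main__":
    # Assumed location of the dmlc-core clone; change to match your setup.
    tracker_dir = "/home/feevos/Software/dmlc-core/tracker"
    if add_to_sys_path(tracker_dir):
        try:
            from dmlc_tracker import opts  # same import as launch.py line 48
            print("dmlc_tracker imported from", tracker_dir)
        except ImportError as err:
            print("directory exists but import failed:", err)
    else:
        print("tracker directory not found:", tracker_dir)
```

Unlike the PYTHONPATH export, this affects only the process that runs it, so it would not by itself help processes started on the remote nodes over ssh.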

This community motivates me! :)
Cheers


#3

Update: I was able to reproduce this and have filed an issue.


Glad you got it working. But you shouldn’t have to install dmlc-core manually. I’ll try to reproduce this and let you know.


#4

Thank you for your help @indu.


#5

The pip package will include dmlc_tracker starting from tonight’s nightly build.


#6

Thank you guys, you are an awesome team!