Dear all,
I am trying to launch a distributed MXNet (Gluon) program, following the distributed training tutorial. I am working on an HPC cluster that uses the SLURM workload manager. This is my job submission file:
#!/bin/bash -l
#SBATCH --job-name="DSTR"
#SBATCH -t 2-00:01:30
#SBATCH --nodes=10
#SBATCH --cpus-per-task=28
#SBATCH --gres=gpu:4
#SBATCH --mem=128gb
./get_nodes_names.sh > workers_ip.txt  # Writes the names of the allocated nodes to a text file
srun python /data/dia021/Software/mxnet/tools/launch.py -n $(wc -l < workers_ip.txt) -s $(wc -l < workers_ip.txt) -H workers_ip.txt --sync-dst-dir /home/dia021/Projects/isprs_potsdam/distributed --launcher ssh "python main.py"
I get the following errors:
Can't load dmlc_tracker package. Perhaps you need to run
    git submodule update --init --recursive
Traceback (most recent call last):
  File "/data/dia021/Software/mxnet/tools/launch.py", line 128, in <module>
    main()
  File "/data/dia021/Software/mxnet/tools/launch.py", line 96, in main
    args = dmlc_opts(args)
  File "/data/dia021/Software/mxnet/tools/launch.py", line 48, in dmlc_opts
    from dmlc_tracker import opts
ModuleNotFoundError: No module named 'dmlc_tracker'
I have verified that I have passwordless SSH access to all nodes. The text file workers_ip.txt contains the names of the nodes allocated to the run, e.g.:
b031
b034
b035
b041
b042
b063
b064
b065
b066
b067
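For illustration, a host list like the one above can be derived from the compact node list that SLURM exposes in the SLURM_JOB_NODELIST environment variable. The sketch below is not the contents of get_nodes_names.sh; the function name expand_nodelist is made up for this example, and it only handles simple entries with at most one bracket group each. The command scontrol show hostnames is the authoritative tool for this.

```python
import re


def expand_nodelist(nodelist):
    """Expand a compact node list such as 'b[031,034-035]' into host names.

    Simplified sketch: each comma-separated entry may contain at most one
    bracket group; zero padding of range bounds is preserved.
    """
    hosts = []
    # Split on commas that sit outside brackets.
    entries = re.findall(r"[^,\[]+(?:\[[^\]]*\])?[^,]*", nodelist.strip())
    for entry in entries:
        match = re.fullmatch(r"([^\[]*)\[([^\]]*)\](.*)", entry)
        if match is None:
            hosts.append(entry)  # plain host name, nothing to expand
            continue
        prefix, body, suffix = match.groups()
        for part in body.split(","):
            if "-" in part:
                start, end = part.split("-")
                width = len(start)  # keep leading zeros, e.g. 031
                for i in range(int(start), int(end) + 1):
                    hosts.append(f"{prefix}{str(i).zfill(width)}{suffix}")
            else:
                hosts.append(f"{prefix}{part}{suffix}")
    return hosts
```

Writing "\n".join(expand_nodelist(os.environ["SLURM_JOB_NODELIST"])) to workers_ip.txt would then give one host name per line, which is the layout that the wc -l count in the job file relies on.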
I am using a local Python installation that is visible to all machines. MXNet was installed through
pip install mxnet-cu91 --pre
Do I need to install components from the dmlc-core package somehow? I can't find instructions on how to do so. I tracked the problem down to mxnet/tools/launch.py, line 48 (the first import of dmlc_tracker).
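To narrow this down, a small check like the following could show whether the tracker sources are present next to launch.py and whether Python can import them. This is only a sketch: the two relative directories are an assumption about where the dmlc-core submodule would sit in a source checkout, and the helper names find_tracker_dir and tracker_importable are made up for illustration.

```python
import importlib
import importlib.util
import os
import sys

# Assumption: possible locations of the dmlc-core tracker relative to tools/.
CANDIDATE_TRACKER_DIRS = (
    "../3rdparty/dmlc-core/tracker",
    "../dmlc-core/tracker",
)


def find_tracker_dir(launch_py):
    """Return the first candidate directory that contains dmlc_tracker, or None."""
    tools_dir = os.path.dirname(os.path.abspath(launch_py))
    for rel in CANDIDATE_TRACKER_DIRS:
        candidate = os.path.normpath(os.path.join(tools_dir, rel))
        if os.path.isdir(os.path.join(candidate, "dmlc_tracker")):
            return candidate
    return None


def tracker_importable(extra_path=None):
    """Report whether 'import dmlc_tracker' would succeed, optionally after
    putting extra_path at the front of sys.path."""
    if extra_path and extra_path not in sys.path:
        sys.path.insert(0, extra_path)
    importlib.invalidate_caches()
    return importlib.util.find_spec("dmlc_tracker") is not None
```

If find_tracker_dir("/data/dia021/Software/mxnet/tools/launch.py") returns None, the submodule directory is probably empty, which would match the hint printed before the traceback.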
mxnet tag: 1.3.0b20180621
Thank you very much!
Foivos