I am trying to launch a distributed MXNet (Gluon) program, following the tutorial here. I am working on an HPC cluster that uses the SLURM workload manager. This is my job submission file:
#!/bin/bash -l
#SBATCH --job-name="DSTR"
#SBATCH -t 2-00:01:30
#SBATCH --nodes=10
#SBATCH --cpus-per-task=28
#SBATCH --gres=gpu:4
#SBATCH --mem=128gb

./get_nodes_names.sh > workers_ip.txt ## This writes nodes in the ascii file

srun python /data/dia021/Software/mxnet/tools/launch.py \
    -n $(wc -l < workers_ip.txt) \
    -s $(wc -l < workers_ip.txt) \
    -H workers_ip.txt \
    --sync-dst-dir /home/dia021/Projects/isprs_potsdam/distributed \
    --launcher ssh \
    "python main.py"
I get the following errors:
Can't load dmlc_tracker package. Perhaps you need to run
    git submodule update --init --recursive
Traceback (most recent call last):
  File "/data/dia021/Software/mxnet/tools/launch.py", line 128, in <module>
    main()
  File "/data/dia021/Software/mxnet/tools/launch.py", line 96, in main
    args = dmlc_opts(args)
  File "/data/dia021/Software/mxnet/tools/launch.py", line 48, in dmlc_opts
    from dmlc_tracker import opts
ModuleNotFoundError: No module named 'dmlc_tracker'
I have verified that I have a password-less ssh connection to all nodes. The ascii file workers_ip.txt contains the names of the nodes that are available for the run, e.g.:
b031
b034
b035
b041
b042
b063
b064
b065
b066
b067
I am using a local Python installation that is visible to all machines. MXNet was installed through
pip install mxnet-cu91 --pre
Do I need to somehow install components from the dmlc-core package? I can't find instructions on how to do so. Tracking down the problem, it originates in mxnet/tools/launch.py, line 48 (the first import of dmlc_tracker).
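To narrow this down, a quick check like the one below could confirm whether dmlc_tracker is importable from the Python installation that launch.py runs under. This is only a sketch: the helper name module_available and its extra_paths argument are my own and are not part of MXNet.

```python
import importlib.util
import sys


def module_available(name, extra_paths=()):
    """Return True if the top-level module `name` can be found.

    `extra_paths` (hypothetical helper argument) lists directories that are
    temporarily put at the front of sys.path for the lookup only.
    """
    saved_path = list(sys.path)
    try:
        sys.path[:0] = list(extra_paths)
        importlib.invalidate_caches()
        # find_spec returns None when no importer can locate the module.
        return importlib.util.find_spec(name) is not None
    finally:
        # Restore the original search path so the check has no side effects.
        sys.path[:] = saved_path


if __name__ == "__main__":
    print("dmlc_tracker importable:", module_available("dmlc_tracker"))
```

If the tracker only exists inside a source checkout, its directory (wherever it sits in that tree) could be passed through extra_paths to see whether adding it to the search path would make the import succeed.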
mxnet tag: 1.3.0b20180621
Thank you very much!