Distributed training


I am trying to train my neural net on more than one node of a cluster that uses SLURM.

I am trying to follow this example: https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training

This is my launch file:


#SBATCH --job-name="distCNN"

#SBATCH --time=00:05:00
#SBATCH --mem=8gb
#SBATCH --nodes=2
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2 # Number of GPUs (per node)
#SBATCH --array=1
#SBATCH --qos=express

module load pytorch/0.4.1-py36-cuda90
module load horovod/0.15.1-py36-cuda90
module load mxnet/1.4.1-py36-cuda90

./get_ip.sh > ip.txt

echo "all loaded"
python /home/user/tools/launch.py -n 2 -s 2 -H ip.txt --sync-dst-dir /home/user/run/ --launcher ssh "/home/user/run/python cifar10_dist.py"
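For reference, get_ip.sh is not shown above; a minimal sketch of what such a helper might do inside a SLURM job (this is my assumption of its contents, not the actual script) is:

```shell
#!/bin/bash
# Hypothetical sketch of get_ip.sh: expand the compressed node list that
# SLURM provides (e.g. node[01-02]) into one hostname per line, which is
# the hostfile format launch.py expects via -H.
# Assumes this runs inside an allocation, so SLURM_JOB_NODELIST is set.
scontrol show hostnames "$SLURM_JOB_NODELIST" > ip.txt
```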

The problem is that the python command is never executed. I get the output "all loaded", but then nothing, not even an error message. I also added a print statement as the first line after main in both launch.py and cifar10_dist.py, and they never execute. Does anyone know what the error could be?
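One thing worth trying to make the silent failure visible is capturing the launcher's stdout and stderr in a file and printing its exit code. A minimal sketch of the pattern, with a placeholder command standing in for the real launch.py invocation:

```shell
#!/bin/bash
# Sketch: capture both output streams of the launch step so any error
# message survives, then report the exit code.
# "python -c ..." is only a placeholder for the real launch.py call.
python -c 'print("worker started")' > launch.log 2>&1
echo "launcher exit code: $?"
cat launch.log
```

If launch.log stays empty and the exit code is nonzero, the failure happens before the remote command runs (e.g. in the SSH or rsync step), which would be consistent with the workers never printing anything.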