I am trying to train my neural network on more than one node of a cluster that uses SLURM.
I am following this example: https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training
This is my launch file:
#SBATCH --gres=gpu:2 # Number of GPUs (per node)
module load pytorch/0.4.1-py36-cuda90
module load horovod/0.15.1-py36-cuda90
module load mxnet/1.4.1-py36-cuda90
./get_ip.sh > ip.txt
echo "all loaded"
python /home/user/tools/launch.py -n 2 -s 2 -H ip.txt --sync-dst-dir /home/user/run/ --launcher ssh "/home/user/run/python cifar10_dist.py"
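One pitfall worth checking: the trailing command must reach launch.py as a single argument, which only works with straight ASCII quotes. Curly "smart" quotes (which often sneak in when a command is copied from a word processor or web page) are ordinary characters to the shell. A quick check with Python's shlex, which mimics POSIX shell word splitting:

```python
import shlex

# With straight quotes the trailing command is grouped into ONE argument;
# with curly quotes the shell splits it into two tokens, and launch.py
# receives a command it cannot run.
straight = 'launch.py --launcher ssh "python cifar10_dist.py"'
curly = 'launch.py --launcher ssh \u201cpython cifar10_dist.py\u201d'

print(shlex.split(straight))  # last element: 'python cifar10_dist.py'
print(shlex.split(curly))     # last two elements still carry the curly quotes
```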
The problem is that the python command is not executed. I get the output "all loaded", but then nothing, not even an error message. I also added a print statement as the first line after main in both launch.py and cifar10_dist.py, and neither ever executes. Does anyone know what the error could be?
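Plain print statements can also be swallowed by stdout buffering when the script is started over ssh, so "no output" does not always mean "not executed". A small helper I would use for this kind of check (a sketch; the `log` function name is my own) writes to stderr and flushes explicitly, so the message survives even if the remote process dies early:

```python
import sys
from datetime import datetime

def log(msg):
    # Write to stderr with an explicit flush so the message is not lost
    # in a stdout buffer when the process is launched over ssh and
    # killed before the buffer is drained.
    sys.stderr.write(f"[{datetime.now().isoformat()}] {msg}\n")
    sys.stderr.flush()

# Put this at the very top of cifar10_dist.py, before any MXNet import:
log("cifar10_dist.py reached top of script")
```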