Distributed training

gab · January 15, 2020, 4:22am

Hi

I am trying to train my neural net on more than one node of a cluster that uses SLURM.

I am trying to run this example: https://github.com/apache/incubator-mxnet/tree/master/example/distributed_training

This is my launch file:

#!/bin/bash

#SBATCH --job-name="distCNN"

#SBATCH --time=00:05:00
#SBATCH --mem=8gb
#SBATCH --nodes=2
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2 # Number of GPUs (per node)
#SBATCH --array=1
#SBATCH --qos=express

module load pytorch/0.4.1-py36-cuda90
module load horovod/0.15.1-py36-cuda90
module load mxnet/1.4.1-py36-cuda90

./get_ip.sh > ip.txt

echo "all loaded"
python /home/user/tools/launch.py -n 2 -s 2 -H ip.txt --sync-dst-dir /home/user/run/ --launcher ssh "/home/user/run/python cifar10_dist.py"

The problem is that the python command never seems to be executed. I get the output "all loaded", but then nothing, not even an error message. I also added a print statement as the first line after main in both launch.py and cifar10_dist.py, and neither of them is printed. Does anyone know what the error could be?
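To narrow down a silent failure like this, one option is to run the launcher through a small wrapper that records the exit code, stdout and stderr explicitly, and to check that the host file really contains entries. This is only a rough sketch: `read_hosts` and `run_and_report` are made-up helper names, not part of MXNet or launch.py, and the code assumes that ip.txt holds one host per line. It avoids `capture_output` so that it also runs on Python 3.6.

```python
import subprocess
import sys


def read_hosts(hostfile_text):
    """Return the non-empty, stripped lines of a host file (assumed one host per line)."""
    return [line.strip() for line in hostfile_text.splitlines() if line.strip()]


def run_and_report(cmd, timeout=None):
    """Run cmd and return (exit code, stdout, stderr) so that no output is lost."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode bytes to str; works on Python 3.6
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


def diagnose(hostfile_path, launch_cmd):
    """Hypothetical helper: print the hosts, then everything the launcher produced."""
    with open(hostfile_path) as f:
        hosts = read_hosts(f.read())
    print("hosts found:", hosts, flush=True)
    code, out, err = run_and_report(launch_cmd)
    print("exit code:", code, flush=True)
    print("stdout:", out, flush=True)
    print("stderr:", err, file=sys.stderr, flush=True)
    return code


# Example call, using the same arguments as the launch line in the script:
# diagnose("ip.txt", [sys.executable, "/home/user/tools/launch.py",
#                     "-n", "2", "-s", "2", "-H", "ip.txt",
#                     "--sync-dst-dir", "/home/user/run/", "--launcher", "ssh",
#                     "/home/user/run/python cifar10_dist.py"])
```

If the host list comes back empty or the exit code is non-zero, that at least shows whether the problem lies in get_ip.sh, in the launcher, or in the command passed to it.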

Cheers
Gab
