Question about Distributed Training using launch.py

Dear All,

I am following the official documentation to build my DL cluster on MXNet, but I ran into a strange problem when trying to train with this command:

…/…/tools/launch.py -n 2 -H hosts --launcher ssh --sync-dst-dir /tmp/mxnet_job/ python image_classification.py --dataset cifar10 --model vgg11 --epochs 1 --kvstore dist_sync

Server and worker processes on both machines start normally, but the training never begins. It just gets stuck at this log output:

INFO:root:Starting new image-classification task:, Namespace(…)
INFO:root:Starting new image-classification task:, Namespace(…)

I added some logging calls in the train() function, but it is never called.
When I train with 1 machine (also using --launcher ssh, not local), everything works fine.

Could anyone please help me with this problem? Thanks a lot!

Most probably the machines cannot connect to each other. I would recommend that you:

  1. Increase log verbosity by setting the environment variable PS_VERBOSE to 2, and see if you can find 4 lines (2 for workers, 1 for scheduler, 1 for parameter server) containing BARRIER_REACHED. See this video for details: https://www.youtube.com/watch?v=DOAnj36i6ho

  2. Check your hosts file, and put the IP addresses in it directly as a test.

  3. Check whether the TCP ports can be opened. MXNet opens almost arbitrary TCP ports to connect the machines together.
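
For steps 2 and 3, a quick reachability test run from each machine can help narrow things down. The following is only a minimal sketch: the helper names `read_hosts` and `can_connect` are hypothetical, and the port you probe is your choice (for example 22 for ssh), since MXNet itself picks its ports at run time. A successful connection on one port does not prove that all ports are open.

```python
import socket


def read_hosts(path):
    """Return the non-empty, non-comment lines of a hosts file."""
    with open(path) as f:
        lines = [line.strip() for line in f]
    return [line for line in lines if line and not line.startswith("#")]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example usage (assumes a file named "hosts" and probes the ssh port):
# for host in read_hosts("hosts"):
#     print(host, "reachable" if can_connect(host, 22) else "NOT reachable")
```

If a host name from the file fails here but its IP address works, name resolution is the likely culprit.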


Thanks for your valuable answer!
I have tried opening all TCP ports and setting PS_VERBOSE to 2.

Now I can see that the logging call in train() is executed, and there are barrier commands from 2 servers and 2 workers with ranks 8, 9, 10 and 11 (just like the video shows).

However, when the training begins, I can see the log lines about decoding the data set, and then the program gets stuck.

Here is the last part of the logging info:

[05:08:12] src/van.cc:398: 10 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:183: Barrier count for 7 : 1
[05:08:12] src/van.cc:373: ? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:398: 1 => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:183: Barrier count for 7 : 2
[05:08:12] src/van.cc:398: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:183: Barrier count for 7 : 3
[05:08:12] src/van.cc:398: 11 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:183: Barrier count for 7 : 4
[05:08:12] src/van.cc:398: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:183: Barrier count for 7 : 5
[05:08:12] src/van.cc:373: ? => 9. Meta: request=0, timestamp=5, control={ cmd=BARRIER, barrier_group=1 }
[05:08:12] src/van.cc:373: ? => 11. Meta: request=0, timestamp=6, control={ cmd=BARRIER, barrier_group=1 }
[05:08:12] src/van.cc:373: ? => 8. Meta: request=0, timestamp=7, control={ cmd=BARRIER, barrier_group=1 }
[05:08:12] src/van.cc:373: ? => 10. Meta: request=0, timestamp=8, control={ cmd=BARRIER, barrier_group=1 }
[05:08:12] src/van.cc:373: ? => 1. Meta: request=0, timestamp=9, control={ cmd=BARRIER, barrier_group=1 }
[05:08:12] src/van.cc:398: 1 => 1. Meta: request=0, timestamp=9, control={ cmd=BARRIER, barrier_group=1 }
[05:08:12] src/van.cc:373: ? => 1. Meta: request=1, timestamp=10, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:398: 1 => 1. Meta: request=1, timestamp=10, control={ cmd=BARRIER, barrier_group=7 }
[05:08:12] src/van.cc:183: Barrier count for 7 : 1
INFO:root:test logger
INFO:root:[debug]running get_model()
INFO:root:[debug]Ready to call main()
INFO:root:[debug]Calling main
INFO:root:Calling train()
[05:08:12] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar/train.rec, use 1 threads for decoding…
[05:08:13] src/io/iter_image_recordio_2.cc:170: ImageRecordIOParser2: data/cifar/test.rec, use 1 threads for decoding…
[05:08:14] src/van.cc:398: 11 => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=4 }
[05:08:14] src/van.cc:183: Barrier count for 4 : 1
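
A log like this is easier to read once the barrier counts are summarized. The sketch below is only an illustration: it assumes the "Barrier count for G : N" line format shown above, and the function name `last_barrier_counts` is hypothetical. It reports the last count the scheduler printed for each barrier group. A count that stays below the number of nodes in that group suggests that at least one node never reached the barrier. (As far as I know, ps-lite encodes groups as 1 = scheduler, 2 = servers, 4 = workers and 7 = all nodes, but treat that mapping as an assumption.)

```python
import re

# Matches scheduler lines such as "src/van.cc:183: Barrier count for 7 : 5".
COUNT_RE = re.compile(r"Barrier count for (\d+) : (\d+)")


def last_barrier_counts(log_text):
    """Map each barrier group to the most recent count seen in the log."""
    counts = {}
    for line in log_text.splitlines():
        match = COUNT_RE.search(line)
        if match:
            group, count = int(match.group(1)), int(match.group(2))
            counts[group] = count  # later lines overwrite earlier ones
    return counts


# Example usage (assumes the log was saved to a file named "train.log"):
# with open("train.log") as f:
#     for group, count in sorted(last_barrier_counts(f.read()).items()):
#         print("group", group, "last count", count)
```

For the excerpt above, this reports a last count of 1 for group 4, which is where the log ends.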

Interesting. Does it work if you run the same code in a non-distributed environment?