Does SyncBatchNorm require peer access between GPUs?

jianchao-li · September 24, 2018, 12:58pm

Recently I ran into a deadlock while training PSPNet with gluon-cv using SyncBatchNorm layers. The program would hang at random points without throwing any error. I have been looking into the problem over the past few days and found a reply suggesting a sanity check with p2pBandwidthLatencyTest.

I ran the check and it printed the following:

Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

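So peer access is only available within two pairs: GPUs 0 and 1 can talk to each other, and GPUs 2 and 3 can talk to each other, but there is no peer access across the pairs. I tried running the program on GPUs 0 and 1 only, and it worked (at least, it is still running).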
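For anyone who wants to read the peer groups off such a report automatically, here is a minimal Python sketch. It only parses the "CAN / CANNOT Access Peer" lines shown above; the function name peer_groups is made up for illustration, and it assumes peer access is symmetric and groups devices by connectivity, which matches this report but may not hold on every machine.

```python
import re
from collections import defaultdict

# The peer-access lines copied from the p2pBandwidthLatencyTest output above.
REPORT = """\
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2
"""

LINE = re.compile(r"Device=(\d+) (CANNOT|CAN) Access Peer Device=(\d+)")


def peer_groups(report):
    """Return sorted lists of device ids that are linked by peer access.

    Hypothetical helper: treats a "CAN" line as an undirected link and
    returns the connected components of the resulting graph.
    """
    neighbours = defaultdict(set)
    for src, verdict, dst in LINE.findall(report):
        src, dst = int(src), int(dst)
        # Touch both keys so devices without any peer still show up.
        neighbours[src]
        neighbours[dst]
        if verdict == "CAN":
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    groups, seen = [], set()
    for device in sorted(neighbours):
        if device in seen:
            continue
        # Depth-first walk over the peer links starting from this device.
        stack, group = [device], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


print(peer_groups(REPORT))
```

On the report above this groups the four cards into the two pairs described in the post.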

So I would like to know: does the SyncBatchNorm layer require peer access between GPUs?

Hang_Zhang · September 24, 2018, 7:57pm

Hi, SyncBN does not require P2P access between GPUs, because the synchronization goes through the CPU. The communication volume is very small, since only the mean and variance are synchronized.
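To illustrate why so little data has to be exchanged, here is a rough Python sketch of how per-device statistics could be combined into a global mean and variance. This is not the gluon-cv implementation; the helper names local_statistics and combine_statistics are hypothetical, and it assumes each device reports a count, a sum, and a sum of squares per channel.

```python
def local_statistics(values):
    """Per-device summary for one channel: (count, sum, sum of squares)."""
    return len(values), sum(values), sum(v * v for v in values)


def combine_statistics(per_device):
    """Merge the per-device summaries into a global mean and variance.

    Assumption: per_device is a list of (count, sum, sum_of_squares)
    tuples, one per GPU. Only these three numbers per channel need to
    travel between devices, regardless of the batch or image size.
    """
    count = sum(c for c, _, _ in per_device)
    total = sum(s for _, s, _ in per_device)
    total_sq = sum(q for _, _, q in per_device)
    mean = total / count
    # Biased variance: E[x^2] - (E[x])^2 over the whole global batch.
    variance = total_sq / count - mean * mean
    return mean, variance


# Two devices, each holding its own slice of the batch for one channel.
shards = [[1.0, 2.0], [3.0, 4.0]]
mean, variance = combine_statistics([local_statistics(s) for s in shards])
print(mean, variance)
```

The result is the same as computing the mean and variance over the concatenated batch, which is the point of synchronizing the statistics.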

jianchao-li · September 25, 2018, 2:57am

Hello, @Hang_Zhang, thank you for your reply.
