Does SyncBatchNorm require peer access between GPUs?

Recently I ran into a deadlock while training PSPNet with gluon-cv using SyncBatchNorm layers. The program hung at random points without throwing any error. I have been looking into it over the past few days and found this reply, which suggested running p2pBandwidthLatencyTest as a sanity check.

I ran the check and got the following output:

Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

So only GPUs 0 and 1, and GPUs 2 and 3, can access each other as peers. I tried running the program on GPUs 0 and 1 only, and it worked (at least it is still running).
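For anyone who wants to read the pairing straight off the test output, here is a minimal sketch that groups devices by peer access. The function name peer_groups is hypothetical, and it assumes the report uses exactly the "Device=X CAN/CANNOT Access Peer Device=Y" lines shown above. It also treats peer access as transitive, which is a simplification.

```python
import re
from collections import defaultdict

def peer_groups(report):
    """Group device IDs that report peer access to each other.

    Hypothetical helper: parses lines of the form
    'Device=0 CAN Access Peer Device=1' and merges devices linked by CAN.
    """
    pattern = re.compile(r"Device=(\d+) (CANNOT|CAN) Access Peer Device=(\d+)")
    parent = {}

    def find(d):
        # Walk up to the representative of d's group, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for src, verdict, dst in pattern.findall(report):
        src, dst = int(src), int(dst)
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        if verdict == "CAN":
            # Merge the two groups (assumes access is transitive).
            parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for d in sorted(parent):
        groups[find(d)].append(d)
    return sorted(groups.values())
```

Passing the output above to peer_groups should give two groups, one with devices 0 and 1 and one with devices 2 and 3.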

So I would like to know: does the SyncBatchNorm layer require peer access between GPUs?

Hi, SyncBN does not require P2P access between GPUs, because the synchronization goes through the CPU. The communication volume is very small, since only the mean and variance are synchronized.
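To illustrate why so little data needs to be exchanged, here is a rough sketch of how batch statistics from several devices can be combined. This is not the gluon-cv implementation. The function name combine_batch_stats is made up for illustration, and it assumes each device contributes a count, a sum, and a sum of squares per channel, which is one common way to get a global mean and variance.

```python
def combine_batch_stats(per_device_values):
    """Combine per-device activations for one channel into a global mean and variance.

    Hypothetical sketch: per_device_values is a list with one list of floats
    per GPU. Only three numbers per device (count, sum, sum of squares)
    would need to be communicated.
    """
    total_count = 0
    total_sum = 0.0
    total_sq_sum = 0.0
    for values in per_device_values:
        # These three reductions are computed locally on each device.
        total_count += len(values)
        total_sum += sum(values)
        total_sq_sum += sum(v * v for v in values)

    mean = total_sum / total_count
    # Biased variance: E[x^2] - (E[x])^2
    var = total_sq_sum / total_count - mean * mean
    return mean, var
```

The result matches the mean and variance of the full batch as if all samples sat on a single device, regardless of how the samples are split across GPUs.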

Hello, @Hang_Zhang, thank you for your reply.