Does SyncBatchNorm require peer access between GPUs?


#1

Recently I ran into a deadlock while training PSPNet with gluon-cv using SyncBatchNorm layers. The program would hang at random points without throwing any error. I spent the past few days looking into the problem and found this reply, which suggested running p2pBandwidthLatencyTest as a sanity check.

I ran the check and it printed the following:

Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

So on my machine only GPUs 0 and 1 can talk to each other, and likewise GPUs 2 and 3. I tried running the program on GPUs 0 and 1 only, and it worked (at least it is still running).
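For anyone who wants to read the same grouping off their own output, here is a minimal sketch that parses the "CAN / CANNOT Access Peer" lines and groups the devices. The function name `peer_groups` is hypothetical, and the sketch assumes peer access is symmetric and can be treated as transitive, which holds for the output above but is not guaranteed on every machine.

```python
import re

# Sample copied from the p2pBandwidthLatencyTest output above.
SAMPLE = """\
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2
"""

LINE = re.compile(r"Device=(\d+) (CAN|CANNOT) Access Peer Device=(\d+)")


def peer_groups(report):
    """Group device ids into sets connected by 'CAN Access Peer' lines.

    Assumption: peer access is treated as transitive (union-find merge).
    """
    parent = {}

    def find(d):
        # Register unseen devices as their own root, then walk to the root.
        parent.setdefault(d, d)
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for line in report.splitlines():
        m = LINE.search(line)
        if not m:
            continue  # skip lines that are not peer-access lines
        a, verdict, b = int(m.group(1)), m.group(2), int(m.group(3))
        ra, rb = find(a), find(b)
        if verdict == "CAN" and ra != rb:
            parent[ra] = rb  # merge the two groups

    groups = {}
    for d in list(parent):
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    print(peer_groups(SAMPLE))
```

Devices that end up in the same group can be passed together as the training contexts if you want to restrict a run to one P2P-capable pair.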

So I would like to know: does the SyncBatchNorm layer require peer access between GPUs?


#2

Hi, SyncBN does not require P2P access between GPUs, because the synchronization is done on the CPU. The communication volume is very small, since only the mean and variance are synchronized.
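To illustrate why so little data has to be exchanged, here is a rough sketch of the idea in plain Python. It is not the actual gluon-cv code; the function names `partial_stats` and `global_mean_var` are made up for this example, and a single channel with lists of floats stands in for the per-GPU activations. Each device reduces its shard to three numbers (count, sum, sum of squares), and the host combines them into the global mean and variance.

```python
def partial_stats(values):
    """Per-device reduction: count, sum, and sum of squares for one channel."""
    return len(values), sum(values), sum(v * v for v in values)


def global_mean_var(partials):
    """Combine per-device partials on the host into a global mean and
    (biased) variance, using Var[x] = E[x^2] - E[x]^2."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    var = total_sq / n - mean * mean
    return mean, var


if __name__ == "__main__":
    # Hypothetical shards of one channel, one list per GPU.
    shards = [[1.0, 2.0], [3.0, 4.0]]
    mean, var = global_mean_var([partial_stats(s) for s in shards])
    print(mean, var)
```

In this sketch only three scalars per channel leave each device, regardless of batch or image size, which is why the transfer cost stays negligible even when it goes through host memory.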


#3

Hello, @Hang_Zhang, thank you for your reply.