Recently I ran into a deadlock while training PSPNet with gluon-cv using SyncBatchNorm layers. The program would hang at random points without throwing any error. I have been looking into the problem for the past few days and found this reply, which suggested doing a sanity check with p2pBandwidthLatencyTest.
I ran the check and got the following output:
Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2
From this output, only GPUs 0 and 1 can talk to each other, and likewise only GPUs 2 and 3. I tried running the program on GPUs 0 and 1 only, and it worked (at least it is still running).
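To make the grouping explicit, here is a minimal sketch that reads the "CAN / CANNOT Access Peer" lines from the output above and collects the devices that have mutual peer access. The helper name peer_groups and the SAMPLE string are my own for illustration; they are not part of the CUDA samples, MXNet, or gluon-cv. It also assumes the report lines look exactly like the ones pasted above, and it groups devices by connected components, which matches pairwise peer access only for simple topologies like this one.

```python
import re
from collections import defaultdict

# Peer-access lines copied from the p2pBandwidthLatencyTest output above.
SAMPLE = """\
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2
"""

PEER_LINE = re.compile(r"Device=(\d+) (CAN|CANNOT) Access Peer Device=(\d+)")


def peer_groups(report):
    """Hypothetical helper: group device IDs linked by mutual peer access."""
    devices = set()
    can_access = set()
    for src, verdict, dst in PEER_LINE.findall(report):
        a, b = int(src), int(dst)
        devices.update((a, b))
        if verdict == "CAN":
            can_access.add((a, b))

    # Union-find over pairs where access works in both directions.
    parent = {d: d for d in devices}

    def find(d):
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path compression
            d = parent[d]
        return d

    for a, b in can_access:
        if (b, a) in can_access:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for d in sorted(devices):
        groups[find(d)].append(d)
    return sorted(groups.values())


print(peer_groups(SAMPLE))  # prints [[0, 1], [2, 3]]
```

Each inner list is a set of GPUs that can reach each other directly, which is why I limited the run to GPUs 0 and 1.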
So I would like to know: does the SyncBatchNorm layer require peer access between the GPUs?