System crashes when running mxnet-ssd training on multiple GPUs


#1

Hi, everyone!

I tried to train the SSD architecture from the main repo (https://github.com/apache/incubator-mxnet) on a machine with multiple GPUs (four GTX 1080 Ti). The system is unstable with multiple GPUs: the computer simply power-cycles and restarts, and the system error logs are empty. When I reduce batch_size to 16 and use only two of the four GPUs, it runs stably but slowly.
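To make the stable configuration concrete, here is a minimal sketch of the corresponding invocation. It assumes the stock example/ssd/train.py from the repo with its --gpus and --batch-size flags; the helper name build_train_command is hypothetical and only for illustration.

```python
# Hypothetical helper: builds the command line for the SSD example's train.py.
# Assumption: the stock example/ssd/train.py, which accepts --gpus and --batch-size.
def build_train_command(gpu_ids, batch_size, script="train.py"):
    # --gpus takes a comma-separated list of device indices, e.g. "0,1"
    gpus = ",".join(str(g) for g in gpu_ids)
    return ["python", script, "--gpus", gpus, "--batch-size", str(batch_size)]


# The configuration that runs stably (but slowly): two of the four GPUs, batch size 16.
stable = build_train_command([0, 1], 16)
print(" ".join(stable))
```

Passing [0, 1, 2, 3] as gpu_ids gives the four-GPU variant that triggers the reboots.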

The machine configuration is as follows:

  • CPU Intel Core X-series i9-7900X
  • Gigabyte GTX 1080 Ti Founders Edition, PCI Express, 11 GB GDDR5X, 352-bit - 4 pieces
  • ASUS ROG RAMPAGE VI EXTREME
  • DDR4 16 GB Patriot Memory - 4 pieces
  • W0431RE Thermaltake Baikal 1500W
  • SSD 1 TB Samsung MZ-V6P1T0BW
  • SSD 512 GB Samsung 960

All GPU burn-in tests pass: under maximum memory load and temperature, all four GPUs run stably for hours.

I am now trying to run some Caffe SSD training scripts to find out whether the problem is framework-specific.

What can cause such hard reboots? Any clues?

Cheers!