So it looks like the issue is related to the data loading.
After profiling the code, I spotted large intermittent gaps in the processing of the batches. According to the profiler, the backend (C++) wasn’t doing anything during those gaps, which indicates that commands aren’t being queued fast enough by the frontend (Python). The usual cause of this is slow data loading or processing.
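For reference, this is roughly how you can collect such a trace with the built-in MXNet profiler; a minimal sketch, where the output filename and the profiled region are placeholders:

```python
import mxnet as mx

# Record all operators, memory and API calls, and dump the trace to a
# JSON file that can be inspected in chrome://tracing.
mx.profiler.set_config(profile_all=True,
                       aggregate_stats=True,
                       filename='profile_output.json')

mx.profiler.set_state('run')    # start recording
# ... run a few training batches here ...
mx.nd.waitall()                 # wait for all asynchronous work to finish
mx.profiler.set_state('stop')   # stop recording

print(mx.profiler.dumps())      # print the aggregated statistics
```

Gaps in the trace where no backend operators are running are the telltale sign that the Python frontend (usually the data pipeline) is the bottleneck.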
I was able to speed up training significantly by increasing `num_workers` to 8, and we get back to the usual situation where hybridization improves training speed!
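In case it helps, the relevant change is just the `num_workers` argument on the `DataLoader`. A minimal sketch below; the dataset, model, and batch size are placeholders, not the ones from your code:

```python
import mxnet as mx
from mxnet import gluon

# Placeholder dataset and batch size, just for illustration.
dataset = gluon.data.vision.FashionMNIST(train=True).transform_first(
    gluon.data.vision.transforms.ToTensor())
batch_size = 128

# More worker processes keep the GPU fed; 8 worked well on a p3.2xlarge.
train_data = gluon.data.DataLoader(dataset,
                                   batch_size=batch_size,
                                   shuffle=True,
                                   num_workers=8)

# Placeholder model; hybridize() compiles the graph for faster execution.
net = gluon.model_zoo.vision.resnet18_v1(pretrained=False)
net.initialize(ctx=mx.gpu())
net.hybridize()
```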
Still, there’s a very strange bug occurring with `num_workers=2`, which could explain what was happening before. I think it’s related to https://github.com/apache/incubator-mxnet/issues/13126 and https://github.com/apache/incubator-mxnet/pull/13318. Adding hybridization made the network faster, but this put extra strain on the dataloader, which led to a multiprocessing clash somewhere along the way, making it slower than without hybridization.
My results from running your code on an AWS EC2 p3.2xlarge instance (time of the 1st epoch):
| Mode           | `num_workers` | Time (s) |
|----------------|---------------|----------|
| Non-Hybridized | 0             | 19.59    |
| Hybridized     | 0             | 18.26    |
| Non-Hybridized | 2             | 9.76     |
| Hybridized     | 2             | 13.92    |
| Non-Hybridized | 8             | 8.90     |
| Hybridized     | 8             | 7.25     |
So overall, you should be able to get around a 2x speedup with `num_workers=8` compared to `num_workers=2` for the hybridized network.