Distributed Gluon HybridBlock is much much slower than Symbol

This is an ongoing thread with a few people on the MXNet team, but distributed Gluon HybridBlock training is much slower than keras-mxnet. Benchmarking an MLP, we see roughly a 4x slowdown with Gluon compared to Keras:

For Keras, an entire epoch takes about 60s using 8 GPUs:

48641254/48641254 [==============================] - 62s - loss: 0.4282 - val_loss: 0.3570 
Epoch 2/10 
48641254/48641254 [==============================] - 61s - loss: 0.4074 - val_loss: 0.3546 
Epoch 3/10 
48641254/48641254 [==============================] - 61s - loss: 0.4058 - val_loss: 0.3537 
Epoch 4/10 
48641254/48641254 [==============================] - 61s - loss: 0.4048 - val_loss: 0.3533 

For Gluon, 1000 batches take about 224s using 8 GPUs with proper hybridization:

Epoch [0]: Interval [0/6000] Train-QLMeanMetric:  Speed: 53.17s
Epoch [0]: Interval [1000/6000] Train-QLMeanMetric:  Speed: 224.43s
Epoch [0]: Interval [2000/6000] Train-QLMeanMetric:  Speed: 227.38s

Before we get into code, is there any reference implementation for a distributed HybridBlock other than this one? We've tried a number of things, but none seem to work, including inheriting the loss from gluon.loss and overriding the hybrid_forward function.
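
As a concrete sketch of what we mean by subclassing the loss (illustrative only, not our exact code; the class name and the L1-style loss inside are placeholders):

from mxnet.gluon.loss import Loss

class CustomLoss(Loss):  # hypothetical name, for illustration
    def __init__(self, weight=None, batch_axis=0, **kwargs):
        super(CustomLoss, self).__init__(weight, batch_axis, **kwargs)

    def hybrid_forward(self, F, pred, label):
        # F resolves to mx.nd or mx.sym depending on whether the block is hybridized
        loss = F.abs(pred - label)
        return F.mean(loss, axis=self._batch_axis, exclude=True)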

Is there a source or documentation on how to debug the massive slowdown and find the bottleneck?
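
To make the question concrete, this is the kind of profiler-based approach we are hoping applies (a rough sketch; the output filename and the profiled window are arbitrary, and it assumes a reasonably recent MXNet):

import mxnet as mx

mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='gluon_profile.json')
mx.profiler.set_state('run')
# ... run a few hundred training batches here ...
mx.nd.waitall()                 # wait for all asynchronous work to finish
mx.profiler.set_state('stop')
print(mx.profiler.dumps())      # aggregated per-operator statistics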

Did you call hybridize?

net.hybridize()    # compile the network's graph so it runs through the symbolic backend
loss.hybridize()   # hybridize the loss block as well
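
Both calls go once, before the training loop. A minimal sketch of where they sit in a typical multi-GPU Gluon loop (the tiny MLP, the L2 loss, and train_iter are placeholders rather than your actual setup, and the static_alloc/static_shape flags assume MXNet >= 1.3):

import mxnet as mx
from mxnet import gluon, autograd

ctx = [mx.gpu(i) for i in range(8)]                    # same 8-GPU setup as above

net = gluon.nn.HybridSequential()                      # stand-in MLP, not your real model
with net.name_scope():
    net.add(gluon.nn.Dense(256, activation='relu'))
    net.add(gluon.nn.Dense(1))
net.initialize(mx.init.Xavier(), ctx=ctx)
net.hybridize(static_alloc=True, static_shape=True)    # cache the graph and its memory plan

loss_fn = gluon.loss.L2Loss()                          # placeholder loss
loss_fn.hybridize()

trainer = gluon.Trainer(net.collect_params(), 'adam')

for data, label in train_iter:                         # train_iter is a placeholder data loader
    data_parts = gluon.utils.split_and_load(data, ctx)
    label_parts = gluon.utils.split_and_load(label, ctx)
    with autograd.record():
        losses = [loss_fn(net(x), y) for x, y in zip(data_parts, label_parts)]
    for l in losses:
        l.backward()
    trainer.step(data.shape[0])

mx.nd.waitall()                                        # drain the async engine before reading any timers

One thing to watch when timing: MXNet executes asynchronously, so without mx.nd.waitall() (or an explicit value read) the per-interval speeds can attribute work to the wrong part of the loop.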