Gluon implementation much slower than Symbolic

My symbolic (mxnet.module) implementation runs at:
Speed: 716.69 samples/sec rmse=0.176675

While my Gluon implementation runs at (after calling hybridize()):
speed: 267.994337 samples/s, training: rmse=0.2200

Is there anything I can do to speed up the Gluon version?

There is no reason why Gluon should run slower, unless something isn't configured properly. Without knowing the details of your implementation, I can only give you some tips:

  • Make sure your network is captured in a HybridBlock end to end
  • If using DataLoader, make sure num_workers is set to your number of CPUs available
  • When calling hybridize(), set static_alloc=True and static_shape=True (i.e. hybridize(static_alloc=True, static_shape=True)); a sketch of this and the DataLoader setting follows this list
  • Monitor your CPU/GPU usage in both module and Gluon modes to get insight into what might be causing the slowdown.
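
For the DataLoader and hybridize() points, a minimal sketch looks like this (net, train_dataset and batch_size are placeholders, not from your code):

from multiprocessing import cpu_count
from mxnet import gluon

# Use all available CPUs to keep the GPU fed; num_workers defaults to 0 (single process).
train_data = gluon.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, num_workers=cpu_count())

# Cache the computation graph with static memory allocation and static shapes.
net.hybridize(static_alloc=True, static_shape=True)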

Could you share your implementation?

Here is my code:
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.model_zoo import vision as models

def build_net(hyper_params):
    # Pretrained ResNet-152 backbone; only its feature extractor is reused.
    res_net = models.resnet152_v1(pretrained=True)
    new_net = gluon.nn.HybridSequential()

    with new_net.name_scope():
        pretrained_features = res_net.features

        # New fully connected head, trained from scratch.
        new_tail = gluon.nn.HybridSequential()
        new_tail.add(
            gluon.nn.Dense(hyper_params.NUM_HIDDENS1, activation=hyper_params.ACTIVATION),
            gluon.nn.Dropout(hyper_params.DROPOUT),
            gluon.nn.Dense(hyper_params.NUM_HIDDENS2, activation=hyper_params.ACTIVATION),
            gluon.nn.Dropout(hyper_params.DROPOUT),
            gluon.nn.Dense(hyper_params.NUM_OUTPUTS)
        )
        new_tail.initialize(mx.init.Xavier(magnitude=hyper_params.MAGNITUDE))

        new_net.add(
            pretrained_features,
            new_tail
        )
    return new_net

bin_net = build_net(hyper_params)
bin_net.hybridize()
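
For completeness, hybridize() is lazy and the cached graph is only built on the first forward call, so a dummy forward pass is one way to confirm the whole net runs end to end as a single HybridBlock. A minimal sketch (the input shape is just an assumption for 224x224 RGB images):

import mxnet as mx

# Trigger graph construction once with a dummy batch; adjust the shape to your data.
dummy = mx.nd.random.uniform(shape=(1, 3, 224, 224))
_ = bin_net(dummy)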

This looks OK to me. Did you try what @safrooze suggested?

  • Make sure your network is captured in a HybridBlock end to end
  • If using DataLoader, make sure num_workers is set to your number of CPUs available
  • When calling hybridize() set static_alloc=True and static_shape=True (i.e. hybridize(static_alloc=True, static_shape=True))
  • Monitor your CPU/GPU usage in both module and gluon modes to get insight into what might be causing the slowdown.

Does not work for me:

MXNetError: Cannot find argument ‘static_shape’, Possible Arguments:

inline_limit : int (non-negative), optional, default=2
Maximum number of operators that can be inlined.
forward_bulk_size : int (non-negative), optional, default=15
Segment size of bulk execution during forward pass.
backward_bulk_size : int (non-negative), optional, default=15
Segment size of bulk execution during backward pass.

The training code:
from mxnet import autograd

def forward_backward(net, data, label, metric):
    # data and label are lists (e.g. one slice per device); `loss` is the loss
    # function defined elsewhere.
    losses, outputs = [], []
    with autograd.record():
        for X, Y in zip(data, label):
            Z = net(X)
            losses.append(loss(Z, Y))
            outputs.append(Z)
    for l in losses:
        l.backward()
    metric.update(label, outputs)
    return losses
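
Since data and label are lists, a typical calling pattern for a function like this splits each batch across devices with gluon.utils.split_and_load (a sketch; train_data, ctx_list, trainer and batch_size are stand-ins, not the exact code):

from mxnet import gluon

for data, label in train_data:
    # One slice per device, e.g. ctx_list = [mx.gpu(0), mx.gpu(1)]
    data_slices = gluon.utils.split_and_load(data, ctx_list)
    label_slices = gluon.utils.split_and_load(label, ctx_list)
    losses = forward_backward(net, data_slices, label_slices, metric)
    trainer.step(batch_size)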

I am using MXNet 1.1; could that be the cause of the problem?

static_alloc is only on the master branch, which will be released soon as part of the 1.3.0 release. However, static_alloc only closes the last ~10% of the gap between Gluon and Symbolic. How big a difference in performance are you seeing?

Gluon is less than half as fast as Symbolic. :frowning:

If you are using 1.1, then there is:

  • No multi-worker support in DataLoader, so I/O will be the bottleneck
  • None of the memory-related optimizations that went into 1.2 and 1.3

I/O is most likely the root cause.
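
A quick way to confirm that is to time one pass over the DataLoader by itself and compare it with a full training epoch; if the two numbers are close, the GPU is being starved by I/O. A sketch (train_data is a stand-in for your loader):

import time

start = time.time()
for data, label in train_data:
    # Force the batch to actually be produced, without running the network.
    data.wait_to_read()
print('data pipeline only: %.1f s per epoch' % (time.time() - start))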