I’m currently testing a pre-trained ResNet101 model, and I’m interested in profiling it and deploying it on CPU only. So my focus is on optimizing the model as much as possible.

I’ve installed MXNet with MKL support and everything seems to work properly. But whenever I set MXNET_SUBGRAPH_BACKEND=MKLDNN, inference time increases almost threefold compared to inference without subgraph optimization.
Moreover, I’ve noticed that disabling one specific type of fusion (MXNET_DISABLE_MKLDNN_FUSE_CONV_BN=1) solves the issue. Of course, with that fusion disabled, there’s no improvement in inference time either.

What could be the root cause of this issue?

I am not entirely sure why it happens; it could be a bug. But if you are interested in optimizing your model as much as possible, take a look at the TVM project. It can optimize MXNet models and it supports operator fusion. I have seen very nice performance improvements, especially on CPUs based on the Skylake architecture (c5 instances at AWS).

Take a look at this example. It optimizes the model for CUDA, but you can change the target to CPU. Depending on your processor architecture, you may want to try one of the following:

target = "llvm -mcpu=skylake-avx512" # that will work with c5 instances of AWS


target = "llvm -mcpu=core-avx2" # that will work with c4 instances of AWS

assuming that you have compiled TVM with LLVM support.
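If you are not sure which of the two applies to your machine, a rough sketch like the one below could choose between them from the CPU feature flags. The helper names `pick_llvm_target` and `read_cpu_flags` are hypothetical, reading `/proc/cpuinfo` is an assumption that only holds on Linux, and the fallback to plain `llvm` is my own choice; the two `-mcpu` strings are the ones above.

```python
# Sketch only: helper names are made up for illustration, not part of TVM.

# AVX-512 subsets that the skylake-avx512 CPU model expects
SKYLAKE_AVX512_FLAGS = {"avx512f", "avx512bw", "avx512dq", "avx512vl", "avx512cd"}


def pick_llvm_target(cpu_flags):
    """Return an LLVM target string for the given CPU feature flags."""
    flags = set(cpu_flags)
    if SKYLAKE_AVX512_FLAGS <= flags:
        return "llvm -mcpu=skylake-avx512"  # e.g. c5 instances of AWS
    if "avx2" in flags:
        return "llvm -mcpu=core-avx2"  # e.g. c4 instances of AWS
    return "llvm"  # generic fallback (assumption)


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read the feature flags of the first CPU listed (Linux x86 only)."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return line.split(":", 1)[1].split()
    except OSError:
        pass  # file not available, e.g. on macOS or Windows
    return []


target = pick_llvm_target(read_cpu_flags())
```

The resulting `target` string can then be passed wherever the example expects the target.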

Hi Sergey, thank you for your answer!
I’ve actually been working with TVM already, and I’m facing some issues there as well:

  • when using the NNVM compiler and graph optimization, performance is the same as with mxnet-mkl
  • when using the new Relay instead, I face an issue similar to the one mentioned in this topic: when trying to fuse nodes (opt_level=3), inference time increases almost threefold

So I’m wondering if these two problems are related or not.
By the way, I’m working with a ResNet101 model. When using, for example, a MobileNet based model, the issue disappears and there’s actually an improvement in performance.

Any suggestions for other frameworks I should try to speed up my model?

Unfortunately, I am not aware of anything else that would improve model speed on CPU. But I guess the TVM community would be surprised by the results you are getting… Can you provide a small reproducible example?


Download the model from here and copy it to your working directory.
Make sure mxnet-mkl is installed in your Python environment. Then run the following code in the two different scenarios (with and without MXNET_SUBGRAPH_BACKEND=MKLDNN):

import mxnet as mx
import time

dev = mx.cpu()
num_batches = 10
model = 'model'

# load model
sym, arg_params, aux_params = mx.model.load_checkpoint(model, 0)
mod = mx.mod.Module(symbol=sym, context=dev)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 112, 112))], inputs_need_grad=False)
mod.set_params(arg_params, aux_params, allow_missing=True)

# get data
data = [mx.random.uniform(-1.0, 1.0, shape=shape, ctx=dev) for _, shape in mod.data_shapes]
batch = mx.io.DataBatch(data, [])  # empty label

# run
dry_run = 5  # use 5 iterations to warm up
for i in range(dry_run+num_batches):
    if i == dry_run:
        tic = time.time()
    mod.forward(batch, is_train=False)
    for output in mod.get_outputs():
        output.wait_to_read()  # block until the asynchronous forward pass has finished

# return time x image
timeximg = (time.time() - tic)/num_batches
print('Average img time: {}s'.format(timeximg))
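
To compare the scenarios without editing the shell environment by hand, something like the sketch below could launch the script once per configuration. It assumes the code above is saved as `bench.py` (a hypothetical file name), and the helper names `scenario_envs` and `run_all` are also made up for illustration. The two environment variables are the ones mentioned at the top of the thread.

```python
import os
import subprocess
import sys

# Environment variables discussed in this thread
SUBGRAPH_VAR = "MXNET_SUBGRAPH_BACKEND"
NO_CONV_BN_VAR = "MXNET_DISABLE_MKLDNN_FUSE_CONV_BN"


def scenario_envs(base_env=None):
    """Build one environment dict per scenario to compare."""
    base = dict(os.environ if base_env is None else base_env)
    # Start from a clean baseline so leftover settings do not skew the comparison
    base.pop(SUBGRAPH_VAR, None)
    base.pop(NO_CONV_BN_VAR, None)
    subgraph = dict(base)
    subgraph[SUBGRAPH_VAR] = "MKLDNN"
    no_conv_bn = dict(subgraph)
    no_conv_bn[NO_CONV_BN_VAR] = "1"
    return {
        "baseline": base,
        "subgraph": subgraph,
        "subgraph_no_conv_bn_fusion": no_conv_bn,
    }


def run_all(script="bench.py"):
    """Run the benchmark script (assumed file name) once per scenario."""
    for name, env in scenario_envs().items():
        print("== {} ==".format(name))
        subprocess.run([sys.executable, script], env=env, check=True)
```

Calling `run_all()` runs each scenario in a fresh Python process, which matters here on the assumption that these variables are read when MXNet builds the graph rather than re-read later in a running process.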