Poor inference latency with MKLDNN on CPU

We are upgrading our model hosting production system from MXNet 1.1 (built with MKL but without MKLDNN) to MXNet 1.3 (built with both MKL and MKLDNN, and with MXNET_MKLDNN_ENABLED=1 as the default). We noticed a significant latency degradation from 20 ms/doc to 100 ms/doc with MKLDNN enabled. After disabling MKLDNN by setting MXNET_MKLDNN_ENABLED=0, latency went back to 20 ms/doc.
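
For reference, here is roughly how we toggle the flag in our hosting process (a minimal sketch; to be safe we set the variable before mxnet is imported, since the backend reads it from the environment):

    import os

    # Set before importing mxnet so the MKLDNN backend picks it up.
    # "0" disables MKLDNN operators, "1" (the build default) enables them.
    os.environ["MXNET_MKLDNN_ENABLED"] = "0"

    import mxnet as mx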

We also tested the LSTM-PTB model (from the mxnet-model-server examples) in the same environment. With that model, MKLDNN actually improved latency instead of making it worse. We therefore suspect the latency problem we saw is due to how we built our model.

Our model uses the following Symbol operators, which seemingly are not supported by MKLDNN (according to http://intel.github.io/mkl-dnn/):

mx.sym.SliceChannel()
mx.sym.ElementWiseSum()
mx.sym.Activation(..., act_type="sigmoid")
mx.sym.broadcast_mul()
mx.sym.Dropout()

We also use other operators such as mx.sym.Concat(), mx.sym.FullyConnected(), and mx.sym.SoftmaxActivation(), but those seem to be supported by MKLDNN, so we assume they shouldn't be a problem.
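
In case it helps, this is roughly how we enumerated the operators in our graph to produce the lists above (a minimal sketch; net is a stand-in for our proprietary output symbol):

    import json
    import mxnet as mx

    def list_ops(sym):
        """Return the distinct operator names used in a Symbol graph."""
        nodes = json.loads(sym.tojson())["nodes"]
        # "null" nodes are variables (inputs/weights), not operators.
        return sorted({n["op"] for n in nodes if n["op"] != "null"})

    # `net` stands in for our real output symbol.
    net = mx.sym.SoftmaxActivation(
        mx.sym.FullyConnected(mx.sym.Variable("data"), num_hidden=10))
    print(list_ops(net))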

My questions are:

  1. Is the latency issue caused by our model using operators such as mx.sym.broadcast_mul() that are seemingly not supported by MKLDNN?
  2. In the short term, should we just disable MKLDNN and use MKL instead?
  3. In the long term, should we expect something like https://github.com/apache/incubator-mxnet/issues/13598 to fix our problem by using MKLDNN for the supported operators and falling back to MKL for the others?

Thanks!

Thank you for reporting this. @danithaca Would you mind providing a reproducer for this issue? From the description, it seems the problem is not related to issue #13598.

@TaoLv - Unfortunately our model/code is proprietary and we cannot share it here. Sorry about that.

But regardless, if an operator is not supported by MKLDNN but MKLDNN is enabled, what happens when that operator executes? Would using the operator hurt performance?

I'm not sure what “not supported by MKL-DNN” means here. If it means an operator is not supported by MKL-DNN at the primitive level (e.g. dropout), it will run the original CPU implementation. If it means an operator is not supported by MKL-DNN at the implementation level (e.g. 3D Convolution is not enabled for the MKL-DNN backend yet), that is checked at runtime and the operator falls back to the original CPU implementation of 3D Convolution.

The only performance concern here is that if an MKL-DNN operator (e.g. Conv2D) is followed by a non-MKL-DNN operator (either not supported at the primitive level or not supported at the implementation level), there may be (not always) a reorder operation to change the data format from the MKL-DNN internal format to the MXNet default format (e.g. NCHW). This reorder operation takes some time, but it is usually negligible compared with the gain from MKL-DNN Conv2D.
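
If you want to see whether such reorders actually show up in your model, you can use the built-in profiler and look for reorder entries in the trace (a rough sketch; mod and data_batch stand for your already-bound Module and one input batch):

    import mxnet as mx

    # Dump per-operator timings (including any MKL-DNN reorder ops) to a Chrome trace file.
    mx.profiler.set_config(profile_all=True, filename="mkldnn_profile.json")
    mx.profiler.set_state("run")

    mod.forward(data_batch, is_train=False)
    mx.nd.waitall()  # wait so all asynchronous work is captured

    mx.profiler.set_state("stop")
    mx.profiler.dump()  # open the JSON file in chrome://tracing to inspect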

That’s why I need a reproducer or proxy demo case to reproduce the issue.

Thanks @TaoLv. That helps a lot. Our code was written in mid-2016, and we implemented our own version of a vanilla LSTM. We didn't take advantage of the newer RNN symbolic API or the Gluon RNN API, which might have better-performing implementations. I checked our code, and we do have operators such as mx.sym.Dropout(), mx.sym.SliceChannel(), etc. (assuming they are non-MKL-DNN operators at either the primitive level or the implementation level) mixed with operators such as mx.sym.Concat() and mx.sym.FullyConnected() (assuming they are MKL-DNN operators). I imagine the reorder operation could happen a lot in our case.

Also, we don't use any Convolution, and we don't use the MKL-DNN LSTM operator (as mentioned, we implemented our own LSTM cell and unrolling using mxnet SliceChannel, Concat, sigmoid activation, etc.). I guess that means the gain from using MKL-DNN would not offset the cost of the reorder operations.
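
For context, our unrolled cell is essentially the standard hand-written symbolic LSTM step, something like the simplified sketch below (not our actual code; names and shapes are illustrative only):

    import mxnet as mx

    def lstm_step(num_hidden, x, prev_h, prev_c, prefix):
        """One hand-rolled LSTM step built from basic Symbol operators."""
        i2h = mx.sym.FullyConnected(data=x, num_hidden=4 * num_hidden, name=prefix + "_i2h")
        h2h = mx.sym.FullyConnected(data=prev_h, num_hidden=4 * num_hidden, name=prefix + "_h2h")
        gates = i2h + h2h                                   # elementwise sum of gate pre-activations
        sliced = mx.sym.SliceChannel(gates, num_outputs=4, name=prefix + "_slice")
        in_gate = mx.sym.Activation(sliced[0], act_type="sigmoid")
        forget_gate = mx.sym.Activation(sliced[1], act_type="sigmoid")
        in_transform = mx.sym.Activation(sliced[2], act_type="tanh")
        out_gate = mx.sym.Activation(sliced[3], act_type="sigmoid")
        next_c = forget_gate * prev_c + in_gate * in_transform   # elementwise products
        next_h = out_gate * mx.sym.Activation(next_c, act_type="tanh")
        return next_h, next_c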

My last question is: when you say “fall back to the original CPU implementation”, does that mean the operator still uses MKL (although not MKL-DNN), or does it not use MKL at all? We did build MXNet with MKL.

Thanks.

Sorry for the late response, @danithaca. Do you think https://github.com/apache/incubator-mxnet/blob/master/example/rnn/bucketing/lstm_bucketing.py is representative of your model? If so, we can do some basic analysis on it.
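
If that example is close enough, a simple latency measurement like the sketch below, run once with MXNET_MKLDNN_ENABLED=1 and once with 0, should tell us whether the regression reproduces there (mod and data_batch again stand for a bound Module and one input batch):

    import time
    import mxnet as mx

    def measure_latency(mod, data_batch, warmup=10, iters=100):
        """Average forward latency in milliseconds over `iters` runs."""
        for _ in range(warmup):
            mod.forward(data_batch, is_train=False)
            mx.nd.waitall()
        start = time.time()
        for _ in range(iters):
            mod.forward(data_batch, is_train=False)
            mx.nd.waitall()  # block until the forward pass really finishes
        return (time.time() - start) / iters * 1000.0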

Given that your model doesn't use any Convolution layer, I would not expect any format reorder to happen there. So for me:

  • FullyConnected, concat, activation might run into MKL-DNN
  • dropout, SliceChannel, elementwise_add/mul will run into original CPU implementation
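
If you want to confirm which operators actually hit MKL-DNN primitives, you can also try the library's own verbose mode (note this is an MKL-DNN environment variable, not an MXNet one; I am assuming the MKL-DNN version bundled with your MXNet 1.3 build supports it):

    import os

    # MKLDNN_VERBOSE is read by the MKL-DNN library itself: every primitive it executes
    # (reorder, inner_product, eltwise, ...) is printed to stdout together with its run time.
    # Operators that stay on the default CPU path print nothing.
    os.environ["MKLDNN_VERBOSE"] = "1"

    import mxnet as mx
    # ... build/bind your model and run one forward pass, then inspect the printed lines.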

My last question is: when you say “fall back to the original CPU implementation”, does that mean the operator still uses MKL (although not MKL-DNN), or does it not use MKL at all? We did build MXNet with MKL.

If the operator is implemented with MKL BLAS, then yes, it will fall back to the MKL BLAS implementation when MKL-DNN is not supported.