MXNet prediction on Docker


#1

Hi all.

I am seeing a significant decrease in MXNet prediction performance (a 2.5-4.5x increase in time for 1000 images) when running inside a Docker container (mxnet/python:1.1.0 with the default configuration). Could you point me to documentation on best practices for deploying the mxnet/python Docker container from a performance point of view?

For details of tested cases, see https://github.com/apache/incubator-mxnet/issues/11139


#2

Are you using the mxnet-mkl version of MXNet? How are you measuring execution time? I find that running in a container can sometimes add a bit of overhead on the first run, for example while some Python libraries are being compiled. Are you measuring with time.time() in the script or from the shell?
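To separate one-off start-up cost from steady-state prediction time, the measurement can be taken inside the script instead of from the shell. Below is a minimal sketch, assuming a hypothetical `predict` callable that stands in for the real MXNet forward pass. The names `time_predictions`, `predict` and `warmup` are illustrative and not part of MXNet.

```python
import time


def time_predictions(predict, inputs, warmup=1):
    """Return (warmup_seconds, steady_seconds) for calling predict on inputs.

    `predict` is a hypothetical stand-in for the real model call.
    The first `warmup` calls are timed separately so that one-off costs
    (lazy imports, library initialisation) do not distort the main figure.
    """
    inputs = list(inputs)

    start = time.perf_counter()
    for item in inputs[:warmup]:
        predict(item)
    warmup_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for item in inputs[warmup:]:
        predict(item)
    steady_seconds = time.perf_counter() - start

    return warmup_seconds, steady_seconds


if __name__ == "__main__":
    # Dummy workload in place of a real model, for illustration only.
    warm, steady = time_predictions(lambda x: sum(range(x)), [10_000] * 100)
    print(f"warm-up: {warm:.4f}s, steady state: {steady:.4f}s")
```

With MXNet, `predict` should block until the result is actually available (for example by calling `.asnumpy()` on the output), because MXNet executes operations asynchronously and the timer would otherwise stop too early.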


#3

Hi @ThomasDelteil,
Thank you for your response.

I use the Linux time utility from the shell (real time). But I don’t think it matters, because the difference is not seconds or milliseconds, it is minutes. I see approximately the same duration of elevated CPU usage in the Datadog monitoring for this server.

It is also not about first-time execution. I repeated the tests several times both inside and outside the Docker container.

Regarding mxnet-mkl, I’m not sure, but I think not. Inside Docker it is the version used by mxnet/python:1.1.0. Outside Docker it was installed with the command pip3 install mxnet==1.1.0

Here are the results of the MXNet diagnose utility:
Pure OS - https://gist.github.com/olga-gorun/895bd09c39aae363121b3836764c0d4e#file-pure_os_diagnoze_results-txt
Inside docker - https://gist.github.com/olga-gorun/895bd09c39aae363121b3836764c0d4e#file-docker_diagnoze_results-txt

Could you suggest a better way to find out whether I am using mxnet-mkl?


#4

Looking at the docs, I don’t think we publish an mxnet-mkl Docker image. You can read this post on using MKL-DNN with MXNet: https://medium.com/apache-mxnet/accelerating-deep-learning-on-cpu-with-intel-mkl-dnn-a9b294fb0b9. You can install it with pip install mxnet-mkl.
Can you try this image: 1.2.0_cpu and see if you get a similar performance drop?

I will try to reproduce and see if I get the same performance issues. Did you use a specific benchmark script for your numbers? If yes can you point me to it?


#5

Yes. There are 4 scripts with different scenarios. Let’s speak, for example, about

Without Docker it took 7.99 min; inside Docker it took 36.62 min (average numbers).

You can find more details about these results here:

look at ‘sequential one by one’ scenario results.

I’ll look at the link you provided and check the 1.2.0_cpu image.
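As a quick sanity check, the two averages quoted in this post can be turned into a slowdown factor. This is a small sketch that uses only those two numbers; the variable names are illustrative.

```python
# Average wall-clock times reported in this post, in minutes.
native_minutes = 7.99
docker_minutes = 36.62

# Slowdown factor of the Docker run relative to the run without Docker.
slowdown = docker_minutes / native_minutes
print(f"Docker run is about {slowdown:.1f}x slower")
```

The result is close to the upper end of the 2.5-4.5x range mentioned in the first post.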


#6

You can also check to see whether you have mxnet-mkl by using the ldd command on libmxnet.so.

If you are using MKL you should see something like this in the output:

...
libmklml_intel.so => /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/mxnet/./libmklml_intel.so (0x00007fb7d19ac000)
...

libmxnet.so is in the site-packages folder for mxnet, i.e. the directory that contains the file reported by:

import mxnet as mx
mx.__file__

You are also using libmkldnn if you see that listed.
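The manual check above can also be scripted. The following is a rough sketch, assuming a Linux system where `ldd` is available and that the MKL libraries show up under names starting with `libmklml` or `libmkldnn`, as in the sample output. The helper names `find_libmxnet` and `mkl_libs_in_ldd_output` are hypothetical.

```python
import importlib.util
import os
import subprocess


def find_libmxnet():
    """Return the path to libmxnet.so next to the installed mxnet package, or None."""
    spec = importlib.util.find_spec("mxnet")
    if spec is None or spec.origin is None:
        return None
    candidate = os.path.join(os.path.dirname(spec.origin), "libmxnet.so")
    return candidate if os.path.exists(candidate) else None


def mkl_libs_in_ldd_output(ldd_output):
    """Return the sorted names of MKL-related libraries listed in ldd output."""
    found = set()
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        name = parts[0]
        # Assumption: MKL builds link libraries named libmklml* or libmkldnn*.
        if name.startswith("libmklml") or name.startswith("libmkldnn"):
            found.add(name)
    return sorted(found)


if __name__ == "__main__":
    lib = find_libmxnet()
    if lib is None:
        print("libmxnet.so not found; is mxnet installed in this environment?")
    else:
        result = subprocess.run(["ldd", lib], capture_output=True, text=True, check=True)
        libs = mkl_libs_in_ldd_output(result.stdout)
        print("MKL libraries linked:", libs if libs else "none")
```

Running the same script on the host and inside the container gives a direct comparison of the two builds.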

Another reference that might be useful is the set of Docker containers used by AWS SageMaker (which runs MXNet inside Docker containers). You can find their Dockerfiles in this repository.


#7

Here are my updates.

  1. The current installations were definitely without MKL (@thomelane, thank you for the way to check it).
  2. Changing the MXNet version to 1.2.0 doesn’t help (almost the same difference).
  3. Many thanks for the idea about MKL-DNN. It drastically changes performance in both the pure OS and Docker variants, and the difference no longer exists in this case (in both cases I installed mxnet-mkl --pre, which corresponds to 1.3.0 with MKL-DNN). I tested with the bulk.py script.

So the problem for production is solved. But just to understand the situation better: any suggestions as to what could explain this difference? @ThomasDelteil, did you have a chance to run your tests and see whether you can reproduce this behaviour?


#8

Thanks @olga.gorun for following up. I was indeed able to reproduce your performance issue.

I asked the person who publishes the Docker images, and it seems that they are built from source using less optimized versions of OpenCV and OpenBLAS than the ones in the packages we push to pip. They are in the process of revamping that pipeline to get consistent performance, ETA end of June.

You can track the github issue here:

Thanks for raising this issue.

Going forward, I recommend using your own Docker image with a simple installation of MXNet via pip install mxnet-mkl. After the revamp is completed, there should be a Docker image with the MKL build available.


#9

@ThomasDelteil, thank you for the quick answer. Yes, building my own Docker image that installs MXNet with pip is how I tested the changes. The only thing I don’t like is that my Dockerfile currently includes

pip3 install mxnet-mkl --pre

It doesn’t specify a concrete version, so, if I understand correctly, the resulting version may change over time. Is there a concrete version I can point to?


#10

You have multiple choices:
You can run:
pip3 install mxnet-mkl==1.2.0 which is the latest official release.

If you run pip3 install mxnet-mkl== (with no version number), pip will print a list of the available versions, including the nightly builds:

(from versions: 0.9.5, 0.10.0, 0.10.0.post2, 0.11.0b20170815, 0.11.0b20170820, 0.11.0, 0.11.1b20170828, 0.11.1b20170906, 0.11.1b20170913, 0.11.1b20170920, 0.11.1b20170927, 0.11.1b20171004, 0.11.1b20171011, 0.12.0b20171018, 0.12.0b20171030, 0.12.0, 0.12.1b20171105, 0.12.1b20171119, 0.12.1b20171126, 0.12.1b20171203, 0.12.1, 1.0.0b20171210, 1.0.0, 1.0.0.post0, 1.0.0.post1, 1.0.0.post2, 1.0.0.post4, 1.0.1b20171231, 1.0.1b20180107, 1.0.1b20180114, 1.0.1b20180121, 1.0.1b20180128, 1.0.1b20180202, 1.0.1b20180203, 1.0.1b20180204, 1.0.1b20180205, 1.0.1b20180206, 1.1.0b20180207, 1.1.0b20180208, 1.1.0b20180209, 1.1.0b20180210, 1.1.0b20180211, 1.1.0b20180212, 1.1.0b20180213, 1.1.0b20180214, 1.1.0b20180215, 1.1.0, 1.2.0b20180317, 1.2.0b20180318, 1.2.0b20180320, 1.2.0b20180321, 1.2.0b20180322, 1.2.0b20180323, 1.2.0b20180324, 1.2.0b20180325, 1.2.0b20180326, 1.2.0b20180327, 1.2.0b20180328, 1.2.0b20180329, 1.2.0b20180330, 1.2.0b20180331, 1.2.0b20180401, 1.2.0b20180402, 1.2.0b20180403, 1.2.0b20180404, 1.2.0b20180405, 1.2.0b20180406, 1.2.0b20180407, 1.2.0b20180408, 1.2.0b20180409, 1.2.0b20180410, 1.2.0b20180411, 1.2.0b20180412, 1.2.0b20180413, 1.2.0b20180414, 1.2.0b20180415, 1.2.0b20180416, 1.2.0b20180417, 1.2.0b20180418, 1.2.0b20180419, 1.2.0b20180420, 1.2.0b20180421, 1.2.0b20180422, 1.2.0b20180423, 1.2.0b20180424, 1.2.0b20180425, 1.2.0b20180426, 1.2.0b20180427, 1.2.0b20180428, 1.2.0b20180429, 1.2.0b20180430, 1.2.0b20180501, 1.2.0b20180502, 1.2.0b20180503, 1.2.0b20180504, 1.2.0b20180505, 1.2.0b20180506, 1.2.0b20180507, 1.2.0b20180508, 1.2.0b20180509, 1.2.0b20180510, 1.2.0b20180511, 1.2.0b20180512, 1.2.0b20180513, 1.2.0b20180514, 1.2.0b20180515, 1.2.0b20180516, 1.2.0b20180518, 1.2.0b20180520, 1.2.0b20180522, 1.2.0b20180523, 1.2.0b20180524, 1.2.0b20180525, 1.2.0b20180526, 1.2.0b20180527, 1.2.0, 1.3.0b20180528, 1.3.0b20180529, 1.3.0b20180530, 1.3.0b20180531, 1.3.0b20180601, 1.3.0b20180602, 1.3.0b20180603, 1.3.0b20180604, 1.3.0b20180605)
No matching distribution found for mxnet-mkl==

The nightly builds use the format 1.3.0bYYYYMMDD, for example 1.3.0b20180605.

If you build using pip3 install mxnet-mkl==1.3.0b20180605 you will point to this morning’s build. However, I am not sure how long we keep nightly builds available: definitely for a couple of months, but maybe not longer than that.

That’s why I recommend that you use pip3 install mxnet-mkl==1.2.0, verify that it works for your use case, and then stick with that. Plenty of performance improvements are being added to the MXNet MKL-DNN integration at the moment, but I would wait for the release of 1.3.0 before using them in production.
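To guard against accidentally pinning a nightly build that may later disappear, the version string can be checked before it goes into a Dockerfile. The following is a minimal sketch, assuming nightly versions always follow the release-number, "b", YYYYMMDD pattern seen in the list above. The function names are hypothetical.

```python
import re
from datetime import datetime

# Nightly versions in the list above look like "1.3.0b20180605":
# a release number, the letter "b", then a build date as YYYYMMDD.
NIGHTLY_PATTERN = re.compile(r"^(\d+\.\d+\.\d+)b(\d{8})$")


def nightly_build_date(version):
    """Return the build date if `version` looks like a nightly, else None."""
    match = NIGHTLY_PATTERN.match(version)
    if match is None:
        return None
    return datetime.strptime(match.group(2), "%Y%m%d").date()


def is_plain_release(version):
    """True for plain X.Y.Z versions such as "1.2.0" (no nightly or post suffix)."""
    return re.fullmatch(r"\d+\.\d+\.\d+", version) is not None


if __name__ == "__main__":
    for candidate in ["1.2.0", "1.3.0b20180605", "1.0.0.post4"]:
        print(candidate, is_plain_release(candidate), nightly_build_date(candidate))
```

A build script could refuse to proceed when `is_plain_release` returns False, so that only official releases such as 1.2.0 end up pinned.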