8x inference runtime difference between pip install and manual install


#1

I ran a quick, informal benchmark on my code. Runtimes are in seconds.
Intel® Core™ i7-4700MQ CPU @ 2.40GHz

installation             inference runtime (s)
pip install mxnet-mkl                  0.8
pip install mxnet                      1.3
manual install (atlas, openmp)         4.2
manual install (atlas, lapack, openmp) 4.2
manual install (atlas)                 4.4
manual install (openblas)             10.8

manual installation = build from source (master)
pip mxnet = 1.1.0

I am surprised by the roughly 8x difference (1.3 s vs. 10.8 s) between pip install mxnet (which uses openblas) and manual installation with openblas.

Am I missing some obvious compilation flags? This was meant to be an informal benchmark, but I can put together a reproducible example and control for the mxnet version if that would help with debugging.


#2

Hi @insilico,

Unfortunately, I have not been able to reproduce your issue.
Here is my benchmark code:

import mxnet as mx
print(mx.__version__, mx.__file__)
import time
from mxnet.gluon.model_zoo import vision
resnet18 = vision.resnet18_v1(pretrained=True)
data = mx.nd.ones((16, 3, 224, 224))
tick = time.time()
for i in range(10):
    resnet18(data).wait_to_read()
print("{0:.4f}".format(time.time()-tick))

Pip installed:

  • mxnet 1.1.0: 10s
  • mxnet-mkl 1.1.0: 2s
  • mxnet-mkl --pre: 1.2s
  • mxnet --pre: 8s

Locally built:

  • latest master, make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_MKLDNN=1: 1.1s
  • latest master, make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas: 8s

This is consistent with the pip-installed versions.

I wonder whether your issue might come from your locally installed openblas?
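One caveat when comparing these numbers: a single timed loop folds any one-time start-up cost into the result. Below is a minimal sketch of a timing helper that runs warm-up iterations first and then reports the median of several timed runs. The name benchmark and its warmup and repeats parameters are hypothetical and not part of mxnet.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5):
    """Time fn() `repeats` times after `warmup` untimed calls.

    Returns (median_seconds, list_of_all_timings).
    """
    # Untimed calls so one-time initialization does not skew the result.
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(repeats):
        tick = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - tick)

    return statistics.median(timings), timings
```

With the benchmark code above, it could be called as benchmark(lambda: resnet18(data).wait_to_read()).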


#3

Hey @ThomasDelteil

Thanks a lot for looking into this. I took your benchmark code and produced the following results:

Ubuntu 17.10 with 4.13.0-39-generic
Intel® Core™ i7-4700MQ CPU @ 2.40GHz

time      mxnet         source
10.6s     1.1.0         PyPI mxnet
15.7s     1.2.0         built with libopenblas-dev (0.2.20)
31.4s     1.2.0         built with libatlas-base-dev (3.10.3-5)

Both libopenblas-dev and libatlas-base-dev come from the Ubuntu repository and were freshly reinstalled for the benchmark above. Since I am building mxnet according to the official build instructions, I still find the discrepancy surprising.

I could try to compile openblas from source to see if Ubuntu’s libopenblas-dev is at fault. Any other hints?

Edit 1: I have tried with openblas built from source, which gives the same result as Ubuntu’s libopenblas-dev.
Edit 2: Is your local openblas built with openmp? Is PyPI mxnet (libmxnet.so) statically linked with an openblas that is itself built with openmp?


#4

This is my configuration:
32-core Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
OpenBLAS 0.2.20
Ubuntu 16.04.3 LTS

libmxnet.so comes statically linked with openblas.
OpenMP seems to be dynamically linked:

> ldd libmxnet.so
...
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f7b35500000)
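The same check can be scripted. Below is a minimal sketch that scans ldd output for common OpenMP runtime libraries. The helper name find_openmp_libs and the list of library prefixes are assumptions for illustration, and the text passed in would come from running ldd on libmxnet.so.

```python
# Prefixes of common OpenMP runtimes (GNU, Intel, LLVM); an assumed, non-exhaustive list.
OPENMP_RUNTIMES = ("libgomp", "libiomp5", "libomp")


def find_openmp_libs(ldd_output):
    """Return the names of OpenMP runtime libraries listed in ldd output."""
    found = []
    for line in ldd_output.splitlines():
        # Each ldd line looks like: "<name> => <path> (<address>)".
        name = line.strip().split(" ", 1)[0]
        if name.startswith(OPENMP_RUNTIMES):
            found.append(name)
    return found
```

An empty result would suggest that no OpenMP runtime is dynamically linked, although it could still be linked statically.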

@yizhiliu @szha any ideas?


#5

From @szha, possible suspects:

  • PyPI version comes with debug options off
  • OpenBLAS is compiled with the following flags: DYNAMIC_ARCH=1 NO_SHARED=1 USE_OPENMP=1

#6

@ThomasDelteil
Thanks for your continued support. Your tips helped me confirm my hypothesis about openmp, and I have resolved the performance differences.

I realized that PyPI mxnet does not respect the environment variables OMP_NUM_THREADS or MXNET_CPU_WORKER_NTHREADS. On my machine, the benchmark code always runs at 200% CPU with PyPI mxnet.

mxnet built against libopenblas-dev or a manually installed openblas respects OMP_NUM_THREADS (as it should, per the openblas documentation), and when OMP_NUM_THREADS is not defined, it uses $(nproc) threads (as documented in openblas).

I did not set OMP_NUM_THREADS, so the benchmark ran at 800% CPU (8 logical cores on the Intel i7). It ran slower at 800% than at 200%, hence the discrepancy observed above.

Edit:
I should note that OMP_NUM_THREADS=2 makes the manually built mxnet as fast as the PyPI version, and OMP_NUM_THREADS=4 makes it slightly faster on the benchmark code.
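For anyone who wants to repeat the thread-count comparison, here is a minimal sketch that runs a snippet in a fresh Python process for each OMP_NUM_THREADS value. The helper name run_with_threads and the PROBE snippet are hypothetical stand-ins: PROBE only echoes the variable and would be replaced by the benchmark code from earlier in the thread. A fresh process is used because OpenMP runtimes typically read OMP_NUM_THREADS once at startup.

```python
import os
import subprocess
import sys


def run_with_threads(code, num_threads):
    """Run `code` in a new Python process with OMP_NUM_THREADS set; return its stdout."""
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in workload (assumption): it only reports the variable the child process sees.
PROBE = "import os; print(os.environ['OMP_NUM_THREADS'])"

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, run_with_threads(PROBE, n))
```

Swapping PROBE for the timing code should reproduce the comparison, assuming the mxnet build in use honors OMP_NUM_THREADS.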


#7

@insilico, no problem. Thanks for sharing your findings! I’ll investigate on my side. I find it strange that the PyPI mxnet would not respect these environment variables.