Speed up inference / multithreaded inference in Python

performance
python
#1

Hi,

I read some comments about multithreaded inference, and generally it is not good news.

How can I use the full power of the CPUs on the edge machine to submit multiple inputs for inference?

Each get_feature (inference) call takes almost 1 second on the Raspberry Pi, so I have to wait 5 seconds to get the 5th input image while CPU utilization is only 15-20%.

As you know, we are limited to 1 GB of RAM. What is the best approach to run multiple inferences at a time? (Python)

#2

MXNet’s engine already runs multithreaded, unless you turned off OpenMP, MKL, or BLAS.

In order to speed up inference you could have a look at NNPACK: https://github.com/apache/incubator-mxnet/blob/master/docs/faq/nnpack.md NNPACK is an acceleration package for neural network computations. It can run on x86-64, ARMv7, or ARM64 architecture CPUs and can speed up execution on multi-core CPUs.

You could try the TensorRT runtime integration in MXNet; however, this is currently still an experimental feature. https://cwiki.apache.org/confluence/display/MXNET/How+to+use+MXNet-TensorRT+integration

You could also try compiling your MXNet model with TVM, which can speed up inference: https://docs.tvm.ai/tutorials/nnvm/from_mxnet.html

It is also important to check whether there are performance bottlenecks such as I/O. How do you load your data for inference? Is it stored in a file? If so, which file format are you using?

#3

The input images come from MQTT directly into my Python code on a Raspberry Pi 3B+.

After that I extract features and compare the distances. I am not sure about OpenMP and BLAS; I compiled with BLAS and OpenMP. I will check again.

Feature extraction appears to run on one thread. Comparing the distances of the resulting features is fast.

I will look into TVM and let you know.

Best

#4

@NRauschmayr

Is there any way to see how the current MXNet build was compiled, i.e. whether it includes the above libraries?

#5

Did you install MXNet with CMake? If so, you could check CMakeCache.txt. Alternatively, you could run ldd libmxnet.so. If MXNet was compiled with OpenMP, the library will show up: you should see a line like libomp.so => /usr/local/lib/libomp.so.
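For example, piping ldd through grep narrows the output to the libraries you care about. The grep below is demonstrated against /bin/ls (since the libmxnet.so path varies by install); substitute your own path:

```shell
# For MXNet you would run something like:
#   ldd /path/to/libmxnet.so | grep -E 'omp|blas'
# A non-empty result means the library is linked in.
# Demonstrated here on /bin/ls, filtering for libc:
ldd /bin/ls | grep -c 'libc'   # prints a count of matching lines
```

A count of at least 1 means the filtered library appears among the binary's shared-library dependencies.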

You mentioned that the input images come directly from MQTT. I could imagine that you have some delays there, e.g. waiting for the next message, then preprocessing the image and postprocessing the result. One way of optimizing this is to have a separate reader process that gathers all the incoming messages and creates a batch of images that you can feed into your model. If the distance computation is not part of your model, it could also be done in a separate process.
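The reader idea above can be sketched with the standard library alone. This is a minimal, hypothetical skeleton: the names, the batch size, and the use of threads rather than processes are all placeholders. In your setup the messages would come from your MQTT client callback, and the consumer loop would run the MXNet forward pass on each batch instead of just counting it:

```python
import queue
import threading

BATCH_SIZE = 5  # would match your model's batch dimension


def reader(msg_queue, batch_queue, n_messages):
    """Gather incoming messages and group them into batches for inference."""
    batch = []
    for _ in range(n_messages):
        batch.append(msg_queue.get())   # blocks until the next message arrives
        if len(batch) == BATCH_SIZE:
            batch_queue.put(batch)      # hand a full batch to the inference loop
            batch = []
    if batch:                           # flush a final partial batch
        batch_queue.put(batch)
    batch_queue.put(None)               # sentinel: no more batches


def consume(batch_queue):
    """Stand-in for the inference loop: records the batch sizes it saw."""
    sizes = []
    while True:
        batch = batch_queue.get()
        if batch is None:
            break
        sizes.append(len(batch))        # here you would run the model on the batch
    return sizes


if __name__ == "__main__":
    msg_q, batch_q = queue.Queue(), queue.Queue()
    t = threading.Thread(target=reader, args=(msg_q, batch_q, 12))
    t.start()
    for i in range(12):                 # simulate 12 incoming MQTT payloads
        msg_q.put(f"image-{i}")
    t.join()
    print(consume(batch_q))             # prints [5, 5, 2]
```

Batching amortizes per-call overhead across several images, and decoupling message arrival from inference via a queue means the model never sits idle waiting on the network.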

#6

@NRauschmayr, the result is:

ldd /home/pi/berryconda3/lib/python3.6/site-packages/mxnet/libmxnet.so
linux-vdso.so.1 (0x7ed7d000)
/usr/lib/arm-linux-gnueabihf/libarmmem.so (0x74b09000)
libgfortran.so.3 => /usr/lib/arm-linux-gnueabihf/libgfortran.so.3 (0x74a33000)
libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x742a3000)
librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0x7428c000)
libstdc++.so.6 => /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 (0x74144000)
libm.so.6 => /lib/arm-linux-gnueabihf/libm.so.6 (0x740c5000)
libgomp.so.1 => /usr/lib/arm-linux-gnueabihf/libgomp.so.1 (0x7408d000)
libgcc_s.so.1 => /lib/arm-linux-gnueabihf/libgcc_s.so.1 (0x74060000)
libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0x74037000)
libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0x73ef8000)
/lib/ld-linux-armhf.so.3 (0x76fa0000)
libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0x73ee5000)