Hey @feevos,
First, make sure you are using the mxnet-mkl build, so that you take advantage of the parallelization offered by MKL-DNN:
pip install mxnet-mkl
You can read more in this Medium post: https://medium.com/apache-mxnet/accelerating-deep-learning-on-cpu-with-intel-mkl-dnn-a9b294fb0b9
The article suggests setting these environment variables to get the maximum performance:
export KMP_AFFINITY=granularity=fine,compact,1,0
export vCPUs=`cat /proc/cpuinfo | grep processor | wc -l`
export OMP_NUM_THREADS=$((vCPUs / 2))
If the problem persists, try using all the logical processors instead of half of them:
export OMP_NUM_THREADS=`cat /proc/cpuinfo | grep processor | wc -l`
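If you want to switch between the two settings without retyping the exports, something like the sketch below might help. It is only an illustration: `omp_threads` and `THREAD_MODE` are names I made up (they are not part of MXNet or MKL-DNN), and I count the `processor` lines with `grep -c` rather than the `cat | grep | wc` pipeline, falling back to 1 thread if /proc/cpuinfo is not available.

```shell
#!/bin/sh
# Hypothetical helper: print the thread count to use for a given number of vCPUs.
#   $1 = number of vCPUs
#   $2 = "half" (default, as the article suggests) or "all"
omp_threads() {
    n=$1
    if [ "${2:-half}" = "half" ]; then
        n=$((n / 2))
    fi
    # Clamp to at least one thread (e.g. on a single-vCPU machine).
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Count logical processors; stays empty if /proc/cpuinfo cannot be read.
vCPUs=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null)

# THREAD_MODE is an assumed variable name: set it to "all" to use every vCPU.
THREAD_MODE=${THREAD_MODE:-half}

export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=$(omp_threads "${vCPUs:-1}" "$THREAD_MODE")
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Source the file in the shell where you launch training so the exports stick, once as-is and once with THREAD_MODE=all set beforehand, then compare the timings of the two runs.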