MXNet compiled from the latest master branch is slower than the pip-installed version?

Hi,

To improve performance, I cloned the master branch and compiled MXNet from source, following the tutorial here: https://mxnet.incubator.apache.org/get_started/install.html. The make command I used was:

make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_PROFILER=1

After installing the compiled MXNet, the training script ran slower than before. Previously, every 500 batches took about 3 minutes 55 seconds. With the recompiled version, the same script on the same datasets took about 4 minutes 10 seconds per 500 batches. I also uninstalled it and recompiled without the profiler enabled, but the result was the same.

Then I uninstalled it and installed mxnet-cu80 with pip again. The speed went back to about 3 minutes 55 seconds per 500 batches. The script trains a neural network model on a single GPU. In terms of performance, is there any optimization that can be applied when compiling from source?
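
For reference, the size of the gap can be worked out from the two timings quoted above. This is only a rough sketch of the arithmetic: the helper name `to_seconds` is made up for illustration, and the inputs are the approximate wall-clock figures from this post, not profiler measurements.

```python
# Rough arithmetic on the timings reported above (approximate wall-clock figures).

def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds (illustrative helper)."""
    return minutes * 60 + seconds

BATCHES = 500

pip_time = to_seconds(3, 55)      # pip-installed mxnet-cu80: 235 s per 500 batches
source_time = to_seconds(4, 10)   # build from master:        250 s per 500 batches

# Relative slowdown of the source build compared with the pip build.
slowdown_pct = (source_time - pip_time) / pip_time * 100

print(f"pip build:    {pip_time / BATCHES:.3f} s/batch")
print(f"source build: {source_time / BATCHES:.3f} s/batch")
print(f"slowdown:     {slowdown_pct:.1f}%")
```

So the source build is roughly 6% slower, about 0.03 s per batch. A gap that small could also be affected by run-to-run variance, so it may be worth timing several runs of each build before drawing conclusions.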

Which version did you pip install?
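
One way to answer this without importing the library is to ask the packaging metadata directly. The snippet below is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); `installed_version` is a hypothetical helper, and the distribution names are the one mentioned in this thread plus the plain `mxnet` package.

```python
# Report which MXNet distribution (if any) pip has installed, and its version.
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# "mxnet-cu80" is the package named in this thread; "mxnet" is the CPU-only package.
for name in ("mxnet-cu80", "mxnet"):
    print(name, installed_version(name) or "not installed")
```

Note that a build installed from source may register under a different distribution name than the pip wheel, so a "not installed" result here does not necessarily mean MXNet is missing.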

Can you check whether the pre-release pip version also has the performance issue? If so, that could be a regression.

pip install mxnet-cu80 --pre --user