To improve performance, I cloned the master branch and compiled mxnet from source, following the tutorial here: https://mxnet.incubator.apache.org/get_started/install.html. The make command I used was:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_PROFILER=1
After installing the compiled mxnet, the training script ran even slower. Previously, with the pip-installed version, every 500 batches took about 3 minutes 55 seconds. With the recompiled version, the same script on the same datasets took about 4 minutes 10 seconds per 500 batches. I also uninstalled it and recompiled without the profiler enabled, but the result was the same.
Then I uninstalled it and installed mxnet-cu80 with pip install again. The speed went back to about 3 minutes 55 seconds per 500 batches. The script trains a neural network model on a single GPU. In terms of performance, are there any optimizations that can be applied when compiling from source?
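
To put a number on the difference, here is a minimal sketch of how the two builds could be compared. It is not taken from my training script: time_batches, relative_slowdown, and the step callable are hypothetical names used only for illustration, and the two durations are simply the figures quoted above.

```python
import time


def time_batches(step, num_batches=500):
    """Return wall-clock seconds taken to call step() num_batches times.

    `step` is a hypothetical callable that runs one training batch.
    Assumption: if the framework queues GPU work asynchronously, `step`
    should block until the batch has finished, or the timing will be off.
    """
    start = time.perf_counter()
    for _ in range(num_batches):
        step()
    return time.perf_counter() - start


def relative_slowdown(baseline_s, candidate_s):
    """Fraction by which candidate_s exceeds baseline_s."""
    return (candidate_s - baseline_s) / baseline_s


# Durations per 500 batches as reported in this question.
pip_build_s = 3 * 60 + 55      # 235 s with pip-installed mxnet-cu80
source_build_s = 4 * 60 + 10   # 250 s with the build from source

slowdown = relative_slowdown(pip_build_s, source_build_s)
print(f"source build is {slowdown:.1%} slower")
```

With the numbers above this comes out to roughly a 6% slowdown, so the gap is small but consistent across repeated runs.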