I see a large discrepancy in ResNet-18 run-time on Jetson TX2 between MXNet v0.9.5 and v0.11.0 when running with batch = 1: 91 ms/frame vs. 124 ms/frame, respectively. nvvp shows that the same conv kernels are called in both versions, but they take ~2x longer in v0.11.0. Do you have any idea why that might happen?
- Jetson TX2 with Jetpack 3.1 (cuDNN 6.0, CUDA 8.0)
- ResNet-18 built from the resnet.py symbol available in the repo
- input resolution 640x480
- batch = 1
Running with batch = 8 results in similar run-time in both cases.
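
For anyone who wants to compare the two versions the same way, here is a minimal sketch of a per-frame timing harness. It is not the exact script behind the numbers above. `time_per_frame` and `run_once` are hypothetical names: `run_once` stands for any callable that runs one forward pass and blocks until the output is ready (with MXNet this would typically mean calling `wait_to_read()` on the output NDArray, since execution is asynchronous). Warm-up iterations are excluded so that one-time setup costs do not skew the average.

```python
import time


def time_per_frame(run_once, batch=1, warmup=10, iters=50):
    """Return the mean wall-clock time per frame, in milliseconds.

    run_once -- hypothetical callable: one forward pass over `batch` frames
                that blocks until the result is actually available.
    """
    # Warm-up runs are not timed: the first iterations usually include
    # one-time costs such as memory allocation.
    for _ in range(warmup):
        run_once()

    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start

    # Normalize by the number of frames processed, not just iterations,
    # so batch = 1 and batch = 8 results are directly comparable.
    return elapsed * 1000.0 / (iters * batch)


if __name__ == "__main__":
    # Stand-in workload for demonstration only; replace with a real
    # forward pass that synchronizes with the GPU before returning.
    def fake_forward():
        time.sleep(0.002)

    print("ms/frame: %.2f" % time_per_frame(fake_forward, batch=1))
```

Running the same harness under each MXNet version with batch = 1 and batch = 8 should give numbers that can be compared directly with the ms/frame figures above.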