Raised the issue here: https://github.com/apache/incubator-mxnet/issues/9087
We’ve received a couple of NVIDIA Titan V (Volta) cards and have been experimenting with half precision (dtype = float16). Comparing runs with and without half precision, we’re seeing at best a marginal performance improvement. We also tried a Titan X (Pascal), although we didn’t expect half precision to pay off on the Pascal architecture.
This was tested with release 1.0.0
Running on a machine with CUDA 9.0 + cuDNN 7.0.5.
To reproduce, run one epoch of the ResNet CIFAR-10 training script:
```
time python2 train_cifar10.py --dtype float16 --network resnet --num-epochs 1 --num-layers 110 --batch-size 512 --gpus 0
```
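For reference, this is roughly how the --dtype float16 option takes effect in MXNet's image-classification example scripts: the input data is cast to float16 so the network runs in half precision, with a cast back to float32 before the softmax for numerical stability. A minimal sketch only — the toy layer stack and the helper name build_fp16_symbol are ours for illustration, not the actual train_cifar10.py internals:

```python
import mxnet as mx

def build_fp16_symbol(num_classes=10, dtype='float16'):
    # Hypothetical helper for illustration; not from train_cifar10.py.
    data = mx.sym.Variable('data')
    if dtype == 'float16':
        # Cast the input once; downstream layers then run in fp16,
        # which is what allows cuDNN to select half-precision kernels.
        data = mx.sym.Cast(data=data, dtype='float16')
    net = mx.sym.Convolution(data=data, num_filter=16, kernel=(3, 3), pad=(1, 1))
    net = mx.sym.Activation(data=net, act_type='relu')
    net = mx.sym.Pooling(data=net, kernel=(1, 1), global_pool=True, pool_type='avg')
    net = mx.sym.Flatten(data=net)
    net = mx.sym.FullyConnected(data=net, num_hidden=num_classes)
    if dtype == 'float16':
        # Cast back to float32 before the loss for numerical stability.
        net = mx.sym.Cast(data=net, dtype='float32')
    return mx.sym.SoftmaxOutput(data=net, name='softmax')
```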
For the Titan V (Volta) we’re getting:
~2700 samples/sec with half precision on, and ~2900 samples/sec with it off. I believe it should be the opposite, if anything.
We’re also not seeing a large speed improvement going from the Titan X (Pascal) to the Titan V (Volta).
For the Titan X (Pascal) we’re getting:
~2600 samples/sec with half precision on, and ~2228 samples/sec when off.
So the relative improvement from half precision is much better on the Titan X (Pascal) than on the Titan V, which is the opposite of what we expected.
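Spelled out, the numbers above imply:

```python
# Relative throughput change from enabling float16, using the numbers above.
titan_v = 2700.0 / 2900.0 - 1  # ~ -6.9%: fp16 is slower on the Titan V (Volta)
titan_x = 2600.0 / 2228.0 - 1  # ~ +16.7%: fp16 is faster on the Titan X (Pascal)
print('Titan V: {:+.1%}, Titan X: {:+.1%}'.format(titan_v, titan_x))
```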