No noticeable speed improvement with higher compute capability

Hi,

I am using the MXNet C++ API to train a neural network. When I compile MXNet (1.2.0 from a git clone, against CUDA 9.2) for different compute capabilities and architectures, I was expecting a performance boost from targeting a higher compute capability, but I did not see one.
What could explain that the computation speed of my network (an FCN) does not change between a compute capability of 3.0 and 7.2? The FCN uses only float32 computations.

I tried both with and without cuDNN (in “NaiveEngine” mode).

Thanks!

Hi @dmidge,

Are you definitely measuring the computation time and not just the operation queuing time? I’ve seen a lot of people write their own benchmarking code that doesn’t wait for the computation to finish before stopping the timer, because they didn’t call nd.waitall(). You might also want to try the MXNet profiler for more accurate statistics on the time spent in each operator; that way you can avoid including data transfer overheads in the comparison.
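For reference, here is a minimal sketch of the kind of timing loop I mean, in the C++ API since that is what you are using (`exec` is a hypothetical Executor already bound to your FCN; the important part is the NDArray::WaitAll() before the timer stops):

```cpp
#include <chrono>
#include <iostream>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// `exec` is assumed to be an Executor already bound to GPU inputs/outputs.
void Benchmark(Executor* exec, int num_batches) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < num_batches; ++i) {
    exec->Forward(false);  // asynchronous: this only queues the operators
  }
  NDArray::WaitAll();  // without this you time the queuing, not the compute
  auto end = std::chrono::steady_clock::now();
  std::cout << "avg batch time: "
            << std::chrono::duration<double, std::milli>(end - start).count() / num_batches
            << " ms\n";
}
```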

Hi @thomelane,
Well, all my data synchronization is done with SyncCopyToCPU, which effectively waits for the computation to finish and copies the data, in synchronous mode, into a standard C++ array.
However, my benchmark was including the data transfer overhead. But if the difference in computation time is not noticeable compared to the data transfer, would that mean that, in practice, the computation optimisation has no effect for me?
I would need to investigate whether that is the case.
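For instance, something along these lines should let me split the two (a sketch, assuming `out` is the network’s output NDArray on the GPU): timing up to WaitToRead() measures the computation only, and the SyncCopyToCPU() call then adds the device-to-host transfer on top:

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;
using Clock = std::chrono::steady_clock;

// `exec` and `out` are assumed: an Executor and its GPU output NDArray.
void TimeComputeVsTransfer(Executor* exec, const NDArray& out) {
  auto t0 = Clock::now();
  exec->Forward(false);
  out.WaitToRead();  // blocks until the forward pass has actually finished
  auto t1 = Clock::now();

  std::vector<mx_float> host(out.Size());
  out.SyncCopyToCPU(&host, out.Size());  // synchronous device-to-host copy
  auto t2 = Clock::now();

  std::chrono::duration<double, std::milli> compute = t1 - t0;
  std::chrono::duration<double, std::milli> transfer = t2 - t1;
  std::cout << "compute: " << compute.count() << " ms, "
            << "transfer: " << transfer.count() << " ms\n";
}
```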

OK, based on what I saw, the data transfer between batches seems to be the bottleneck, which explains why it doesn’t go faster: the GPU seems to be waiting for the data to load. However, it still looks like changing the compute capability doesn’t increase the speed of the computation itself.

Hi,

I improved the benchmark, and it seems that the performance was I/O bound, so there was not much that faster computation could have done to improve the overall speed.
However, I still have an issue that I don’t understand: I see no improvement in computation speed when I compile MXNet with a higher compute capability. So the issue remains. What could explain that?

Thanks!

Hi @dmidge,

Could you share your benchmarking code? That would be helpful to understand what you are measuring. You are right that the most common mistake is writing the code in an I/O-bound way, without taking advantage of the multiprocessing data loaders.
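The C++ API doesn’t have Gluon’s multiprocess DataLoader, but the same idea, overlapping data loading with computation, can be sketched with a background producer thread. This is a generic pattern rather than an MXNet API, and LoadBatch is a hypothetical stand-in for your real I/O:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Batch = std::vector<float>;

// Hypothetical stand-in for reading and decoding one batch from disk.
Batch LoadBatch(int i) { return Batch(1024, static_cast<float>(i)); }

int main() {
  const int kNumBatches = 100;
  const std::size_t kMaxQueued = 4;
  std::queue<Batch> queue;
  std::mutex m;
  std::condition_variable cv;

  // Producer: loads batches off the main thread so the GPU never waits on I/O.
  std::thread producer([&] {
    for (int i = 0; i < kNumBatches; ++i) {
      Batch b = LoadBatch(i);
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [&] { return queue.size() < kMaxQueued; });
      queue.push(std::move(b));
      cv.notify_all();
    }
  });

  // Consumer: pops a ready batch and runs the forward pass on it.
  for (int i = 0; i < kNumBatches; ++i) {
    Batch b;
    {
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [&] { return !queue.empty(); });
      b = std::move(queue.front());
      queue.pop();
      cv.notify_all();
    }
    // ... copy `b` to the GPU and call Forward() here ...
  }
  producer.join();
}
```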

Also, can you describe what you mean by “compile MXNet with a higher compute capability”?
Are you talking about the versions of CUDA, cuDNN, MKL-DNN, or something else?

Well, in this case the benchmarking code is not very useful, since I am I/O bound rather than computation bound. However, I was using the MXNet profiler and comparing the computation time displayed for every batch, and it is basically the same across compute capabilities, whereas I can clearly see a difference when I run the same computations with cuDNN enabled (compared to without).
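For reference, this is roughly how I enable the profiler (a sketch; the MXNET_PROFILER_* names are the documented environment variables, and I set them before the first MXNet call so the library picks them up — setting them in the shell before launching works just as well):

```cpp
#include <cstdlib>
#include "mxnet-cpp/MxNetCpp.h"

int main() {
  // Must happen before any MXNet call; output goes to profile.json
  // in the working directory (viewable in chrome://tracing).
  setenv("MXNET_PROFILER_AUTOSTART", "1", 1);  // start profiling automatically
  setenv("MXNET_PROFILER_MODE", "1", 1);       // record all operators
  // ... build the network and run the batches as usual ...
  return 0;
}
```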

When I talk about a higher compute capability, I mean that, keeping the same version of CUDA, we can specify at compilation time the “compute capability” and the targeted architecture versions with parameters such as “compute_60,sm_60”.
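Concretely, with the Makefile build I pass something along these lines (CUDA_ARCH is the Makefile variable MXNet feeds to nvcc; the -gencode values are standard nvcc syntax, so adjust them for your card):

```bash
# Target a Pascal GPU (compute capability 6.0):
make -j8 USE_CUDA=1 USE_CUDNN=1 \
     CUDA_ARCH="-gencode arch=compute_60,code=sm_60"

# The same build targeting Volta would use:
#   CUDA_ARCH="-gencode arch=compute_70,code=sm_70"
```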