No noticeable speed improvement with higher compute capability

Hi,

I am using the MXNet C++ API to train a neural network. When I compile MXNet (1.2.0 from a git clone, against CUDA 9.2) for different compute capabilities and architectures, I was expecting a performance boost from targeting a higher compute capability, but I did not see one.
What could explain that the computation speed of my network (an FCN) does not change between a compute capability of 3.0 and 7.2? The FCN uses only float32 computations.

I tried both with and without cuDNN (in “NaiveEngine” mode).

Thanks!

Hi @dmidge,

Are you definitely measuring the computation time and not just the operation queuing time? I’ve seen a lot of people write their own benchmarking code that doesn’t wait for the computation to finish before stopping the timer, because they didn’t call nd.waitall(). You might also want to try the MXNet profiler for more accurate statistics on the time spent in each operator; that way you can avoid including data transfer overheads in the comparison.
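For reference, here is a minimal sketch of the kind of timing loop I mean, in the C++ API since that is what you are using (`exec` is a hypothetical Executor already bound to your FCN; the important part is the NDArray::WaitAll() before the timer stops):

```cpp
#include <chrono>
#include <iostream>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// `exec` is assumed to be an Executor already bound to GPU inputs/outputs.
void Benchmark(Executor* exec, int num_batches) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < num_batches; ++i) {
    exec->Forward(false);  // asynchronous: this only queues the operators
  }
  NDArray::WaitAll();  // without this you time the queuing, not the compute
  auto end = std::chrono::steady_clock::now();
  std::cout << "avg batch time: "
            << std::chrono::duration<double, std::milli>(end - start).count() / num_batches
            << " ms\n";
}
```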

Hi @thomelane,
Well, all my data synchronization is done with SyncCopyToCPU, which effectively waits for the computation to finish and copies the data, in synchronous mode, into a standard C++ array.
However, my benchmark was including the data transfer overhead. But if the difference in computation time is not noticeable compared to the data transfer, would that mean that, in practice, the computation optimisation has no effect for me?
I would need to investigate whether that is the case.
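For instance, something along these lines should let me split the two (a sketch, assuming `out` is the network’s output NDArray on the GPU): timing up to WaitToRead() measures the computation only, and the SyncCopyToCPU() call then adds the device-to-host transfer on top:

```cpp
#include <chrono>
#include <iostream>
#include <vector>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;
using Clock = std::chrono::steady_clock;

// `exec` and `out` are assumed: an Executor and its GPU output NDArray.
void TimeComputeVsTransfer(Executor* exec, const NDArray& out) {
  auto t0 = Clock::now();
  exec->Forward(false);
  out.WaitToRead();  // blocks until the forward pass has actually finished
  auto t1 = Clock::now();

  std::vector<mx_float> host(out.Size());
  out.SyncCopyToCPU(&host, out.Size());  // synchronous device-to-host copy
  auto t2 = Clock::now();

  std::chrono::duration<double, std::milli> compute = t1 - t0;
  std::chrono::duration<double, std::milli> transfer = t2 - t1;
  std::cout << "compute: " << compute.count() << " ms, "
            << "transfer: " << transfer.count() << " ms\n";
}
```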

OK, based on what I saw, the data transfer between batches seems to be the bottleneck, which explains why it doesn’t go faster: the GPU seems to be waiting for the data to load. However, it still looks like changing the compute capability doesn’t increase the speed of the computation itself.

Hi,

I improved the benchmark, and it seems that the performance was I/O bound, so there was not much that faster computation could have done to improve the overall speed.
However, I still have an issue that I don’t understand: I see no improvement in computation speed when I compile MXNet with a higher compute capability. So the issue remains. What could explain that?

Thanks!

Hi @dmidge,

Could you share your benchmarking code? That would be helpful to understand what you are measuring. You are right that the most common mistake is writing the code in an I/O-bound way, without taking advantage of the multiprocessing data loaders.
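The C++ API doesn’t have Gluon’s multiprocess DataLoader, but the same idea, overlapping data loading with computation, can be sketched with a background producer thread. This is a generic pattern rather than an MXNet API, and LoadBatch is a hypothetical stand-in for your real I/O:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Batch = std::vector<float>;

// Hypothetical stand-in for reading and decoding one batch from disk.
Batch LoadBatch(int i) { return Batch(1024, static_cast<float>(i)); }

int main() {
  const int kNumBatches = 100;
  const std::size_t kMaxQueued = 4;
  std::queue<Batch> queue;
  std::mutex m;
  std::condition_variable cv;

  // Producer: loads batches off the main thread so the GPU never waits on I/O.
  std::thread producer([&] {
    for (int i = 0; i < kNumBatches; ++i) {
      Batch b = LoadBatch(i);
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [&] { return queue.size() < kMaxQueued; });
      queue.push(std::move(b));
      cv.notify_all();
    }
  });

  // Consumer: pops a ready batch and runs the forward pass on it.
  for (int i = 0; i < kNumBatches; ++i) {
    Batch b;
    {
      std::unique_lock<std::mutex> lock(m);
      cv.wait(lock, [&] { return !queue.empty(); });
      b = std::move(queue.front());
      queue.pop();
      cv.notify_all();
    }
    // ... copy `b` to the GPU and call Forward() here ...
  }
  producer.join();
}
```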

Also, can you describe what you mean by “compile MXNet with a higher compute capability”?
Are you talking about the versions of CUDA, cuDNN, MKL-DNN, or something else?

Well, in this case the benchmarking code is not very useful, since I am I/O bound rather than computation bound. However, I was using the MXNet profiler and comparing the computation time displayed for every batch, and it is basically the same across compute capabilities, whereas I can clearly see a difference when I run the same computations with cuDNN enabled (compared to without).
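For reference, this is roughly how I enable the profiler (a sketch; the MXNET_PROFILER_* names are the documented environment variables, and I set them before the first MXNet call so the library picks them up — setting them in the shell before launching works just as well):

```cpp
#include <cstdlib>
#include "mxnet-cpp/MxNetCpp.h"

int main() {
  // Must happen before any MXNet call; output goes to profile.json
  // in the working directory (viewable in chrome://tracing).
  setenv("MXNET_PROFILER_AUTOSTART", "1", 1);  // start profiling automatically
  setenv("MXNET_PROFILER_MODE", "1", 1);       // record all operators
  // ... build the network and run the batches as usual ...
  return 0;
}
```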

When I talk about a higher compute capability, I mean that, keeping the same version of CUDA, we can specify at compilation time the “compute capability” and the targeted architecture versions with parameters such as “compute_60,sm_60”.
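Concretely, with the Makefile build I pass something along these lines (CUDA_ARCH is the Makefile variable MXNet feeds to nvcc; the -gencode values are standard nvcc syntax, so adjust them for your card):

```bash
# Target a Pascal GPU (compute capability 6.0):
make -j8 USE_CUDA=1 USE_CUDNN=1 \
     CUDA_ARCH="-gencode arch=compute_60,code=sm_60"

# The same build targeting Volta would use:
#   CUDA_ARCH="-gencode arch=compute_70,code=sm_70"
```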