In my distributed training setup, I used perf to profile the CPU cost of the mxnet server role and got the following result. I guess the server role uses OpenMP to reduce gradients from the workers, but I couldn't figure out which caller invokes OpenMP for the gradient reduction…
It looks like there is an OpenMP imbalance: a lot of threads are spinning on a barrier. Can you try to get the stack traces with perf so that we can identify which function is causing that hotspot?
perf record -g
perf report -G --stdio
You can also adjust the number of OpenMP threads via the OMP_NUM_THREADS environment variable.
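A minimal sketch of that (the value 1 here is just for diagnosis; the variable must be exported before the server process starts, since the OpenMP runtime reads it once at startup):

```shell
# Limit the OpenMP runtime to a single thread for processes
# launched from this shell; child processes inherit the
# exported variable.
export OMP_NUM_THREADS=1
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"   # prints OMP_NUM_THREADS=1
```

Launch the mxnet server role from the same shell so it inherits the setting.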
I tried the --call-graph option, but it only showed me the 'start_thread' function, as follows:
I also tried setting OMP_NUM_THREADS=1, and kmp_barrier's cost dropped as expected. I know kmp_barrier is used for thread synchronization, but I don't think it should cost this much, and I want to optimize it.
The problem is that the threads spend most of their time just waiting for work. How many threads are you running? Do you do asynchronous or synchronous distributed training?
My machine has 28 physical CPU cores (that is, 56 logical CPUs), and I used 28 OpenMP threads as recommended.
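In case it's useful, a quick way to confirm the physical vs. logical split on Linux (assuming lscpu from util-linux is available):

```shell
# Logical CPUs visible to the scheduler (includes hyper-threads).
nproc
# Physical cores = "Socket(s)" x "Core(s) per socket".
lscpu | grep -E '^(Socket|Core)'
```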
I believe I was training with sync SGD.