Question about distributed synchronous and asynchronous training


The short version of my question: in distributed asynchronous mode (dist_async), do workers wait for the parameter server to finish updating the weights before pulling, or do they just pull whatever weights are currently on the parameter server?
From the description of dist_async, it seems that workers do not wait, as this says. However, I find that asynchronous mode actually takes more time to pull the weights than synchronous mode.

I have 4 virtual machines: one for the parameter server and the other three for workers. I start an MXNet cluster with one parameter server and three workers.
I’m running the image classification example from mxnet/example/gluon/, which uses the Gluon library, in this distributed environment in both async and sync modes. I use the profiler to get the operation trace, and then compare the average pull time between the two modes.
The first figure is sync, the second is async.
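For anyone who wants to reproduce the comparison, here is a rough sketch of how an average pull time could be computed from a profiler dump. It assumes the trace is in the Chrome trace-event JSON format (either a bare list of events or an object with a "traceEvents" key), with durations given either as "B"/"E" pairs or as "X" events with a "dur" field. The event names "KVStorePull" and "KVStorePush" and the helper names are hypothetical, made up for illustration; the real names in your trace may differ.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Load a Chrome trace-event file; accepts both the object and the bare-list form."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def average_duration(events, keyword):
    """Mean duration (in trace time units, usually microseconds) of events
    whose name contains `keyword`, or None if no such event completed."""
    open_ts = defaultdict(list)  # (pid, tid, name) -> stack of begin timestamps
    durations = []
    for ev in sorted(events, key=lambda e: e.get("ts", 0)):
        name = ev.get("name", "")
        if keyword.lower() not in name.lower():
            continue
        key = (ev.get("pid"), ev.get("tid"), name)
        phase = ev.get("ph")
        if phase == "X":                      # complete event carries its own duration
            durations.append(ev.get("dur", 0))
        elif phase == "B":                    # begin event: remember the timestamp
            open_ts[key].append(ev["ts"])
        elif phase == "E" and open_ts[key]:   # end event: match with the latest begin
            durations.append(ev["ts"] - open_ts[key].pop())
    return sum(durations) / len(durations) if durations else None


# Hypothetical sample events, only to show the expected input shape.
sample = [
    {"name": "KVStorePull", "ph": "B", "ts": 100, "pid": 0, "tid": 1},
    {"name": "KVStorePull", "ph": "E", "ts": 160, "pid": 0, "tid": 1},
    {"name": "KVStorePush", "ph": "B", "ts": 200, "pid": 0, "tid": 1},
    {"name": "KVStorePush", "ph": "E", "ts": 230, "pid": 0, "tid": 1},
    {"name": "KVStorePull", "ph": "X", "ts": 300, "dur": 40, "pid": 0, "tid": 1},
]

print("avg pull:", average_duration(sample, "pull"))
print("avg push:", average_duration(sample, "push"))
```

On a real dump you would call load_trace on the profiler output file of each run (sync and async) and compare the two averages for the same keyword.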

It makes sense that sync takes longer on push. But why does async take more time on pull, if async workers do not need to wait?

Thanks in advance for answering!