Distributed Training of the Factorization Machine model is slow

I launched distributed training of the factorization machine model in incubator-mxnet/example/sparse/factorization_machine/ on an SSH cluster with 3 nodes via the following command:

../../../tools/launch.py -n 3 --launcher ssh -H hosts --sync-dst-dir /tmp/mxnet_job/fm/ python train.py --data-train criteo.kaggle2014.train.svm --data-test criteo.kaggle2014.test.svm --num-epoch 1 --kvstore dist_async --batch-size 1000
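
For reference, the hosts file passed via -H simply lists the worker machines, one hostname or IP per line; the names below are placeholders for my actual nodes:

node1
node2
node3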

However, it was about 4 times slower than regular local training launched via the following command:

python train.py --data-train criteo.kaggle2014.train.svm --data-test criteo.kaggle2014.test.svm --num-epoch 1 --batch-size 1000
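
The only difference between the two runs is the kvstore type. As far as I understand, train.py turns the --kvstore flag into a kvstore object roughly like this (a sketch, not the exact script code):

import mxnet as mx

# 'dist_async' creates a distributed asynchronous kvstore backed by parameter servers;
# omitting --kvstore falls back to a purely local kvstore on a single machine.
kv = mx.kvstore.create('dist_async')

# The kvstore is then handed to the Module, so the sparse (row_sparse) weights are
# pulled from and pushed to the servers for every batch, e.g.:
# mod.fit(train_data, eval_data, kvstore=kv, ...)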

Distributed training of other sparse models, such as the linear classification model in incubator-mxnet/example/sparse/linear_classification/, is also slower than regular local training.

Are there any known performance issues with distributed training of the factorization machine or other sparse models?
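
For context, profile.json was collected by enabling MXNet's built-in profiler on each node; a minimal sketch of how this can be done (assuming the standard mx.profiler API; the actual wiring in my run may differ):

import mxnet as mx

# Record all profiler event types and write them to profile.json
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile.json')
mx.profiler.set_state('run')

# ... run the training loop / mod.fit(...) ...

mx.profiler.set_state('stop')
mx.profiler.dump()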

Part of the profile.json file from one of the cluster nodes, captured during distributed training of the factorization machine model, is as follows.