I am using the rcnn code from the examples. When I run it with a distributed kvstore for distributed training, even with only two workers on the same machine, the forward operation takes 10 times longer to complete than when I run train_end2end.py with kvstore set to device. Has anyone else run into the same problem? If not, where should I start digging?
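
One place to start is to time the forward step in isolation under each kvstore setting, so the two runs can be compared on the same footing. Below is a minimal sketch of such a timer. `time_step`, `step_fn`, and `sync_fn` are hypothetical names introduced here and are not part of the rcnn example. The sketch assumes `step_fn` wraps a single forward call and that `sync_fn` blocks until pending work has finished (in MXNet, `mx.nd.waitall` could serve that purpose, since operations are executed asynchronously and an unsynchronized timer may only measure how long it takes to enqueue the work).

```python
import time


def time_step(step_fn, sync_fn=None, warmup=2, iters=10):
    """Return the mean wall-clock seconds per call of step_fn.

    step_fn: hypothetical zero-argument callable wrapping one forward pass.
    sync_fn: optional zero-argument callable that blocks until queued work
             is done (assumption: needed when the framework runs async).
    """
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        step_fn()
    if sync_fn is not None:
        sync_fn()

    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    # Wait for all timed calls to finish before stopping the clock.
    if sync_fn is not None:
        sync_fn()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # Stand-in for a real forward pass, used only to show the call pattern.
    def fake_forward():
        time.sleep(0.01)

    mean_s = time_step(fake_forward, warmup=1, iters=5)
    print("mean seconds per forward call: %.4f" % mean_s)
```

Running the same measurement once with kvstore set to device and once with the distributed kvstore would show whether the slowdown is really in the forward pass or in work that the forward call ends up waiting on.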