RCNN forward slow during distributed training (0.12)


#1

I am using the RCNN code from the examples. When I run it with a distributed kvstore, even with just two workers on the same machine, the forward operation takes about 10 times longer to complete than when running train_end2end.py with kvstore set to device. Wondering if anyone else has run into the same problem, or where to start digging.
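Roughly, the only difference between the two runs is the kvstore type passed to the training script (a minimal sketch, not the exact train_end2end.py code; I believe the script takes this as a --kv-store argument):

```python
import mxnet as mx

# Minimal sketch of the two configurations being compared.
kv_type = 'device'       # single machine, multi-GPU: forward is fast
# kv_type = 'dist_sync'  # distributed run, started via tools/launch.py
#                        # (which sets up scheduler/server processes);
#                        # forward becomes ~10x slower for me
kv = mx.kvstore.create(kv_type)
print(kv.type)
```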

Thanks

AW


#2

If you compile MXNet from source with the USE_PROFILER=1 flag, you can profile the code and see what's going on. It could provide some hints:
https://mxnet.incubator.apache.org/how_to/perf.html#profiler
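For example (a sketch using the 0.12-era profiler API; newer versions renamed these to mx.profiler.set_config / set_state):

```python
import mxnet as mx

# Assumes MXNet was built with USE_PROFILER=1.
mx.profiler.profiler_set_config(mode='all', filename='rcnn_profile.json')
mx.profiler.profiler_set_state('run')

# ... run a few training iterations here ...

mx.profiler.profiler_set_state('stop')
# Open rcnn_profile.json in chrome://tracing to see per-operator timings,
# e.g. whether the time goes into operators or into kvstore communication.
```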

But two workers on the same machine can also be slower, because each worker only gets a share of that machine's resources.


#3

I met the same issue; the bottleneck seems to be updating the parameters. Maybe it's a network issue.
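One way to isolate this is to time bare push/pull calls on the kvstore, something like the sketch below (key, shape, and iteration count are placeholders; in the real distributed run the kvstore type would be dist_sync, 'local' is only used so the snippet runs standalone):

```python
import time
import mxnet as mx

# Rough sketch: time parameter push/pull through the kvstore on its own,
# independent of the forward/backward pass.
kv = mx.kvstore.create('local')   # 'dist_sync' in the actual distributed run
shape = (512, 512)
arr = mx.nd.ones(shape)
kv.init(0, arr)

start = time.time()
for _ in range(100):
    kv.push(0, arr)
    kv.pull(0, out=arr)
mx.nd.waitall()                   # push/pull are async; wait before timing
print('avg push/pull: %.4f s' % ((time.time() - start) / 100))
```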
Do you have any progress on this issue?


#4

No luck so far. Increasing the network bandwidth seems to help, though.
=)


#5

What's your input? Is it a .rec file or raw JPEG files?