num_workers should generally be set equal to the number of cores on your machine for maximum parallelization. If you increase it beyond that, you start to incur overhead from OS context switching.
Thanks @sad, that's what I thought. I used to set num_workers = multiprocessing.cpu_count() so this would scale up or down with the machine's hardware, but the numbers above are from a p3.8xlarge with 32 vCPUs, so I'm surprised that using 16 workers results in slower training than using 8.
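For what it's worth, that scaling approach can be combined with a cap, since (as noted above) more workers past a point just adds overhead. A minimal sketch; the helper name `default_num_workers` and the cap of 8 are made up for illustration, not from this thread:

```python
import multiprocessing


def default_num_workers(max_workers=8):
    """Scale the worker count with the machine, but cap it.

    Hypothetical helper: more workers than the machine (or the data
    pipeline) can keep busy just adds context-switch and IPC overhead,
    so we clamp cpu_count() to an empirically chosen ceiling.
    """
    return min(multiprocessing.cpu_count(), max_workers)


workers = default_num_workers()
```

The returned value can then be passed as `num_workers` to the DataLoader; the right cap still has to be found by benchmarking on the actual workload.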
Hi, the number of cores used for optimisation is something you need to fine-tune for the particular application at hand. More cores is not always faster; it really depends on the load each worker/CPU has to do. It comes down to a trade-off between communication cost and computation cost. I recall that (e.g.) Intel TBB used an internal algorithm to automatically decide the optimal number of cores for best performance on a specific job. I've followed a similar line of reasoning in the past when parallelizing for loops with OpenMP.
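A quick way to see that communication-vs-computation trade-off empirically is to time the same job at several worker counts and pick the fastest. A minimal sketch with Python's multiprocessing; the task `work` and the sizes are invented for illustration, and with a task this light the extra workers often don't help:

```python
import time
from multiprocessing import Pool


def work(x):
    # Deliberately small per-item task: when each item is this cheap,
    # inter-process communication dominates and adding workers can
    # actually slow the job down.
    return sum(i * i for i in range(x))


def time_pool(n_workers, items):
    # Time one run of the whole job with a pool of n_workers processes.
    start = time.perf_counter()
    with Pool(n_workers) as pool:
        pool.map(work, items)
    return time.perf_counter() - start


if __name__ == "__main__":
    items = [2000] * 200
    for n in (1, 2, 4, 8):
        print(f"{n} workers: {time_pool(n, items):.3f}s")
```

The same idea applies to DataLoader workers: sweep `num_workers` over a few values on the real training loop and keep the one that minimizes wall-clock time per epoch.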