Hi, the documentation mentions two types of KVStore:
local: all gradients are copied to CPU memory and weights are updated there
device: both gradient aggregation and weight updates are run on GPUs
In local mode, on which CPU is the operation done: the CPU of the machine hosting the GPUs, or the CPU of one or several machines doing only the KVStore job?