Location of KVStore update


#1

Hi, the documentation mentions two types of kvstore:
local: all gradients are copied to CPU memory and weights are updated there
device: both gradient aggregation and weight updates are run on GPUs
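
As I understand it, "local" mode conceptually does something like this toy sketch (plain Python, not the MXNet API; the function name and SGD update are my own illustration):

```python
# Toy illustration of 'local' kvstore mode: per-GPU gradients are copied
# to host (CPU) memory, aggregated there, and the CPU-resident weight
# copy is updated, then broadcast back to the GPUs.

def local_kvstore_step(weights, per_gpu_grads, lr=0.1):
    """Simulate one 'local' update for a single flat parameter vector."""
    n = len(per_gpu_grads)
    # 1. "Copy" each GPU's gradient to the CPU and aggregate (mean).
    agg = [sum(g[i] for g in per_gpu_grads) / n for i in range(len(weights))]
    # 2. Apply the update on the CPU (plain SGD here).
    new_w = [w - lr * g for w, g in zip(weights, agg)]
    # 3. In real MXNet the updated weights are broadcast back to every GPU.
    return new_w

w = [1.0, 2.0]
grads = [[0.2, 0.4], [0.6, 0.8]]  # gradients from two "GPUs"
print(local_kvstore_step(w, grads))  # approximately [0.96, 1.94]
```

In "device" mode the aggregation and update in steps 1–2 would happen on the GPUs instead, avoiding the host copy.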

In local mode, on which CPU is the operation done: the CPU of the machine hosting the GPUs, or the CPU(s) of one or several machines dedicated to the KVStore job?


#2

Hi @olivcruche,

When you’re using local mode, the ‘CPU of the machine hosting the GPUs’ and the ‘CPU of the machine running the KVStore’ are the same thing. To distribute the kvstore across multiple machines you’ll want to set the kvstore to dist_sync, and even then it’s typical for the kvstore to run on the same machines that are used for training.
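
To make the dist_sync idea concrete, here is a toy parameter-server sketch (plain Python, not MXNet's actual API; the class and method names are made up): workers push per-key gradients to a server process, which aggregates them and applies the update, and workers then pull the fresh weights.

```python
# Toy parameter-server sketch illustrating dist_sync-style aggregation:
# the server holds the authoritative weights; workers push gradients
# and pull updated weights each step.

class ToyParamServer:
    def __init__(self, lr=0.1):
        self.store = {}   # key -> weight vector held on the server
        self.lr = lr

    def init(self, key, value):
        self.store[key] = list(value)

    def push(self, key, grads_from_workers):
        # Aggregate gradients from all workers (mean), then update.
        n = len(grads_from_workers)
        for i in range(len(self.store[key])):
            g = sum(w[i] for w in grads_from_workers) / n
            self.store[key][i] -= self.lr * g

    def pull(self, key):
        # Workers fetch the freshly updated weights.
        return list(self.store[key])

ps = ToyParamServer()
ps.init("fc_weight", [1.0, 2.0])
ps.push("fc_weight", [[0.2, 0.4], [0.6, 0.8]])  # gradients from 2 workers
print(ps.pull("fc_weight"))  # approximately [0.96, 1.94]
```

In real MXNet the server and worker roles are separate processes (possibly on separate hosts), and push/pull go over the network, but the key-value flow is the same.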


#3

Thanks Thom,
In this tutorial https://aws.amazon.com/fr/blogs/machine-learning/scalable-multi-node-deep-learning-training-using-gpus-in-the-aws-cloud/ I understand that the kvstore is hosted on separate, non-training m4 instances. Can such a thing be achieved with the launch.py tool?


#4

Ah yes, thanks for the link to clarify! So with launch.py it looks like you can achieve this, but the order of the hosts is important. Say you have 5 machines in total (all added to a hosts file) and want 2 parameter servers and 3 workers (to calculate gradients); you should specify:

launch.py -s 2 -n 3

Servers are assigned first, so the first two hosts in the hosts file will be assigned as parameter servers (e.g. m4 instances) and the next three machines will be the workers (e.g. p3 instances). If you wanted every machine to act as both a parameter server and a worker, you could specify:

launch.py -s 5 -n 5
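
For the m4/p3 split above, a fuller invocation might look something like the sketch below (the hosts file name, training script, and its arguments are assumptions for illustration; servers come first in the file):

```shell
# hosts file (one hostname per line, servers listed first):
#   m4-host-1
#   m4-host-2
#   p3-host-1
#   p3-host-2
#   p3-host-3

python tools/launch.py -n 3 -s 2 -H hosts --launcher ssh \
    python train.py --kv-store dist_sync
```

launch.py starts the server processes on the first `-s` hosts and the worker processes (each running the training command) on the following `-n` hosts.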