I am getting my hands dirty with asynchronous distributed training. All looks good, and the suggested tutorial is awesome. Based on it, I have a few questions; perhaps someone can help.
This function splits the data:
```python
import random
from mxnet import gluon

class SplitSampler(gluon.data.sampler.Sampler):
    """Split the dataset into `num_parts` parts and sample from the part
    with index `part_index`

    Parameters
    ----------
    length: int
        Number of examples in the dataset
    num_parts: int
        Partition the data into multiple parts
    part_index: int
        The index of the part to read from
    """
    def __init__(self, length, num_parts=1, part_index=0):
        # Compute the length of each partition
        self.part_len = length // num_parts
        # Compute the start index for this partition
        self.start = self.part_len * part_index
        # Compute the end index for this partition
        self.end = self.start + self.part_len

    def __iter__(self):
        # Extract examples between `start` and `end`, shuffle and return them
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return self.part_len
```
The case where the data cannot be evenly divided by the number of workers is not handled by this function. The `DataLoader` provides the `last_batch` option, but I don't know how that interacts with a sampler in the distributed setting. I could always modify `self.end` so the indices accurately cover the dataset, but again I don't know how this behaves when distributed. For example, from the function above I see that shuffling always happens within the partition of data that belongs to a specific machine (`part_index`), so I must always shuffle within the range of indices that belong to a particular worker.
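For what it's worth, here is one possible way to handle the uneven case: spread the remainder (`length % num_parts`) over the first few partitions so that no example is dropped and part sizes differ by at most one. This is just a sketch, not the tutorial's code; I've written it standalone (in practice it would subclass `gluon.data.sampler.Sampler` like the original):

```python
import random

# Sketch only: in Gluon this would subclass gluon.data.sampler.Sampler.
class UnevenSplitSampler:
    """Like SplitSampler, but covers the whole dataset even when
    `length` is not divisible by `num_parts`."""

    def __init__(self, length, num_parts=1, part_index=0):
        base, rem = divmod(length, num_parts)
        # Parts with index < rem each take one extra example,
        # so part sizes differ by at most one.
        self.part_len = base + (1 if part_index < rem else 0)
        # Offset the start by the number of extra examples handed
        # out to the earlier parts.
        self.start = base * part_index + min(part_index, rem)
        self.end = self.start + self.part_len

    def __iter__(self):
        # Shuffle only within this worker's own index range.
        indices = list(range(self.start, self.end))
        random.shuffle(indices)
        return iter(indices)

    def __len__(self):
        return self.part_len
```

For example, 10 examples split across 3 workers gives part sizes 4, 3, 3, and the three ranges together cover all 10 indices. Note this still shuffles only within each worker's partition, so it doesn't change the distributed-shuffling behaviour you describe.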
Since this is going to be multi-machine/multi-GPU training, I am going to use `dist_async` (and not `dist_device_async` - is this correct, based on the discussion here?).
A general question about the difference between the `dist_sync` and `dist_async` modes: it is my understanding that `dist_sync` is used like single-machine training, where multiple machines serve to increase the effective batch size (gradients are aggregated from all machines before each update). So `dist_sync` is used when we want a larger batch. With `dist_async`, on the other hand, the weights on each worker are updated independently of the other machines. So there is no increase in batch size for the gradient evaluation, but with many machines the model trains faster, since there are many more updates (as many as there are machines). Is this correct? Could someone please verify? I am a bit confused based on the guidelines here.
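To make sure I'm not misreading the modes, here is a toy illustration of my understanding (plain Python on a single scalar weight, not actual MXNet/KVStore code):

```python
# Toy contrast of dist_sync-style vs dist_async-style updates.
# `grads` holds the gradient each worker computed on its own batch.

def sync_step(w, grads, lr=0.1):
    # dist_sync (my understanding): gradients from all workers are
    # aggregated first, then ONE update is applied -- effectively a
    # single step with a larger batch.
    return w - lr * sum(grads) / len(grads)

def async_steps(w, grads, lr=0.1):
    # dist_async (my understanding): each worker's gradient is applied
    # as soon as it arrives -- MORE updates, each from one worker's batch.
    for g in grads:
        w = w - lr * g
    return w
```

With two workers reporting gradients `[1.0, 3.0]` from `w = 1.0`, sync mode does one averaged update (`w = 0.8`) while async mode does two independent updates (`w = 0.6`). Is that the right mental model?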
Thank you very much for your time, and apologies for the silly questions (the inner geek speaks within me).