Load multiple rec files for shuffling and training


Hi, I have a very large dataset, about 500 GB of data. I would like to pack it into rec files. My questions are:

  1. Is it better to pack them into a single rec or multiple recs?
  2. To the best of my knowledge, MXNet only supports loading one rec. Does it support loading multiple recs and shuffling across them for training?

Thank you for your answers in advance.



  1. imo the whole point of a rec file is to store your data in one file for faster reading, so a single rec file should be preferred over multiple RecordIO files.

  2. You can use multiple rec files by writing a custom Dataset class that extends gluon.data.Dataset and implements __getitem__ and __len__.

For example:

from mxnet import gluon

class CustomCombinedDataset(gluon.data.Dataset):
    """A dataset that accepts several datasets and serves
    them as one."""

    def __init__(self, datasets):

        self.datasets = datasets

        self.lengths = []
        start = 0
        for d in datasets:
            end = start + len(d)
            self.lengths.append((start, end))
            start = end

        self.length = sum([len(d) for d in datasets])

    def __getitem__(self, idx):
        # Find the dataset whose [start, end) range contains idx,
        # then convert idx to an index local to that dataset.
        for i, (start, end) in enumerate(self.lengths):
            if start <= idx < end:
                return self.datasets[i][idx - start]
        raise IndexError("index %d out of range" % idx)

    def __len__(self):
        return self.length

Here each dataset in datasets is a gluon.data.RecordFileDataset.
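
To check the index mapping without MXNet or any rec files at hand, here is a minimal sketch of the same idea in plain Python. It assumes ordinary lists standing in for the RecordFileDataset objects, and ConcatIndex is a made-up name for illustration, not an MXNet class. It uses bisect on cumulative lengths instead of a linear scan, which may matter if you end up with many rec files.

```python
import bisect
import random


class ConcatIndex:
    """Serves several datasets as one by mapping a global index
    to (dataset, local index). Hypothetical helper, not part of MXNet."""

    def __init__(self, datasets):
        self.datasets = datasets
        # Cumulative end offsets, e.g. lengths 3, 2, 4 -> [3, 5, 9]
        self.ends = []
        total = 0
        for d in datasets:
            total += len(d)
            self.ends.append(total)

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # First dataset whose end offset is strictly greater than idx
        i = bisect.bisect_right(self.ends, idx)
        start = self.ends[i - 1] if i > 0 else 0
        return self.datasets[i][idx - start]


# Plain lists stand in for three rec-backed datasets (an assumption).
combined = ConcatIndex([["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]])

# Shuffle across all "files" by shuffling the global indices.
order = list(range(len(combined)))
random.Random(0).shuffle(order)
shuffled = [combined[i] for i in order]
```

With real data you would not need the manual shuffle: passing the combined dataset to gluon.data.DataLoader with shuffle=True samples global indices in random order, so batches mix records from all rec files. Note that RecordFileDataset relies on random access, so as far as I know each .rec file needs its matching .idx file next to it.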