MXNet Forum

Dataloader for Large dataset


#1

My dataset is too large to load into memory, so I split it into pieces and load them one by one.
However, loading and reloading the data this way costs a lot of time, so it is not a good approach.

So I want to load the data efficiently with multithreading. Is there a useful guide or API in MXNet for this?

By the way, my data is a list of ndarrays rather than images. Each sample is a list of ndarrays.


#2

You need to write your own custom Dataset, which only has to provide loading of a single item by index via the __getitem__() method. Instead of loading all items into memory, it can keep a mapping from item index to the item's path on disk and load each requested item on demand only. See how ImageFolderDataset does this: it collects image paths in its _list_images method and loads an image only when it is actually needed.
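
Here is a minimal sketch of such a Dataset for your case (one list of ndarrays per sample). The class name LazyNDArrayDataset, the one-.npz-file-per-sample layout, and the directory argument are assumptions for illustration, not part of MXNet:

```python
import os
import numpy as np
from mxnet import nd
from mxnet.gluon.data import Dataset

class LazyNDArrayDataset(Dataset):
    """Keeps only file paths in memory and loads each sample from disk on demand."""

    def __init__(self, root):
        # Assumed layout: one .npz file per sample; only the paths are held in memory.
        self._paths = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith('.npz')
        )

    def __getitem__(self, idx):
        # The actual disk read happens here, only when this item is requested.
        sample = np.load(self._paths[idx])
        # Return the sample as a list of ndarrays, as described in the question.
        return [nd.array(sample[key]) for key in sample.files]

    def __len__(self):
        return len(self._paths)
```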

Once you have your custom Dataset, use it with the standard DataLoader and pass a num_workers value greater than 0 - the DataLoader will spin up that many worker processes, each using your Dataset to load items. This is how you get parallel loading of your data without writing any multiprocessing code yourself.
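
A usage sketch with the hypothetical class above; the directory, batch size, and worker count are placeholders to adjust for your setup:

```python
from mxnet.gluon.data import DataLoader

# Point this at wherever the per-sample files actually live.
dataset = LazyNDArrayDataset('data/samples')

# num_workers=4 starts four worker processes that call __getitem__ in parallel.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    # Batches are assembled in the background by the workers; run your training step here.
    pass
```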

You can learn more about how the Dataset/DataLoader combination works in MXNet from this tutorial.