DataLoader for a large dataset


#1

My dataset is too large to load into memory, so I split it into pieces and load them one by one.
However, loading and reloading the data this way takes a lot of time, so it is not a good approach.

So I want to find a way to load the data efficiently with multithreading. Is there any useful guide or API in MXNet?

By the way, my data is a list of NDArrays rather than images. Each sample is a list of NDArrays.


#2

You need to write your own custom Dataset, which only has to provide elementary loading of an item by index via the __getitem__() method. Instead of loading all items into memory, it can keep a mapping from item index to item path on disk and load a requested item on demand only. See how ImageFolderDataset does this: it collects image paths in its _list_images method and loads an image only when it is actually needed.
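Since your samples are lists of NDArrays rather than images, a minimal sketch of such a lazy dataset might look like the following. It assumes each sample was saved to its own file with mx.nd.save; the LazyNDArrayDataset name and the one-file-per-sample layout are made up for illustration:

```python
import os

import mxnet as mx
from mxnet.gluon.data import Dataset


class LazyNDArrayDataset(Dataset):
    """Keeps only file paths in memory and loads each sample on demand."""

    def __init__(self, root):
        # Collecting the paths up front is cheap; the arrays stay on disk.
        self._paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]

    def __getitem__(self, idx):
        # mx.nd.load returns the list of NDArrays saved in the file;
        # returning it as a tuple lets the DataLoader's default batching
        # stack each element of the sample across the batch.
        return tuple(mx.nd.load(self._paths[idx]))

    def __len__(self):
        return len(self._paths)
```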

Once you have your custom Dataset, you use it with the default DataLoader and set the num_workers argument to a value greater than 0. The DataLoader will spin up that number of multiprocessing workers, each of which reads items from your Dataset. This is how you get parallel data loading without doing any multiprocessing yourself.
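For example, reusing the hypothetical LazyNDArrayDataset from the sketch above (the directory path and batch settings are placeholders):

```python
from mxnet.gluon.data import DataLoader

dataset = LazyNDArrayDataset('data/samples')  # hypothetical directory
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    pass  # each batch was read from disk by one of the 4 worker processes
```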

You can learn more about how the Dataset/DataLoader combination works in MXNet from this tutorial.


#3

Hi, I have the same problem, have you solved it?


#4

Hi, I cannot find where the ImageFolderDataset example loads the data only when it is needed. Could you explain it in more detail? Thanks in advance.


#5

Hi richard,

With the Dataset/DataLoader API, you don’t have to worry about manually loading the data when it is needed; the API takes care of that for you. That is, you just need to create a DataLoader, either by specifying how to load the data from disk or by passing in a gluon.data.Dataset object. In your training loop, the DataLoader workers will load the data into memory one batch at a time.

If you already have an ImageFolderDataset, or any subclass of gluon.data.Dataset, then all you need to do is create a DataLoader that wraps the dataset and use it in your training loop.
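As a rough end-to-end sketch (the file path, batch size, and epoch count are placeholders, not from this thread):

```python
from mxnet.gluon.data import DataLoader
from mxnet.gluon.data.vision import ImageFolderDataset, transforms

# ImageFolderDataset yields (image, label) pairs; transform_first
# applies the transform to the image only, leaving the label alone.
dataset = ImageFolderDataset('data/train').transform_first(transforms.ToTensor())
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for epoch in range(3):
    for data, label in loader:
        # Forward/backward pass goes here; the DataLoader workers have
        # already decoded this batch from disk in the background.
        pass
```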