Guidance for big data loading with MXNet


#1

I am designing a recommender system that will train on user-to-item implicit interaction data. The data is too large to fit in memory. The label is binary, and the initial features will be categorical and continuous; in the future, however, the network should also ingest images, text, sequential data, etc.

It is critical that I can train the model very quickly, which may necessitate training on a GPU cluster, although initially I expect to get away with a single large multi-GPU instance.

I’m looking for guidance/links to examples on:

  1. where to store my data
  2. what format to store it in
  3. how to best feed my network

My research suggests RecordIO is the best-practice storage format. This thread agrees; however, I’ve seen other threads mention using CSV iterators or NumPy memory maps. Furthermore, every use case I see is with images only.


#2

Here are some ideas for each of your questions.

  1. where to store my data?
    This depends on how large the data is. If it fits on the local disk of a large multi-GPU instance, that should be the preferred solution. Otherwise, you might want to consider S3 or another object storage service.

  2. what format to store it in?
    Like you said, for image data ImageRecordIO is a good idea. There may be similar compressed/optimized storage formats for text and sequential data, but you need to factor in the cost of converting from the format your data originally comes in. I think the best bet in terms of performance is to load data in parallel using multiprocessing, which you get (see the answer to question 3) by using a DataLoader and setting num_workers to the number of CPUs on the machine.

  3. how to best feed my network?
    You should probably use gluon.data.Dataset and gluon.data.DataLoader. See the tutorial here for more details.
    You can either define a custom dataset that extends gluon.data.Dataset or use one of the provided datasets, such as gluon.data.vision.datasets.ImageFolderDataset for raw images or gluon.data.vision.datasets.ImageRecordDataset for ImageRecordIO files, and then define a DataLoader over the dataset you choose to feed your network. You could also implement a custom data loader directly, in which case you only need to implement an __iter__ function that yields a batch of data. For text and sequential data sets you can take a look at gluonnlp.data.SimpleDatasetStream.
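
To make the custom dataset and custom loader options above more concrete, here is a minimal sketch for tabular (label plus numeric features) data that does not fit in memory. It is written in plain NumPy so it runs without MXNet installed. The class names MemmapInteractionDataset and BatchLoader, the helper write_demo_file, and the file layout (fixed-width float32 rows with the label in column 0) are all assumptions made for illustration; none of them come from MXNet or from this thread. The dataset exposes the same __getitem__/__len__ interface that gluon.data.Dataset uses, and the loader is just an object with an __iter__ that yields batches.

```python
import os
import tempfile

import numpy as np


class MemmapInteractionDataset:
    """Random-access view over a binary file of fixed-width float32 rows.

    Hypothetical layout: each row is [label, feature_0, ..., feature_{n-1}].
    """

    def __init__(self, path, num_features):
        width = num_features + 1
        # mode="r" maps the file read-only; the OS pages data in on demand,
        # so the whole file never has to fit in RAM.
        flat = np.memmap(path, dtype=np.float32, mode="r")
        self._rows = flat.reshape(-1, width)

    def __len__(self):
        return self._rows.shape[0]

    def __getitem__(self, idx):
        row = self._rows[idx]
        # np.array copies the slice out of the memory map into a plain array.
        return np.array(row[1:]), float(row[0])


class BatchLoader:
    """Minimal loader: only __iter__ is needed, and it yields one batch at a time."""

    def __init__(self, dataset, batch_size, shuffle=False, seed=0):
        self._dataset = dataset
        self._batch_size = batch_size
        self._shuffle = shuffle
        self._rng = np.random.default_rng(seed)

    def __iter__(self):
        order = np.arange(len(self._dataset))
        if self._shuffle:
            self._rng.shuffle(order)
        for start in range(0, len(order), self._batch_size):
            chunk = order[start:start + self._batch_size]
            samples = [self._dataset[int(i)] for i in chunk]
            features = np.stack([f for f, _ in samples])
            labels = np.array([y for _, y in samples], dtype=np.float32)
            yield features, labels


def write_demo_file(path, num_rows, num_features, seed=0):
    """Write random demo rows in the assumed layout and return them."""
    rng = np.random.default_rng(seed)
    data = rng.random((num_rows, num_features + 1), dtype=np.float32)
    data[:, 0] = (data[:, 0] > 0.5).astype(np.float32)  # binary label
    data.tofile(path)
    return data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "interactions.bin")
        write_demo_file(demo_path, num_rows=10, num_features=3)
        demo_ds = MemmapInteractionDataset(demo_path, num_features=3)
        for batch_features, batch_labels in BatchLoader(demo_ds, batch_size=4):
            print(batch_features.shape, batch_labels.shape)
        del demo_ds  # release the memory map before the directory is removed
```

With MXNet installed, the same idea would normally be expressed by subclassing gluon.data.Dataset with these two methods and passing the result to gluon.data.DataLoader with batch_size and num_workers set, so that samples are read in parallel worker processes rather than in the single-threaded loop sketched here.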