Guidance for big data loading with MXNet


I am designing a recommender system that will train on user-to-item implicit interaction data. The data is too large to fit in memory. The label is binary, and the initial features will be categorical and continuous; in the future, however, the network should also ingest images, text, sequential data, etc.

It is critical that I can train the model very quickly, which may necessitate training on a GPU cluster, although initially I expect to get away with a large multi-GPU instance.

I’m looking for guidance/links to examples on:

  1. where to store my data
  2. what format to store it in
  3. how to best feed my network

My research suggests RecordIO is the best-practice storage format. This thread agrees; however, I’ve seen other threads mention using CSV iterators or NumPy memory maps. Furthermore, every use case I’ve seen involves images only.
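
For concreteness, the NumPy memory-map option could look something like the sketch below. It uses only NumPy, not MXNet. The file name, the column layout (feature columns followed by one binary label column), and the helper names `write_memmap` and `iter_batches` are all assumptions made for illustration. The idea is that only one mini-batch is copied into RAM at a time, while the full table stays on disk.

```python
import os
import tempfile

import numpy as np


def write_memmap(path, n_rows, n_features, seed=0):
    """Write a toy float32 table to disk: feature columns, then a binary label column.

    Hypothetical helper; a real dataset would be written in chunks.
    """
    rng = np.random.default_rng(seed)
    table = np.lib.format.open_memmap(
        path, mode="w+", dtype=np.float32, shape=(n_rows, n_features + 1)
    )
    table[:, :n_features] = rng.random((n_rows, n_features), dtype=np.float32)
    table[:, n_features] = rng.integers(0, 2, size=n_rows)
    table.flush()
    del table  # release the memory map


def iter_batches(path, batch_size, seed=0):
    """Yield (features, labels) mini-batches without loading the whole file."""
    table = np.load(path, mmap_mode="r")  # memory-mapped, read-only
    starts = np.arange(0, table.shape[0], batch_size)
    # Shuffle the order of batches (not of rows) so each read stays contiguous.
    np.random.default_rng(seed).shuffle(starts)
    for start in starts:
        # np.array copies just this slice from disk into RAM.
        chunk = np.array(table[start:start + batch_size])
        yield chunk[:, :-1], chunk[:, -1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "interactions.npy")  # hypothetical file name
        write_memmap(path, n_rows=1000, n_features=8)
        for features, labels in iter_batches(path, batch_size=256):
            print(features.shape, labels.shape)
```

Each batch is a plain NumPy array, so it would still need to be converted to whatever the network expects. Shuffling whole batches rather than individual rows is a deliberate trade-off here: it keeps disk reads sequential at the cost of weaker randomisation.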