Efficiently accessing arbitrary NDArray batches in MXNet

Fitting convnets such as ResNet and VGG benefits from the ImageRecordIter Python class, which allows efficient loading of batches from large collections of RGB images stored in RecordIO .rec files.

Does anybody know of equivalent facilities for large, arbitrary 2D or 3D input matrices (for 2D, rows = items and cols = features, plus channels in 3D)?

NDArrayIter requires loading the whole dataset into memory, which is to be avoided in my case (>40 GB data file). CSVIter does not allow straightforward shuffling and works only for 2D matrices.

You can develop your own data iterator. Check out this tutorial.

Without extensive details, let’s say my input data is generated by expansion out of some simpler (not image-y) data table.

I already created my own iterator by adapting SimpleIter (from https://mxnet.incubator.apache.org/tutorials/basic/data.html) to my needs. Its next method generates the expected data batch on the fly from the simpler table.

The problem is that this is highly inefficient: a learning algorithm using this iterator spends almost all of its time in the next method.

Building on mxnet.recordio.MXRecordIO (saving the expanded data to a .rec file, then loading it during training) does look like the way to go, as confirmed by your link. However, adapting it to a general NDArray context seems to require a good deal of implementation work, whereas multi-threaded, ready-to-use facilities are available for image collections.

So instead of reinventing the wheel right away, my question was rather whether I had missed an equivalent of im2rec.py + ImageRecordIter for general NDArrays, or whether DIY was indeed the only way.

I am not sure if this is the only way. When I had a similar problem, my approach was based on mxnet.recordio.MXRecordIO.