Batch formation from .rec files


#1

Couple .rec questions:

  1. When forming batches from one .rec file with an mx.io iterator or a gluon DataLoader, are records randomly sampled from any location in the .rec file? If so, how is this done efficiently, e.g. when the .rec file is several dozen GB?
  2. When forming batches from several .rec files with an io iterator or a gluon DataLoader, are records randomly sampled from any location in any of the files? Or are the first batches sampled from the first file, the next batches from the next file, and so on?

#2

There are logical partitions (“chunks”), and the shuffle_chunk_size argument specifies the size of each chunk used during shuffling. It defaults to 64 MB in ImageRecordIOParser. This chunking also allows pre-fetching to occur: see the data loading architecture doc, which describes this in detail and discusses threaded iterators, which have queues and their own threads to pre-fetch ahead of time.
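To make the behaviour concrete, here is a minimal pure-Python sketch of chunk-based shuffling: records are read sequentially in fixed-size chunks, and shuffling happens only *within* each chunk before batches are emitted. This is an illustration of the idea, not MXNet's actual C++ implementation; the function name and parameters are hypothetical.

```python
import random

def chunk_shuffled_batches(records, chunk_size, batch_size, seed=0):
    """Yield batches where shuffling is restricted to fixed-size chunks.

    Records are consumed sequentially chunk by chunk; only the records
    inside the current chunk are permuted. This is why a large .rec file
    can be shuffled without loading it all into memory at once.
    """
    rng = random.Random(seed)
    buffered = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        rng.shuffle(chunk)          # shuffle inside this chunk only
        buffered.extend(chunk)
        while len(buffered) >= batch_size:
            yield buffered[:batch_size]
            buffered = buffered[batch_size:]
    if buffered:                    # trailing partial batch
        yield buffered

batches = list(chunk_shuffled_batches(list(range(100)),
                                      chunk_size=20, batch_size=8))
```

Note that with a chunk size of 20, the first two batches can only contain records 0–19: records never move across chunk boundaries, which is the memory/randomness trade-off the chunk size controls.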

The splitting is done via an InputSplit (also covered in the data loading doc) and a split can logically span multiple files, as shown in the diagram there.
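The following is a small sketch of that idea (function name and return format are my own, for illustration): the files are treated as one logical byte stream, the stream is cut into fixed-size splits, and each split is mapped back onto (file index, local start, local end) segments, so a single split can cover the tail of one file and the head of the next.

```python
def input_splits(file_sizes, chunk_bytes):
    """Partition the logical concatenation of several files into
    fixed-size splits; a split may span a file boundary.

    Returns one list of (file_idx, local_start, local_end) segments
    per split. Illustrative sketch of the InputSplit idea only.
    """
    offsets, pos = [], 0
    for size in file_sizes:
        offsets.append((pos, pos + size))   # global byte range of each file
        pos += size
    total, splits, start = pos, [], 0
    while start < total:
        end = min(start + chunk_bytes, total)
        segments = []
        for idx, (fstart, fend) in enumerate(offsets):
            lo, hi = max(start, fstart), min(end, fend)
            if lo < hi:                     # this file overlaps the split
                segments.append((idx, lo - fstart, hi - fstart))
        splits.append(segments)
        start = end
    return splits

# Two files of 25 and 15 bytes, 10-byte splits: the third split spans
# the boundary between file 0 and file 1.
print(input_splits([25, 15], 10))
# → [[(0, 0, 10)], [(0, 10, 20)], [(0, 20, 25), (1, 0, 5)], [(1, 5, 15)]]
```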

Hope that helps!

Vishaal


#3

What exactly is the shuffle capability over a dataset of .rec files? Is the shuffle-by-chunk you mention (1) shuffling the file order without shuffling in-file records, (2) shuffling records within files without shuffling the file read order, or (3) shuffling both the file order and the in-file record order?


#4

In (2), replace “file” with “part”.

The files are amalgamated and logically partitioned into parts whose size is defined by the chunk size: if the chunk size is 10 MB, the parts are 10 MB each. A part may start or end mid-instance, as in the picture.

The parts are read sequentially, but within each part the images are shuffled. I imagine there is some careful bookkeeping to handle the partial instances at the part boundaries.
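One plausible way to do that bookkeeping (purely my guess, not MXNet's actual code) is to say that each record is owned by the part in which it *starts*: a reader assigned part k skips any leading partial record and reads past the part's end to finish its last record.

```python
def assign_records_to_parts(record_offsets, total_bytes, part_bytes):
    """Assign each record to the part containing its starting byte.

    record_offsets: starting byte offset of each record in the stream.
    Returns a list of record-id lists, one per part. A record that
    straddles a part boundary belongs entirely to the part it starts in.
    Hypothetical sketch of the partial-instance bookkeeping.
    """
    n_parts = -(-total_bytes // part_bytes)   # ceiling division
    parts = [[] for _ in range(n_parts)]
    for rec_id, start in enumerate(record_offsets):
        parts[start // part_bytes].append(rec_id)
    return parts

# Records starting at these byte offsets, 10-byte parts over 40 bytes:
# record 2 (offset 18) straddles the 20-byte boundary but is owned
# entirely by part 1.
print(assign_records_to_parts([0, 7, 18, 26, 33], 40, 10))
# → [[0, 1], [2], [3], [4]]
```

With this rule every record is read exactly once, and only within-part shuffling is needed, which matches the sequential-parts behaviour described above.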

Vishaal