Batch formation from .rec files


#1

Couple .rec questions:

  1. When forming batches from one .rec file with an mx.io iterator or a gluon DataLoader, are records randomly sampled from any location in the .rec file? If so, how is this done efficiently, e.g. when the .rec file is several dozen GB?
  2. When forming batches from several .rec files with an io iterator or a gluon DataLoader, are records randomly sampled from any location in any of the files? Or are the first batches sampled from the first file, the next batches from the next file, and so on?

#2

There are logical partitions (“chunks”), and the shuffle_chunk_size argument specifies the size of each chunk used during shuffling. It defaults to 64 MB in ImageRecordIOParser. This chunking also allows pre-fetching to occur: see the data loading architecture doc, which describes this in detail and discusses threaded iterators, which have queues and their own threads to pre-fetch ahead of time.
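To make the behaviour concrete, here is a minimal pure-Python sketch of chunk-based shuffling: records are read sequentially in fixed-size chunks, and shuffling happens only *within* each chunk before batches are emitted. This is an illustration of the idea, not MXNet's actual C++ implementation; the function name and parameters are hypothetical.

```python
import random

def chunk_shuffled_batches(records, chunk_size, batch_size, seed=0):
    """Yield batches where shuffling is restricted to fixed-size chunks.

    Records are consumed sequentially chunk by chunk; only the records
    inside the current chunk are permuted. This is why a large .rec file
    can be shuffled without loading it all into memory at once.
    """
    rng = random.Random(seed)
    buffered = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        rng.shuffle(chunk)          # shuffle inside this chunk only
        buffered.extend(chunk)
        while len(buffered) >= batch_size:
            yield buffered[:batch_size]
            buffered = buffered[batch_size:]
    if buffered:                    # trailing partial batch
        yield buffered

batches = list(chunk_shuffled_batches(list(range(100)),
                                      chunk_size=20, batch_size=8))
```

Note that with a chunk size of 20, the first two batches can only contain records 0–19: records never move across chunk boundaries, which is the memory/randomness trade-off the chunk size controls.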

The splitting is done via an InputSplit (also covered in the data loading doc) and a split can logically span multiple files, as shown in the diagram there.
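The following is a small sketch of that idea (function name and return format are my own, for illustration): the files are treated as one logical byte stream, the stream is cut into fixed-size splits, and each split is mapped back onto (file index, local start, local end) segments, so a single split can cover the tail of one file and the head of the next.

```python
def input_splits(file_sizes, chunk_bytes):
    """Partition the logical concatenation of several files into
    fixed-size splits; a split may span a file boundary.

    Returns one list of (file_idx, local_start, local_end) segments
    per split. Illustrative sketch of the InputSplit idea only.
    """
    offsets, pos = [], 0
    for size in file_sizes:
        offsets.append((pos, pos + size))   # global byte range of each file
        pos += size
    total, splits, start = pos, [], 0
    while start < total:
        end = min(start + chunk_bytes, total)
        segments = []
        for idx, (fstart, fend) in enumerate(offsets):
            lo, hi = max(start, fstart), min(end, fend)
            if lo < hi:                     # this file overlaps the split
                segments.append((idx, lo - fstart, hi - fstart))
        splits.append(segments)
        start = end
    return splits

# Two files of 25 and 15 bytes, 10-byte splits: the third split spans
# the boundary between file 0 and file 1.
print(input_splits([25, 15], 10))
# → [[(0, 0, 10)], [(0, 10, 20)], [(0, 20, 25), (1, 0, 5)], [(1, 5, 15)]]
```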

Hope that helps!

Vishaal


#3

What exactly is the shuffle capability over a dataset of .rec files? Is the shuffle-by-chunk you mention (1) shuffling the file order without shuffling in-file records, (2) shuffling records within files without shuffling the file read order, or (3) shuffling both the file order and the in-file record order?


#4

In (2), replace “file” with “part”.

The files are amalgamated and logically partitioned into parts whose size is defined by the chunk size: if the chunk size is 10 MB, the parts are 10 MB each. A part may start or end mid-instance, as in the picture.

The parts are read sequentially, but within each part the images are shuffled. I imagine there is some careful bookkeeping to handle the partial instances at the part boundaries.
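One plausible way to do that bookkeeping (purely my guess, not MXNet's actual code) is to say that each record is owned by the part in which it *starts*: a reader assigned part k skips any leading partial record and reads past the part's end to finish its last record.

```python
def assign_records_to_parts(record_offsets, total_bytes, part_bytes):
    """Assign each record to the part containing its starting byte.

    record_offsets: starting byte offset of each record in the stream.
    Returns a list of record-id lists, one per part. A record that
    straddles a part boundary belongs entirely to the part it starts in.
    Hypothetical sketch of the partial-instance bookkeeping.
    """
    n_parts = -(-total_bytes // part_bytes)   # ceiling division
    parts = [[] for _ in range(n_parts)]
    for rec_id, start in enumerate(record_offsets):
        parts[start // part_bytes].append(rec_id)
    return parts

# Records starting at these byte offsets, 10-byte parts over 40 bytes:
# record 2 (offset 18) straddles the 20-byte boundary but is owned
# entirely by part 1.
print(assign_records_to_parts([0, 7, 18, 26, 33], 40, 10))
# → [[0, 1], [2], [3], [4]]
```

With this rule every record is read exactly once, and only within-part shuffling is needed, which matches the sequential-parts behaviour described above.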

Vishaal