Contradiction in .rec documentation


Hi, there are two contradicting statements in the recordIO documentation

  1. “Do the packing once. We don’t want to repack data every time run-time settings, like the number of machines, are changed”
  2. “We don’t need to consider distributed loading issue at the preparation time, just select the most efficient physical file number according to the dataset size and computing resources available.”

Consequently, proper usage of .rec is not clear: how config-specific should a .rec dataset be? Should we treat the number of files as a hyperparameter and cross-validate it every time run-time settings change?

Related question: Batch formation from .rec files



Hi @olivcruche

Ultimately you’d write your .recs once rather than once for each k-fold cross-validation split.

I believe you’re asking a question similar to one where someone suggested using chunking in combination with k-fold cross-validation. Note that I don’t believe using chunking this way is correct, as chunking is a performance parameter for pre-loading (for example, it’s limited to between 4 MB and 4096 MB for ImageRecordIO). Instead, it sounds like you’d like to implement a k-fold iterator like GroupKFold in sklearn.

I see several options, in theory:

  1. Re-create your RecordIO files for each split (not recommended).
  2. Create several ImageRecordIO iterators, one per split (these are not guaranteed to be non-overlapping), as described in issue 1252.
  3. Do something non-overlapping like the GroupKFold example.
  4. Use a random-access RecordIO interface like MXIndexedRecordIO together with KFold from sklearn, which returns the partitions.
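For the MXIndexedRecordIO option, here is a minimal sketch of the partitioning logic in plain Python. The `kfold_splits` helper and the toy dict are my own illustration, not MXNet code; with a real dataset you would open the file via `mx.recordio.MXIndexedRecordIO` (or use `sklearn.model_selection.KFold`) and fetch individual records by key with `read_idx`:

```python
# Sketch: k-fold partitioning over record indices, as one would do with
# MXIndexedRecordIO, whose keys allow random access to individual records.
# A plain dict stands in for the indexed .rec/.idx file pair here.

def kfold_splits(keys, k):
    """Yield (train_keys, val_keys) pairs; folds are contiguous chunks."""
    n = len(keys)
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = keys[start:start + size]
        train = keys[:start] + keys[start + size:]
        yield train, val
        start += size

# Toy "indexed record file": key -> record payload
records = {i: f"record-{i}" for i in range(10)}
keys = sorted(records)

for train_keys, val_keys in kfold_splits(keys, k=3):
    # With MXIndexedRecordIO you would call record.read_idx(key) here
    val_batch = [records[key] for key in val_keys]
```

The point is that the .rec file is written once; only the key partitions change between folds, so no repacking is needed.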



Thanks Vishaal. I’m not specifically interested in k-fold CV; I’m just wondering how one would decide how many .rec files (and of what size) a dataset should be split into.


I misunderstood the question. I thought you were asking whether you would re-split for every k-fold cross-validation run.

It’s difficult to make a very specific recommendation, as ultimately you’re dealing with a hyper-(hyper)parameter, as you mention. But see below:

The main benefit of splitting into multiple files is for distributed training, so that you can read data in parallel. If you have n workers, splitting your data into k·n files with k=1 is a reasonable starting point. You may treat k as a hyperparameter if you’re dealing with very large files, because there are OS and hardware constraints and potentially speed limitations when using huge files. Additionally, transferring a single 10 GB file to S3 may be slower than transferring ten 1 GB files in parallel. In the other direction, many small files will not allow for contiguous reads and will make inefficient use of your hard drive (e.g. not filling your page size, long seek distances on platter drives).
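As a rough illustration of that trade-off, here is a sketch of the "smallest multiple of the worker count that keeps files under a target size" heuristic. The 1 GB target and the helper name are my own assumptions for illustration, not an MXNet recommendation:

```python
# Illustrative heuristic (not an official MXNet recommendation): pick the
# number of .rec files as a multiple k of the n workers, growing k only
# when individual files would otherwise exceed a target size.

def num_rec_files(dataset_bytes, n_workers, target_file_bytes=1 << 30):
    """Smallest k * n_workers such that each file stays under target size."""
    k = 1
    while dataset_bytes / (k * n_workers) > target_file_bytes:
        k += 1
    return k * n_workers

# 10 GB dataset, 4 workers, ~1 GB target per file -> 12 files (k = 3)
print(num_rec_files(10 * (1 << 30), 4))
```

With this scheme the file count only needs to change when the worker count or dataset size changes drastically, which is consistent with packing the data once.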