Hi, there are two seemingly contradictory statements in the RecordIO documentation https://mxnet.incubator.apache.org/architecture/note_data_loading.html:
- “Do the packing once. We don’t want to repack data every time run-time settings, like the number of machines, are changed”
- “We don’t need to consider distributed loading issue at the preparation time, just select the most efficient physical file number according to the dataset size and computing resources available.”
Consequently, the proper usage of .rec files is not clear: how config-specific should a .rec dataset be? Should we treat the number of physical files as a hyperparameter and re-tune it every time the run-time settings (e.g. the number of machines) change?
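To make the ambiguity concrete, here is a minimal stdlib-only sketch (not the MXNet API; `pack` and `runtime_shard` are hypothetical names) of the one reading under which the two statements are compatible: the physical file count is fixed once at packing time, while the runtime loader re-shards the logical record stream across however many workers happen to exist.

```python
def pack(records, num_files):
    """Pack once: split records round-robin into `num_files` physical files.
    This choice is made at preparation time and never revisited."""
    return [records[i::num_files] for i in range(num_files)]

def runtime_shard(packed_files, num_workers, rank):
    """At run time, each worker reads its slice of the concatenated record
    stream, independently of how many physical files the pack produced."""
    stream = [rec for f in packed_files for rec in f]
    return stream[rank::num_workers]

records = list(range(10))
packed = pack(records, num_files=4)  # packing-time decision

# The same pack serves 2-worker and 3-worker runs without repacking:
for num_workers in (2, 3):
    shards = [runtime_shard(packed, num_workers, r) for r in range(num_workers)]
    assert sorted(rec for s in shards for rec in s) == records
```

Is this the intended model, i.e. the file count only matters for I/O efficiency, not for correctness under a given cluster size?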
Related question: Batch formation from .rec files