Saving a DataLoader (need to persist batched data)


I have a task with fairly complex data preprocessing (using a GluonNLP bucketer). It takes an hour to set up the whole DataLoader.

Is there an easy way to save the output of a DataLoader?
What’s the simplest way to persist batched data?


You can save NDArrays using mx.nd.save (and read them back with mx.nd.load), but I don’t see this done very often.

Usually the time spent training the network far outweighs the time spent on data preprocessing, so it is sufficient to generate batches on the fly. Are you making use of num_workers on the DataLoader to parallelise the creation of batches?

If this really isn’t an option (and you have sufficient memory), you could concatenate all of the processed batches and save the result to disk, then load it back in later and wrap it in an ArrayDataset.