MNIST: DataLoader very slow compared to DataIter?

Some tutorials advise always using DataLoader over the older DataIter API.
I did some measurements using the MNIST dataset.
The code using DataIter is twice as fast as the DataLoader version.
So are the tutorials giving bad advice?
If I have to create/use a custom data source, should I prefer the DataIter approach?

DataIter version:

Epoch [1], Accuracy 0.7693 ~Samples/Sec 55691.5675
Epoch [2], Accuracy 0.9245 ~Samples/Sec 83681.7507
Epoch [3], Accuracy 0.9554 ~Samples/Sec 83172.9648
Epoch [4], Accuracy 0.9657 ~Samples/Sec 83173.5480
Epoch [5], Accuracy 0.9713 ~Samples/Sec 82785.9656
Epoch [6], Accuracy 0.9767 ~Samples/Sec 83346.3328
Epoch [7], Accuracy 0.9791 ~Samples/Sec 83410.4360
Epoch [8], Accuracy 0.9812 ~Samples/Sec 83240.0001
Epoch [9], Accuracy 0.9825 ~Samples/Sec 83247.2604
Epoch [10], Accuracy 0.9833 ~Samples/Sec 82463.6815
elapsed: 7.509
validation accuracy=0.988498

DataLoader version:

Epoch [0], Accuracy 0.8295 ~Samples/Sec 39718.0944
Epoch [1], Accuracy 0.9523 ~Samples/Sec 45847.0761
Epoch [2], Accuracy 0.9683 ~Samples/Sec 47455.3581
Epoch [3], Accuracy 0.9731 ~Samples/Sec 46995.5257
Epoch [4], Accuracy 0.9784 ~Samples/Sec 43869.2088
Epoch [5], Accuracy 0.9809 ~Samples/Sec 46773.8558
Epoch [6], Accuracy 0.9831 ~Samples/Sec 46672.8191
Epoch [7], Accuracy 0.9848 ~Samples/Sec 44058.3605
Epoch [8], Accuracy 0.9856 ~Samples/Sec 45939.9594
Epoch [9], Accuracy 0.9866 ~Samples/Sec 46450.3611
elapsed: 15.238
validation accuracy=0.988700


A data loader is generally expected to be slower than a data iterator.

We use a data iterator when the dataset is small enough to fit into the available memory (RAM or VRAM). The whole dataset is loaded into memory up front and is then simply iterated over in batches of the given size.
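The in-memory case can be sketched in a few lines of plain Python (a stand-in for illustration only; a real MXNet DataIter such as `mx.io.NDArrayIter` yields `DataBatch` objects rather than lists):

```python
# Minimal sketch of the "data iterator" idea: the whole dataset
# already sits in memory and is merely sliced into batches.
def iterate_in_memory(dataset, batch_size):
    """Yield consecutive batches from a dataset that is fully in memory."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

samples = list(range(10))  # pretend this is the preloaded dataset
print(list(iterate_in_memory(samples, batch_size=4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

No allocation happens per batch beyond the slice itself, which is why this path is fast.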

When the dataset is too big to be loaded completely into memory, we are forced to use a data loader, which doesn’t try to hold the whole dataset in memory. Instead it loads only the current batch, releasing the previous batch from memory before loading the next one.
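The on-demand case looks like this (again a hedged plain-Python sketch; `load_shard` is a hypothetical stand-in for reading one chunk of samples from disk):

```python
# Sketch of the "data loader" idea: only one batch lives in memory
# at a time; each batch is loaded from storage on demand.
def load_shard(index):
    """Hypothetical stand-in: read one chunk of samples from disk."""
    return list(range(index * 4, index * 4 + 4))

def iterate_on_demand(num_shards):
    for i in range(num_shards):
        batch = load_shard(i)  # allocate: load the current batch
        yield batch            # consumer trains on it
        del batch              # release it before loading the next one

for batch in iterate_on_demand(3):
    print(batch)
```

The repeated load/release cycle in the loop body is exactly the allocation overhead described below.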

Doing this makes a data loader slower than a data iterator, because the data loader has to continuously allocate and deallocate memory.

But couldn’t this be done by a custom DataIter too (loading the data on demand)?

DataLoader itself creates and returns a _MultiWorkerIter, which implements some kind of parallel batch creation from the dataset.
But if the next batches are created in parallel with the consumption of the current batch, the performance shouldn’t be that much worse.
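The prefetching idea behind that kind of worker iterator can be sketched with a thread and a queue (a minimal stdlib illustration, not MXNet’s actual implementation, which uses worker processes):

```python
# Sketch of prefetching: a background thread prepares upcoming
# batches while the main thread consumes the current one, so the
# loading cost is hidden behind the training step.
import queue
import threading

def prefetch(batch_source, maxsize=2):
    """Wrap an iterable so its batches are produced in a background thread."""
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()  # marks the end of the stream

    def worker():
        for batch in batch_source:
            q.put(batch)
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

batches = ([i, i + 1] for i in range(0, 6, 2))
print(list(prefetch(batches)))  # [[0, 1], [2, 3], [4, 5]]
```

When the per-batch loading cost is fully overlapped with computation, the throughput gap should shrink; the measurements above suggest the overlap is not free in practice.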

The data I’m working on consists of 200,000 RecordIO files (240 GB) containing numpy arrays (offline processing). The number of arrays differs from file to file.
I’ve implemented an iterator (DataIter) that extracts and batches the numpy arrays from the RecordIO files in parallel.
The iterator is capable of saturating both GPUs to 93%.
But I’m searching for a more efficient solution that is ‘more’ MXNet compliant.
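For reference, the skeleton of such a file-backed iterator might look like the following. This is a plain-Python sketch of the DataIter contract (reset plus iteration); `read_record_file` is a hypothetical stand-in for decoding one RecordIO file, and a real implementation would subclass `mx.io.DataIter` and return `mx.io.DataBatch` objects:

```python
# Plain-Python sketch of a file-backed iterator that buffers decoded
# arrays and re-batches them, since the number of arrays per file varies.
def read_record_file(path):
    """Hypothetical stand-in: decode one RecordIO file into a list of arrays."""
    return [[0.0] * 4]  # placeholder payload

class FileBackedIter:
    def __init__(self, paths, batch_size):
        self.paths = paths
        self.batch_size = batch_size
        self.reset()

    def reset(self):
        self._pending = []   # arrays decoded but not yet emitted
        self._file_idx = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Refill the buffer until a full batch is ready or files run out.
        while (len(self._pending) < self.batch_size
               and self._file_idx < len(self.paths)):
            self._pending.extend(read_record_file(self.paths[self._file_idx]))
            self._file_idx += 1
        if not self._pending:
            raise StopIteration
        batch = self._pending[:self.batch_size]
        self._pending = self._pending[self.batch_size:]
        return batch

it = FileBackedIter(["a.rec", "b.rec", "c.rec"], batch_size=2)
print([len(b) for b in it])  # [2, 1] with the placeholder loader
```

The buffering step is what absorbs the varying number of arrays per file; parallel decoding would plug in where `read_record_file` is called.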