Is there an example of a data iterator which does a lot of multithreaded preprocessing and builds a queue for each GPU?
If you’re using the Gluon API, you can set `num_workers` on the `DataLoader` to use multi-processing with any type of `Dataset`. You typically want to set this to the number of CPUs available for optimal performance, which you can find with `multiprocessing.cpu_count()`. All data loading and preprocessing (e.g. data augmentation) will be performed in parallel across the worker processes and automatically added to a queue to be sent to the GPUs. Check out the Gluon data tutorial for an example of this.
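The mechanics can be sketched with just the standard library. This is a simplified stand-in for illustration, not Gluon's actual implementation; `augment` and `load_batches` are made-up names:

```python
import multiprocessing

def augment(sample):
    # Stand-in for real preprocessing / data augmentation work.
    return sample * 2

def load_batches(dataset, batch_size):
    # Like DataLoader with num_workers set: preprocess every sample in
    # parallel across one process per CPU, then group results into batches.
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        processed = pool.map(augment, dataset)
    return [processed[i:i + batch_size]
            for i in range(0, len(processed), batch_size)]

if __name__ == "__main__":
    print(load_batches(list(range(8)), batch_size=4))
    # [[0, 2, 4, 6], [8, 10, 12, 14]]
```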
With the Module API, you can instead use multi-threading (as opposed to multi-processing) for data loading and augmentation via the `preprocess_threads` argument.
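The original question — multithreaded preprocessing feeding a queue per GPU — can be sketched with `threading` and `queue` from the standard library. The names and the round-robin assignment here are illustrative, not MXNet API:

```python
import queue
import threading

def start_pipeline(samples, num_gpus, num_threads=4, preprocess=lambda s: s * 2):
    # One output queue per GPU; preprocessing threads pull raw samples
    # from a shared input queue and round-robin results across GPUs.
    in_q = queue.Queue()
    gpu_queues = [queue.Queue() for _ in range(num_gpus)]
    for i, s in enumerate(samples):
        in_q.put((i, s))

    def worker():
        while True:
            try:
                i, s = in_q.get_nowait()
            except queue.Empty:
                return  # input exhausted
            gpu_queues[i % num_gpus].put(preprocess(s))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gpu_queues
```

Each sample lands in exactly one GPU's queue because `Queue.get_nowait()` hands any given item to exactly one thread.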
Yeah, that leads to my real question — how do you guarantee each batch is only sent once? What if two threads/processes call `next` at the same time? There’s no lock on the actual index update. My current solution is:

```python
with self.rlock:
    self.index += 1
```
`multiprocessing.Value` works as well.
I’m not exactly sure what code you’re looking at. In the `DataLoader` in Gluon, the main process creates batches of indices that are then passed to the worker processes. A worker process fetches a batch of indices and constructs the batch of data by reading the dataset at those indices. This is the code if you’re interested: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataloader.py#L215
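That scheme — the main process slices indices into batches, workers materialize the data — avoids any shared index counter entirely, because each batch of indices is handed to exactly one worker. A self-contained sketch, with threads standing in for the DataLoader's worker processes:

```python
import queue
import threading

def batchify(dataset, batch_size, num_workers=2):
    # Main process: create batches of *indices*, not data.
    index_batches = [list(range(i, min(i + batch_size, len(dataset))))
                     for i in range(0, len(dataset), batch_size)]
    work_q = queue.Queue()
    for bid, idx_batch in enumerate(index_batches):
        work_q.put((bid, idx_batch))
    out = {}

    def worker():
        # Worker: take a batch of indices, read the data at those
        # indices, and emit the assembled batch of data.
        while True:
            try:
                bid, idx_batch = work_q.get_nowait()
            except queue.Empty:
                return
            out[bid] = [dataset[i] for i in idx_batch]

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Reassemble batches in their original order.
    return [out[b] for b in sorted(out)]
```

Because `Queue.get_nowait()` pops each `(batch_id, indices)` item exactly once, no batch can be sent twice even with many workers.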