Scalable (parallel) RecordIO file creation?

performance
gluon-cv
#1

Hi dear MXNet experts,

What is the recommended way to convert a huge collection of images to recordio files? Can several instances of im2rec be used in parallel?

#2

You could set the --num-thread option, for instance:

python im2rec.py ./example_rec ./example/ --recursive --pass-through --pack-label --num-thread 8

This will spawn 8 sub-processes, each reading a different set of files and sending the data to a single writer process, which creates the final output file. This should significantly speed up the reading.
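To make the "many readers, one writer" idea concrete, here is a minimal sketch of the pattern. It is not the im2rec code: the names read_worker, write_worker and pack are made up for illustration, the "encoding" step is a placeholder, one shared input queue is assumed, and threads are used instead of sub-processes to keep the example short and portable.

```python
import queue
import threading


def read_worker(in_q, out_q):
    """Hypothetical reader: takes (index, path) items until it sees None."""
    while True:
        item = in_q.get()
        if item is None:  # sentinel: no more work for this reader
            break
        idx, path = item
        # Placeholder for reading and encoding the image at `path`.
        payload = f"encoded:{path}".encode()
        out_q.put((idx, payload))


def write_worker(out_q, total, sink):
    """Hypothetical writer: appends records to `sink` in index order."""
    pending = {}   # results that arrived ahead of their turn
    next_idx = 0   # index of the next record to write
    while next_idx < total:
        idx, payload = out_q.get()
        pending[idx] = payload
        while next_idx in pending:
            sink.append(pending.pop(next_idx))
            next_idx += 1


def pack(paths, num_readers=8):
    """Fan work out to several readers and fan results in to one writer."""
    in_q, out_q, sink = queue.Queue(), queue.Queue(), []
    readers = [
        threading.Thread(target=read_worker, args=(in_q, out_q))
        for _ in range(num_readers)
    ]
    writer = threading.Thread(
        target=write_worker, args=(out_q, len(paths), sink)
    )
    for t in readers + [writer]:
        t.start()
    for item in enumerate(paths):
        in_q.put(item)
    for _ in readers:
        in_q.put(None)  # one sentinel per reader
    for t in readers + [writer]:
        t.join()
    return sink
```

Calling pack(["a.jpg", "b.jpg"], num_readers=2) returns the placeholder records in input order. Since everything funnels through one writer, adding readers only helps until that writer becomes the bottleneck.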

#3

But is there no way to distribute the work across independent instances of im2rec?

#4

The only way right now is to create splits: the image-list file is logically split into nsplit parts, and each run packs one of them, so you need to define both nsplit and partid. Having said that, there is ongoing work to add new functionality to the im2rec tool, including a feature for writing multiple output files: https://cwiki.apache.org/confluence/display/MXNET/Image+Transforms+and+RecordIO+file+Creation
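As a rough illustration of what a logical split by position could look like, here is a small sketch. The helper names part_bounds and select_part are hypothetical, and the contiguous ceil-division rule is an assumption; the exact rule the tool uses is not stated in this thread.

```python
def part_bounds(num_items, nsplit, partid):
    """Return the [begin, end) range of list entries for one part.

    Assumption: parts are contiguous and sized by ceiling division,
    so the last part may be shorter than the others.
    """
    if nsplit < 1 or not 0 <= partid < nsplit:
        raise ValueError("need nsplit >= 1 and 0 <= partid < nsplit")
    step = (num_items + nsplit - 1) // nsplit  # ceil(num_items / nsplit)
    begin = min(partid * step, num_items)
    end = min(begin + step, num_items)
    return begin, end


def select_part(lines, nsplit, partid):
    """Pick the entries of the image list that belong to part `partid`."""
    begin, end = part_bounds(len(lines), nsplit, partid)
    return lines[begin:end]
```

With a scheme like this, each independent instance would get the same nsplit and a different partid, and would write its own output file. The parts do not overlap and together cover the whole list.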