Scalable (parallel) RecordIO file creation?

Hi dear MXNet experts,

What is the recommended way to convert a huge collection of images to recordio files? Can several instances of im2rec be used in parallel?

You could set the --num-thread option, for instance:

python im2rec.py ./example_rec ./example/ --recursive --pass-through --pack-label --num-thread 8

This will spawn 8 sub-processes, each reading a different set of files and sending the data to a single writer process, which creates the final output file. This should significantly speed up the reading.
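To make the "many readers, one writer" idea concrete, here is a minimal sketch of the pattern. It is not the im2rec code: it uses threads and a queue from the Python standard library instead of sub-processes, the names reader and pack are made up for illustration, and the length-prefixed output is a toy format, not RecordIO.

```python
import queue
import threading


def reader(indexed_paths, out_q):
    """Read this worker's share of files and pass the bytes to the writer."""
    try:
        for idx, path in indexed_paths:
            with open(path, "rb") as f:
                out_q.put((idx, f.read()))
    finally:
        out_q.put(None)  # sentinel: this reader is finished


def pack(paths, out_path, num_workers=4):
    """Fan in: num_workers readers feed one writer that appends to out_path."""
    indexed = list(enumerate(paths))
    out_q = queue.Queue(maxsize=64)  # bounded, so readers cannot run far ahead
    workers = [
        threading.Thread(target=reader, args=(indexed[i::num_workers], out_q))
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()

    written = []   # indices in the order they reached the output file
    finished = 0
    with open(out_path, "wb") as out:
        while finished < num_workers:
            item = out_q.get()
            if item is None:
                finished += 1
                continue
            idx, data = item
            # Toy framing (assumption): 4-byte length, then the payload.
            out.write(len(data).to_bytes(4, "little"))
            out.write(data)
            written.append(idx)
    for w in workers:
        w.join()
    return written
```

Only the writer touches the output file, so no locking is needed there. The trade-off is that the single writer can become the bottleneck once reading is fast enough.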

But is there no way to distribute the work across independent instances of im2rec?

The only way right now is to create splits: the image-list file is logically split into nsplit parts, and you need to define both nsplit and partid. That said, there is ongoing work to add new functionality to the im2rec tool, including a feature that lets multiple files be written: https://cwiki.apache.org/confluence/display/MXNET/Image+Transforms+and+RecordIO+file+Creation
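For illustration, a logical split by position could look roughly like the sketch below. The helper split_list is hypothetical and the exact rule im2rec uses to cut the list may differ. The point is only that each (nsplit, partid) pair selects one contiguous, non-overlapping slice of the image list.

```python
def split_list(lines, nsplit, partid):
    """Return the partid-th of nsplit contiguous slices of lines.

    Hypothetical helper; the boundaries used by im2rec itself may differ.
    """
    if nsplit < 1 or not 0 <= partid < nsplit:
        raise ValueError("need nsplit >= 1 and 0 <= partid < nsplit")
    n = len(lines)
    start = partid * n // nsplit
    end = (partid + 1) * n // nsplit
    return lines[start:end]


# Example: 10 list entries cut into 3 parts.
entries = ["img_%d.jpg" % i for i in range(10)]
parts = [split_list(entries, 3, p) for p in range(3)]
```

Every entry lands in exactly one part, so each part could be packed by a separate run, possibly on a different machine, at the cost of ending up with nsplit output files rather than one.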
