Data format for learning on videos

Hi, I see here https://github.com/jay1204/st-resnet/blob/master/model/video_iter.py an MXNet video classification model (st-resnet) where the read_train_frames function reads from folders of pictures (frames_name = [f for f in sorted(os.listdir(video_path)) if f.endswith('.jpg')]). I have the following question:

Is it correct practice for video classification tasks to pre-process the data by converting videos to downsampled jpg frames? Or is the more common approach to have the dataset handle reading, downsampling, and extracting arrays directly from .mp4 or some other video format?

Anything that works for you @olivcruche
If you expand every frame of a video into a picture, that is going to end up using a lot of disk space. However, if you want to perform video classification by stacking the frames and applying, for example, 3D convolutions, you'll have very little overhead at training time.
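To illustrate the stacking step: once frames are decoded (from jpg files or otherwise), you typically stack them along a time axis into a single clip tensor that a 3D convolution can consume. A minimal sketch with NumPy, where `stack_frames` is a hypothetical helper (not part of the st-resnet repo):

```python
import numpy as np

def stack_frames(frames):
    """Stack per-frame (H, W, C) arrays into a single (C, T, H, W) clip,
    the channel-first layout commonly fed to a 3D convolution."""
    clip = np.stack(frames, axis=0)    # (T, H, W, C)
    return clip.transpose(3, 0, 1, 2)  # (C, T, H, W)

# e.g. 16 decoded RGB frames of size 112x112
frames = [np.zeros((112, 112, 3), dtype=np.uint8) for _ in range(16)]
clip = stack_frames(frames)
print(clip.shape)  # (3, 16, 112, 112)
```

The exact axis order (channel-first vs. channel-last) depends on the framework; MXNet's 3D convolution expects NCDHW, i.e. (batch, C, T, H, W) after adding a batch dimension.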

If you choose to create your own Dataset that handles file reading, decoding, and sampling, you are likely going to spend more CPU cycles on that, but you will save disk space. So if you can afford it CPU-wise, I would suggest decoding on the fly, so that you can play around with your decoding parameters without having to re-run the entire offline extraction.
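A rough sketch of what such on-the-fly loading could look like. The index-sampling helper is pure Python; the loader assumes OpenCV (`cv2`) is available and both function names are hypothetical, not from any particular library:

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the video,
    so most frames can be skipped instead of kept (temporal downsampling)."""
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def load_clip(video_path, num_samples=16):
    """Decode an .mp4 directly and keep only the sampled frames.

    Hypothetical sketch: trades CPU for disk, since nothing is
    extracted offline. Assumes OpenCV is installed.
    """
    import cv2
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = set(sample_frame_indices(total, num_samples))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx in wanted:
            frames.append(frame)  # BGR uint8 array, (H, W, 3)
        idx += 1
    cap.release()
    return frames

print(sample_frame_indices(100, 4))  # [0, 25, 50, 75]
```

Because the sampling parameters live in code rather than in a pre-extracted frame dump, changing `num_samples` or the resize target only requires re-running training, not re-extracting the dataset.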