Handling data too big to fit in memory - what is MXNet's Keras generator analogue?


#1

I’m doing training where my dataset won’t fit in a single machine’s memory.

When I had a similar problem in Keras I just used generators, for example the ones from keras.image.

What is the analogue in MXNet? I looked into the documentation on IO and on Gluon, but I didn’t find anything (or maybe I’m wrong; for example, datasets seem to have a defined length and a getter, so it looks like they’re stored in memory).


#2

I think you want mxnet.recordio.MXIndexedRecordIO.

It lets you randomly pull individual records from a file into a batch, rather than pulling in the whole data file at once. It uses two files to do this: an index file and the data file.

There are other similar iterators too, but I think that is the most flexible.
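
As a rough sketch (assuming the records were packed with mx.recordio.pack_img, and using placeholder file names data.idx / data.rec), reading a single record by its key looks like this:

import mxnet as mx

# Open an existing indexed RecordIO pair for reading.
# 'data.idx' maps record keys to byte offsets inside 'data.rec'.
record = mx.recordio.MXIndexedRecordIO('data.idx', 'data.rec', 'r')

# Pull a single record by key without touching the rest of the file
item = record.read_idx(0)

# unpack_img recovers the header (label, id) and the decoded image
header, img = mx.recordio.unpack_img(item)
print(header.label, img.shape)

record.close()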


#4

You need gluon.data.DataLoader and gluon.data.Dataset. Some tutorials are: 1, 2.

For example, in my case for semantic segmentation problems, I have a bunch of imgs-UniqueID.npy and imgs-UniqueID-mask.npy files in a directory with the subdirectories training/imgs, training/masks, validation/imgs, validation/masks, and so on for test (see code below). I then use the following Dataset wrapper to load a single image:


import os
import numpy as np

from mxnet.gluon.data import dataset


class SemSegDataset(dataset.Dataset):
    """
    Usage: the user needs to provide a root directory that has the following structure: 

        root:
            training:
                imgs/
                masks/
            validation:
                imgs/
                masks/
            test:
                imgs/
                masks/


    Each of the corresponding imgs/ and masks/ directories must contain images (numpy format *.npy) where the mask has the same name component as the corresponding image. 
    E.g. img1 = 'img-2345-sdgh.npy'
         mask1= 'img-2345-sdgh-mask.npy'

    This is necessary so that the sorted file lists that are constructed have the correct correspondence between images and masks.
    """

    def __init__(self, root, mode='train', transform=None, norm=None):

        # Transform used for data augmentation
        self._mode = mode
        self._transform = transform
        self._norm = norm  # Normalization of img

        # Take into account how the root directory is entered
        if root[-1] == '/':
            self._root = root
        else:
            self._root = root + '/'

        if self._mode == 'train':
            self._root_img = self._root + 'training/imgs/'
            self._root_mask = self._root + 'training/masks/'

        elif self._mode == 'val':
            self._root_img = self._root + 'validation/imgs/'
            self._root_mask = self._root + 'validation/masks/'

        elif self._mode == 'test':
            self._root_img = self._root + 'test/imgs/'
            self._root_mask = self._root + 'test/masks/'

        else:
            raise Exception("Inconsistent mode given, available choices: {'train', 'val', 'test'}")



        # Read images and masks list - sorted so they are in correspondence. 
        self._image_list = sorted(os.listdir(self._root_img))
        self._mask_list = sorted(os.listdir(self._root_mask))


        assert len(self._image_list) == len(self._mask_list), "Number of images and masks differ"


    def __getitem__(self, idx):

        base_filepath = os.path.join(self._root_img, self._image_list[idx])
        mask_filepath = os.path.join(self._root_mask, self._mask_list[idx])

        # Load image and mask from disk in float32
        base = np.load(base_filepath).astype(np.float32)
        mask = np.load(mask_filepath).astype(np.float32)

        # Optional data augmentation
        if self._transform is not None:
            base, mask = self._transform(base, mask)

        # Optional normalization of the image
        if self._norm is not None:
            base = self._norm(base.astype(np.float32))

        return base.astype(np.float32), mask.astype(np.float32)

    def __len__(self):
        return len(self._image_list)

This is a bit more involved because the user can (optionally) provide a normalization function applied to each image (e.g. standardization) and a transform for data augmentation (see the Gluon tutorial). You can then use this dataset with a gluon.data.DataLoader in a for loop to train your network in the following way.

# This is how you define it, with optional normalization and augmentation functions.

from mxnet import gluon

Nbatch = 32

# tnorm = ISPRSNormal()           # Some normalization function
tnorm = None
# ttransform = SemSegAugmentor()  # Some data augmentation function
ttransform = None

root = r'/home/foivos/Data/'
dataset = SemSegDataset(root, mode='train', norm=tnorm, transform=ttransform)
datagen = gluon.data.DataLoader(dataset, batch_size=Nbatch, last_batch='rollover',
                                shuffle=True, num_workers=8)

and this is an example of a for loop that uses it:

for i, data in enumerate(datagen):
    imgs, masks = data
    # do stuff 

    break # stop the iteration after the first batch is loaded

Hope this helps. By the way, I don’t think numpy arrays are the most efficient way to go, but it works for me (I haven’t done thorough code profiling).
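
If you do want to move away from loose .npy files, a minimal sketch of packing arrays into the indexed RecordIO format mentioned in #2 could look like the following (the file names arrays.idx / arrays.rec, the array shape and the label values are just placeholders):

import numpy as np
import mxnet as mx

# Write each array as one record; the .idx file lets you read any record
# back later by its key without loading the whole .rec file.
record = mx.recordio.MXIndexedRecordIO('arrays.idx', 'arrays.rec', 'w')

for i in range(10):
    arr = np.random.rand(256, 256, 3).astype(np.float32)  # placeholder data
    header = mx.recordio.IRHeader(flag=0, label=0, id=i, id2=0)
    record.write_idx(i, mx.recordio.pack(header, arr.tobytes()))

record.close()

# Reading one record back: unpack returns the header and the raw bytes;
# you reshape them yourself, since RecordIO does not store the array shape.
record = mx.recordio.MXIndexedRecordIO('arrays.idx', 'arrays.rec', 'r')
header, raw = mx.recordio.unpack(record.read_idx(3))
arr = np.frombuffer(raw, dtype=np.float32).reshape(256, 256, 3)
record.close()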