Is there a way to convert Dataset to NDArray?


#1

I’m interested in using a Gluon vision dataset with MXNet Module API.

I thought I’d need to convert the Dataset/DataLoader to an NDArray.
What is the fastest and cleanest way to do that? Are there any compatibility tools? From what I’ve seen, the documentation doesn’t mention making NDArrays from Datasets or DataLoaders.

Or maybe there is another way?


#2

If you have a Dataset, you can just use indexing to get NDArrays. Example:

# imageX and labelX are NDArrays (for some built-in datasets the label is a NumPy scalar instead)
image0 = dataset[0][0]
label0 = dataset[0][1]
image1 = dataset[1][0]
label1 = dataset[1][1]

If you have a DataLoader, you can iterate through it to get NDArrays, one batch at a time. Example:

for image, label in dataloader:
    # image and label are NDArrays
    ...
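
If you want everything in one array rather than batch by batch, the batches can be collected and concatenated along the batch axis. This is only a minimal sketch under some assumptions: it uses NumPy so it runs without MXNet, `fake_loader` is a toy stand-in for a real DataLoader, and `batches_to_arrays` is a made-up helper name. Real MXNet batches are NDArrays, which the helper converts with their `asnumpy()` method.

```python
import numpy as np

def batches_to_arrays(loader):
    """Concatenate all (data, label) batches of an iterable along axis 0."""
    data_parts, label_parts = [], []
    for data, label in loader:
        # MXNet NDArrays expose .asnumpy(); other array-likes go through np.asarray
        data_parts.append(data.asnumpy() if hasattr(data, "asnumpy") else np.asarray(data))
        label_parts.append(label.asnumpy() if hasattr(label, "asnumpy") else np.asarray(label))
    return np.concatenate(data_parts, axis=0), np.concatenate(label_parts, axis=0)

# Toy stand-in for a DataLoader: 3 batches, each with 2 fake 1x2x2 images and 2 labels
fake_loader = [(np.zeros((2, 1, 2, 2), dtype="float32"), np.array([0, 1])) for _ in range(3)]

images, labels = batches_to_arrays(fake_loader)
print(images.shape, labels.shape)
```

Keep in mind that this loads the whole dataset into memory, so it only makes sense for datasets that fit.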

#3

That doesn’t answer the question. I don’t just want an iterator; I want an object that conforms to NDArray, so that I can use it, for example, in the module’s fit method.


#4

Hi @lambdaofgod, what @indu has written is spot on but maybe the info below will help.

If you look at the code, Dataset and DataLoader are two different entities. Dataset is an abstract base class that you derive from to write your own object loader. DataLoader is a separate object that wraps a particular dataset and spits out batches of data as (input, label) pairs.

When training, you have pairs of “input” and “label”. In general these have dimensions (Nbatch, single_input_dim) and (Nbatch, single_label_dim), where Nbatch is the batch size and single_input_dim can be, for images, something like 5 channels x 128 x 128. So when you ask how to convert a Dataset to an NDArray, it is not clear what shape of NDArray you are after. Do you want both input and label in a single NDArray object? Or do you want them in two separate NDArrays? (It is easy to translate between the two.) Do you want them split into batches?

By convention (it really depends on how you implement it, though) a Dataset returns a single (input, label) pair per index. The code that does so lives in __getitem__, which works like list indexing:

from mxnet import gluon

class YourDataSet(gluon.data.Dataset):
    def __init__(self, some_arguments):
        # Do here what you need to load your parameters somehow.
        # Placeholder assumption: some_arguments is a list of
        # (img_name, label_name) filename pairs.
        self.img_list = [img for img, _ in some_arguments]        # list of img_names
        self.label_list = [label for _, label in some_arguments]  # corresponding label_names

    # These two methods need to be overridden in your implementation
    def __getitem__(self, idx):
        # Example implementation: return a single (input, label) pair
        return self.img_list[idx], self.label_list[idx]

    def __len__(self):
        # Example implementation: number of samples in the dataset
        return len(self.img_list)

So now, with your own YourDataSet class deriving from gluon.data.Dataset, if you want to translate it into two separate NDArrays you can just do

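from mxnet import nd

dataset = YourDataSet(some_arguments)
# This assumes __getitem__ returns numeric data (e.g. NumPy arrays), not filenames
inputs = nd.array([dataset[i][0] for i in range(len(dataset))])
labels = nd.array([dataset[i][1] for i in range(len(dataset))])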

and take it from there. I don’t think that DataLoader can be of use to you (but I may be wrong). I am not familiar enough with the Module API to say more about how to use these there.
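
For what it’s worth, here is the same idea wrapped in a small helper. Treat it as a sketch under stated assumptions rather than the one right way: it uses NumPy so it runs without MXNet, `toy_dataset` is a plain list standing in for a real Gluon dataset, and `to_numpy` and `dataset_to_arrays` are hypothetical helper names. Items that are MXNet NDArrays are converted with their `asnumpy()` method.

```python
import numpy as np

def to_numpy(x):
    """Convert an MXNet NDArray (has .asnumpy()) or any array-like to NumPy."""
    return x.asnumpy() if hasattr(x, "asnumpy") else np.asarray(x)

def dataset_to_arrays(dataset):
    """Stack every (input, label) pair of an indexable dataset into two arrays."""
    pairs = [dataset[i] for i in range(len(dataset))]
    inputs = np.stack([to_numpy(item) for item, _ in pairs])
    labels = np.stack([to_numpy(label) for _, label in pairs])
    return inputs, labels

# Toy stand-in for a Gluon dataset: 4 fake 1x2x2 "images" with integer labels
toy_dataset = [(np.full((1, 2, 2), i, dtype="float32"), i % 2) for i in range(4)]

inputs, labels = dataset_to_arrays(toy_dataset)
print(inputs.shape, labels.shape)
```

If MXNet is installed, arrays like these can be handed to `mx.io.NDArrayIter(inputs, labels, batch_size=...)`, which gives a data iterator of the kind that Module’s fit takes as training data. I have not tested that last step here.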

Hope this helps.