Am I obliged to use .rec files to read images from S3?

#1

I find this a bit confusing.

Several places in the MXNet documentation say: "Any data iterator that can read/write data from a local drive can also read/write data from S3."

Probably I am missing something, but I guess this is true ONLY IF the data has been pre-packaged into a .rec file.

Reading raw JPGs from S3 does not work for me.

In an ideal world I would like to run the following:
data_iter = mxnet.gluon.data.vision.ImageFolderDataset("s3://my-bucket-containing-jpg-images")
which evidently does not work.

Do I have to convert my whole dataset into .rec and .lst files before streaming it from S3?
Is no raw format supported?

Thanks


#2

I think there's confusion here because data iterators (from the Module API) and datasets (from the Gluon API) are different things. As far as I can see, S3 support exists only in the Module API, but it's relatively easy to implement with the Gluon API.

Are you sure you want to read individual files from an S3 bucket to create each batch, though? I would expect this to get expensive given the number of requests to S3. It seems to me that the best practice would be to download the dataset from S3 once and then load it as usual, and to think about scaling up disk space if you're dealing with a really large dataset (e.g. on AWS, increase the EBS volume size).
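As a rough sketch of the "download once, then load as usual" approach: the helper below copies everything under a prefix to a local folder. The name `download_prefix` and the bucket and path names in the usage comment are made up for illustration; the `bucket` argument is assumed to behave like a boto3 `Bucket` resource (`objects.filter(Prefix=...)` and `download_file(key, filename)`).

```python
from pathlib import Path


def download_prefix(bucket, prefix, dest_dir):
    """Copy every object under `prefix` into `dest_dir`, keeping sub-folders.

    `bucket` is assumed to behave like a boto3 Bucket resource.
    Returns the number of files downloaded.
    """
    dest_dir = Path(dest_dir)
    # Treat 'train' and 'train/' the same; an empty prefix means the whole bucket.
    key_prefix = prefix.rstrip('/') + '/' if prefix else ''
    count = 0
    for summary in bucket.objects.filter(Prefix=key_prefix):
        # Skip zero-byte "folder" placeholder keys.
        if summary.key.endswith('/'):
            continue
        target = dest_dir / summary.key[len(key_prefix):]
        target.parent.mkdir(parents=True, exist_ok=True)
        bucket.download_file(summary.key, str(target))
        count += 1
    return count


# Hypothetical usage (needs AWS credentials; bucket and paths are placeholders):
#   import boto3
#   import mxnet as mx
#   bucket = boto3.resource('s3').Bucket('my-bucket')
#   download_prefix(bucket, 'train', './data/train')
#   dataset = mx.gluon.data.vision.ImageFolderDataset('./data/train')
```

This costs one GET per object in total, rather than one GET per sample on every epoch.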

But it's totally possible to create a custom Dataset to do this. ImageFolderDataset only supports local file systems (quite a few os calls feature in its implementation, e.g. os.listdir(path)), but switching these out for boto3 calls would give you something like:

import cv2
import boto3
import mxnet as mx
from pathlib import Path
import numpy as np


class S3ImageFolderDataset(mx.gluon.data.Dataset):
    def __init__(self, bucket_name, prefix):
        """
        Use same folder format as ImageFolderDataset:
        <prefix>/<class name>/<image file>
        """
        self._s3_bucket_name = bucket_name
        # normalise so that 'test/upload' and 'test/upload/' behave the same
        self._s3_prefix = prefix.rstrip('/') + '/' if prefix else ''
        self._s3 = boto3.resource('s3')
        self._s3_bucket = self._s3.Bucket(bucket_name)
        # list the keys once up front, stored relative to the prefix
        self._s3_objects = []
        for obj_summary in self._s3_bucket.objects.filter(Prefix=self._s3_prefix):
            # skip zero-byte "folder" placeholder keys
            if obj_summary.key.endswith('/'):
                continue
            self._s3_objects.append(obj_summary.key[len(self._s3_prefix):])
        # class names are the first path component, sorted for stable label indices
        self.synset = sorted(set(o.split('/')[0] for o in self._s3_objects))
        self._label_idx_map = {o: i for i, o in enumerate(self.synset)}

    def __getitem__(self, idx):
        s3_object = self._s3_objects[idx]
        s3_object_key = self._s3_prefix + s3_object
        # one GET request per sample
        obj = self._s3.Object(self._s3_bucket_name, s3_object_key)
        contents = obj.get()['Body'].read()
        data_arr = np.frombuffer(contents, dtype='uint8')
        # decode as stored; note that cv2 returns colour images in BGR order
        data = cv2.imdecode(data_arr, cv2.IMREAD_UNCHANGED)
        label = self._label_idx_map[s3_object.split('/')[0]]
        return data, label

    def __len__(self):
        return len(self._s3_objects)


dataset = S3ImageFolderDataset(bucket_name='test_bucket', prefix='test/upload')
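To check the key-to-label logic without touching S3, the same parsing can be pulled out into a plain function and run on a hand-written list of keys. This is only a sketch: `labels_from_keys` is a hypothetical helper, not part of MXNet or boto3, and the keys are made-up examples in the `<prefix>/<class>/<file>` layout.

```python
def labels_from_keys(keys, prefix):
    """Return (relative_paths, synset, labels) for keys laid out as <prefix>/<class>/<file>."""
    key_prefix = prefix.rstrip('/') + '/' if prefix else ''
    # Keep only real objects under the prefix, stored relative to it.
    relative = [k[len(key_prefix):] for k in keys
                if k.startswith(key_prefix) and not k.endswith('/')]
    # Class names are the first path component; sorting gives stable indices.
    synset = sorted(set(r.split('/')[0] for r in relative))
    label_idx = {name: i for i, name in enumerate(synset)}
    labels = [label_idx[r.split('/')[0]] for r in relative]
    return relative, synset, labels


keys = [
    'test/upload/class6/sample0.jpeg',
    'test/upload/class9/sample1.jpeg',
    'test/upload/class6/sample2.jpeg',
]
relative, synset, labels = labels_from_keys(keys, 'test/upload')
print(synset)   # ['class6', 'class9']
print(labels)   # [0, 1, 0]
```

Note that a label is the position of the class folder in the sorted list of folder names, which need not match any number that appears in the folder name itself.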

#3

Oh, and you might also find these useful to test things out. I wrote some code to upload 100 samples of CIFAR10 to an S3 bucket in the format required by ImageFolderDataset.

import boto3
import cv2
import mxnet as mx
from pathlib import Path

# save files to local disk
samples = 100
dataset = mx.gluon.data.vision.CIFAR10()
for idx in range(samples):
    image, label = dataset[idx]
    filepath = Path('./test/upload/class{}/sample{}.jpeg'.format(label, idx))
    filepath.parent.mkdir(parents=True, exist_ok=True)
    # CIFAR10 images are RGB, but cv2.imwrite expects BGR channel order
    cv2.imwrite(str(filepath), cv2.cvtColor(image.asnumpy(), cv2.COLOR_RGB2BGR))

# upload files to S3, reusing the relative local path as the object key
folder = Path('./test/upload')
files = list(folder.glob('**/*.jpeg'))
s3 = boto3.resource('s3')
s3_bucket = s3.Bucket('test_bucket')
for file in files:
    # as_posix() keeps '/' separators in the key on any OS
    s3_bucket.upload_file(str(file), file.as_posix())

#4

You rock as usual man! Thanks a lot
