DataBatch index field, random shuffling and custom iterators

Hi all,

Topic: writing a useful custom iterator that supports shuffling and composite data types.

I see two ways: (A) write the iterator from scratch myself, or (B) embed ImageIter in my own iterator class.

(B) has one problem I noticed: the DataBatch returned by mx.image.ImageIter has DataBatch.index set to None. This happens whenever path_imglist is provided, whether or not I also provide path_imgidx.
Note: my path_imglist file contains the sample indices, so the information is there (and also in the path_imgidx file).

Is DataBatch.index = None the intended behaviour?

Why do I consider it a problem? Suppose I want to provide composite data, for example a batch of images together with a batch of some other data type, and I want to use ImageIter with shuffling for the images. Then I need DataBatch.index to know which images are in the current batch. Or have I overlooked something?
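
To make the requirement concrete, here is a minimal sketch in plain Python of what I would like to get. It does not use MXNet at all: `Batch` is a hypothetical stand-in for DataBatch, and `IndexedShuffleIter` is a made-up name for my own iterator. The point is only that the shuffled indices travel with each batch, so the second data stream can be looked up from them.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for mx.io.DataBatch: only the fields needed here.
Batch = namedtuple("Batch", ["data", "index"])


class IndexedShuffleIter:
    """Yields batches of (images, extras) and reports the sample indices."""

    def __init__(self, images, extras, batch_size, shuffle=True, seed=None):
        assert len(images) == len(extras)
        self.images = images
        self.extras = extras
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)
        self.order = list(range(len(images)))
        self.reset()

    def reset(self):
        # Shuffle a permutation of indices, never the data itself.
        if self.shuffle:
            self.rng.shuffle(self.order)
        self.cursor = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.cursor + self.batch_size > len(self.order):
            raise StopIteration  # the incomplete last batch is dropped
        idx = self.order[self.cursor:self.cursor + self.batch_size]
        self.cursor += self.batch_size
        images = [self.images[i] for i in idx]
        extras = [self.extras[i] for i in idx]
        return Batch(data=[images, extras], index=idx)


images = ["img%d" % i for i in range(6)]       # placeholders for decoded images
extras = ["caption %d" % i for i in range(6)]  # any other per-sample data
batches = list(IndexedShuffleIter(images, extras, batch_size=2, seed=0))
```

With ImageIter embedded in (B), the image half of this would come from ImageIter, and that is exactly where I would need batch.index to be filled in.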

The doc says DataBatch.index should be a numpy array. Is that right: a numpy array or an mx.nd.NDArray? It's not a big deal though.

(A) works, but then I have to do all the image augmentations myself :frowning: (and it could be slow if I use OpenCV or PIL manually).
That's what I would love to avoid, even though I have already coded it.

Some other observations:
I. The type of DataBatch.data:

Am I misreading the following?
data : list of NDArray, each array containing batch_size examples.
A list of input data.

The doc says it should be a list of NDArray (mx.nd.array). Is that really a requirement on the MXNet side?
Currently mx.nd.array does not support the np.unicode_ type, which matters if one wants to provide strings as data.

There is a workaround: in my custom iterator,
DataBatch.data[1] can be a Python list, and nothing has complained so far (but I have not tried transfer learning yet).

II. The SimpleIter example in https://mxnet.incubator.apache.org/tutorials/basic/data.html confuses me:

It uses zip for _provide_data, which I thought creates (an iterator of) plain tuples. However, provide_data is expected to hold DataDesc objects, and DataDesc is derived from namedtuple. Any method that expects something from DataDesc that a plain tuple lacks may fail.
Am I imagining things, or would it be better to use DataDesc objects in that example?
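
Here is the difference I mean, reduced to plain Python. `Desc` is a hypothetical two-field stand-in for DataDesc (I am not claiming the real class has only these fields); the sketch only shows what a namedtuple offers beyond a plain tuple, plus the fact that a bare zip in Python 3 can be consumed only once.

```python
from collections import namedtuple

# Hypothetical stand-in for mx.io.DataDesc, reduced to two fields.
Desc = namedtuple("Desc", ["name", "shape"])

data_names = ["data"]
data_shapes = [(4, 3, 32, 32)]

# Tutorial-style description: plain tuples produced by zip.
plain = list(zip(data_names, data_shapes))
# Namedtuple-based description of the same information.
described = [Desc(n, s) for n, s in zip(data_names, data_shapes)]

# Both unpack the same way ...
name, shape = plain[0]
# ... but only the namedtuple supports attribute access such as `.shape`.
has_attr_plain = hasattr(plain[0], "shape")
has_attr_desc = hasattr(described[0], "shape")

# In Python 3 a bare zip is a one-shot iterator: a second pass sees nothing.
lazy = zip(data_names, data_shapes)
first_pass = list(lazy)
second_pass = list(lazy)
```

So code that only unpacks (name, shape) pairs works with either form, while code that reads attributes, or that iterates over provide_data twice, would only work with the list of namedtuples.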

III. The ImageIter doc could link to image.CreateAugmenter :slight_smile:
For aug_list=None there seems to be no link explaining what aug_list can be.

IV. (minor) DataDesc description [link removed, new users can post only 2 links]

cls (DataDesc) – The class. … This part confuses me; it seems to work without it.

V. Data augmentation with image.CreateAugmenter(data_shape, resize=0, rand_crop=False, rand_resize=False, rand_mirror=False, mean=None, std=None, brightness=0, contrast=0, saturation=0, hue=0, pca_noise=0, rand_gray=0, inter_method=2)

How this works is not clear from its doc. Can you turn on resize to one size and then rand_crop to another size?
That would require two size parameters, while data_shape allows only one, but maybe I am wrong here.
In general, the relationship between multiple augmentations and data_shape is not clear to me.
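
Here is my current guess written out as plain Python arithmetic, in case someone can confirm or correct it. I am assuming (not verified) that resize scales the shorter edge to the given size and that the crop then takes its width and height from data_shape, which would give the two sizes after all. `shorter_edge_resize` and `center_crop_box` are hypothetical helpers of mine, not MXNet functions, and a random crop would only change the box offset.

```python
def shorter_edge_resize(width, height, size):
    """Scale so the shorter edge becomes `size`, keeping the aspect ratio."""
    if height > width:
        return size, size * height // width
    return size * width // height, size


def center_crop_box(width, height, crop_w, crop_h):
    """Return (x0, y0, w, h) of a centred crop; assumes the crop fits."""
    assert crop_w <= width and crop_h <= height
    return (width - crop_w) // 2, (height - crop_h) // 2, crop_w, crop_h


# Assumed settings: resize=256 and data_shape=(3, 224, 224) on a 640x480 image.
data_shape = (3, 224, 224)
resized = shorter_edge_resize(640, 480, 256)
box = center_crop_box(resized[0], resized[1], data_shape[2], data_shape[1])
```

Under this reading the image first becomes 341x256 and a 224x224 window is then cut out of it.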

Some more info on data augmentation would be helpful (e.g. in a tutorial).

VI. Strange error when trying to use data augmentation:

augs = mx.image.CreateDetAugmenter(data_shape=(3, 300, 300), rand_crop=0.5, rand_mirror=True, brightness=0.125, contrast=0.125, saturation=0.125)

imgiter = mx.image.ImageIter(batch_size=5, data_shape=(3, 300, 300), label_width=2, path_imgrec=None, path_imglist=imglist, path_root=path_root, path_imgidx=indexlist, shuffle=False, part_index=0, num_parts=1, aug_list=augs, imglist=None, data_name='data', label_name='softmax_label')

python mxiter_py [modified to avoid link filter]
Traceback (most recent call last):
  File "mxiter_py", line 260, in <module>
    tester2()
  File "mxiter_py", line 245, in tester2
    b=imgiter.next()
  File "…", line 1181, in next
    data = self.augmentation_transform(data)
  File "…", line 1239, in augmentation_transform
    data = aug(data)
TypeError: __call__() missing 1 required positional argument: 'label'
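
My guess at the cause, reduced to plain Python: the traceback shows each augmenter being called with the image only (`aug(data)`), while the augmenters I passed seem to expect a label as a second argument. The two classes below are hypothetical stand-ins, not MXNet classes; they only reproduce the mismatch in call signatures.

```python
class ClassificationStyleAug:
    """Hypothetical augmenter that transforms the image only."""

    def __call__(self, src):
        return src


class DetectionStyleAug:
    """Hypothetical augmenter that transforms image and label together."""

    def __call__(self, src, label):
        return src, label


def augmentation_transform(data, aug_list):
    # Same call pattern as in the traceback: each augmenter gets the image only.
    for aug in aug_list:
        data = aug(data)
    return data


# An image-only augmenter fits this call pattern.
ok = augmentation_transform("image", [ClassificationStyleAug()])

# An augmenter that also wants a label does not.
try:
    augmentation_transform("image", [DetectionStyleAug()])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

If that is the cause, pairing mx.image.CreateAugmenter with ImageIter, or CreateDetAugmenter with ImageDetIter, would presumably avoid the error. I have not checked this yet.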

Just FYI, image augmenter operators, which will be easier to use, will come soon. You can track them here: https://github.com/apache/incubator-mxnet/issues/8556

Is DataBatch.index = None a bug, or is it intended even when the list file contains the sample indices?