Dataset .transform mapping over tuples conventions

#1

From the Dataset documentation

Returns a new dataset with each sample transformed by the transformer function fn.

If that’s the case, why does

dummy_dataset = gluon.data.SimpleDataset([(1,2), (3,4), (5,6)])
dummy_dataset.transform(lambda src_tgt: src_tgt[0])[:1]

return (1, 2), whereas
dummy_dataset.transform(lambda src_tgt: src_tgt[0])[0]

fails with the following error?


TypeError Traceback (most recent call last)
in ()
----> 1 dummy_dataset.transform(lambda src_tgt: src_tgt[0])[0]

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/data/dataset.py in __getitem__(self, idx)
    126         item = self._data[idx]
    127         if isinstance(item, tuple):
--> 128             return self._fn(*item)
    129         return self._fn(item)
    130

TypeError: <lambda>() takes 1 positional argument but 2 were given

Also, somehow dummy_dataset.transform(lambda src, tgt: tgt)[0] works as expected (returns 2), but

dummy_dataset.transform(lambda src, tgt: tgt)[:1] fails:


TypeError Traceback (most recent call last)
in ()
----> 1 dummy_dataset.transform(lambda src, tgt: tgt)[:1]

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/gluon/data/dataset.py in __getitem__(self, idx)
    127         if isinstance(item, tuple):
    128             return self._fn(*item)
--> 129         return self._fn(item)
    130
    131

TypeError: <lambda>() missing 1 required positional argument: 'tgt'

#2

Hi, it’s me again :grinning:.
The code below works:
dummy_dataset = gluon.data.SimpleDataset([(1,2), (3,4), (5,6)])
dummy_dataset.transform(lambda src_tgt: src_tgt[0])[:][:1]

Should return (1,)
or
dummy_dataset = gluon.data.SimpleDataset([(1,2), (3,4), (5,6)])
dummy_dataset.transform(lambda src_tgt: src_tgt[0])[:][0]

Should return 1

#3

Hi @lambdaofgod,

So I think the confusion arises from the intended usage of a Dataset. It is used to retrieve a single sample at a time, rather than a range of indexes. So the following usage is correct:

transformed_dataset = dummy_dataset.transform(lambda src, tgt: tgt)
print(transformed_dataset[0])

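DataLoader is the class that consumes the Dataset, and it only ever retrieves one sample at a time when constructing a batch. As such, although dummy_dataset[:1] appears to work, it’s not intended to be used this way, and it breaks once transform is involved. As the traceback shows, __getitem__ unpacks the item into the transform function only when the item is a tuple. A single sample here is a tuple, so it gets unpacked into (src, tgt), but a slice of the list-backed data is a list, so the whole slice is passed as one argument.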
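To make the dispatch easier to see without installing MXNet, here is a minimal pure-Python sketch. The class name LazyTransform is made up for illustration; it only copies the three __getitem__ lines visible in the traceback above and is not the actual library class. It assumes the underlying data is a plain list of tuples, as in the original example.

```python
class LazyTransform:
    """Hypothetical stand-in that mirrors the __getitem__ lines from the traceback."""

    def __init__(self, data, fn):
        self._data = data
        self._fn = fn

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        item = self._data[idx]
        if isinstance(item, tuple):
            # A single sample is a tuple, so it is unpacked into fn's arguments.
            return self._fn(*item)
        # Anything else (for example the list produced by a slice) is passed whole.
        return self._fn(item)


samples = [(1, 2), (3, 4), (5, 6)]

# Intended usage: fn takes one argument per tuple field, and we index one sample.
by_fields = LazyTransform(samples, lambda src, tgt: tgt)
first_target = by_fields[0]

# Slicing hands fn the list [(1, 2)], so src_tgt[0] is the first *sample*.
by_tuple = LazyTransform(samples, lambda src_tgt: src_tgt[0])
sliced = by_tuple[:1]

# To get several transformed samples, index them one at a time.
first_two_targets = [by_fields[i] for i in range(2)]

print(first_target, sliced, first_two_targets)
```

With this stand-in, by_fields[0] gives 2, by_tuple[:1] gives (1, 2), and by_tuple[0] raises the same TypeError as in the first post, because the tuple (1, 2) is unpacked into a one-argument lambda.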

I’ve added more information on your other related question.

Dataset bizarre behavior after using .transform