Pad option does not work in mx.io.DataBatch


#1

I prepare each minibatch using ‘mx.io.DataBatch’ with the option pad=cur_pad.
‘cur_pad’ is calculated as ‘batch_size - len(cur_labels)’, which I expected to handle batches that are smaller than batch_size (e.g. the last batch). However, the batch is not automatically padded to the length of batch_size.

During training, I got an IndexError ("slicing stop 12 exceeds limit of 11") on the last minibatch, which contains 11 samples while batch_size=32.


#2

Hi @richard,

It looks to me like this argument is intended only as metadata: it indicates how much of the batch is padding and can be ignored when making predictions. So the padding (up to 32 samples in your case) must be added to the data and label before they are passed to mx.io.DataBatch, which is what the implemented iterators do. You can also see an example of where this pad property is used in Module when calling predict.
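
A minimal sketch of padding by hand, in case it helps. The helper name pad_batch is hypothetical (it is not part of MXNet), and it uses plain NumPy so it runs without MXNet installed. It fills the missing rows by wrapping around to the first samples of the short batch, which is just one reasonable choice of filler.

```python
import numpy as np

def pad_batch(data, labels, batch_size):
    """Pad a short batch up to batch_size by repeating its own samples.

    Returns (padded_data, padded_labels, pad), where pad is the number
    of filler rows appended at the end of the batch.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("cannot pad an empty batch")
    pad = batch_size - n
    if pad <= 0:
        return data, labels, 0
    # Wrap-around indices: 0..n-1, then 0, 1, ... until batch_size rows.
    # This also works when pad is larger than n (e.g. 11 samples, batch 32).
    idx = np.arange(batch_size) % n
    return data[idx], labels[idx], pad

# The case from the question: 11 samples left over, batch_size=32.
cur_data = np.arange(22, dtype=np.float32).reshape(11, 2)
cur_labels = np.arange(11, dtype=np.float32)
padded_data, padded_labels, cur_pad = pad_batch(cur_data, cur_labels, 32)
print(padded_data.shape, padded_labels.shape, cur_pad)

# With MXNet installed, the padded arrays could then be wrapped, e.g.:
#   batch = mx.io.DataBatch(data=[mx.nd.array(padded_data)],
#                           label=[mx.nd.array(padded_labels)],
#                           pad=cur_pad)
```

The pad value only records how many trailing rows are filler; it does not stop those rows from contributing to the loss during training.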


#3

Ah, is there an approach that pads automatically, or do we have to do it ourselves?

BTW, how should the last minibatch be handled in MXNet?


#4

Most iterators handle padding for you: they fill the last batch up to batch_size and set the pad property on the batch. As an example with NDArrayIter, suppose we have 10 samples in total and a batch size of 6. The first batch has no padding, and the second batch has 4 real samples and a padding of 2. The values in the padded positions are often taken from other samples, such as the previous batch (they are not necessarily zeros), but they are ignored because the pad property is set.

import mxnet as mx

# 10 samples with 2 features each, plus one label per sample
data = mx.nd.random.uniform(shape=(10,2))
label = mx.nd.random.uniform(shape=(10,))
# batch_size=6 does not divide 10, so the last batch must be padded
data_iter = mx.io.NDArrayIter(data, label, batch_size=6, last_batch_handle='pad')
batch = data_iter.next()
print("1st batch padding:", batch.pad)
batch = data_iter.next()
print("2nd batch padding:", batch.pad)

Output:

1st batch padding: 0
2nd batch padding: 2
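
On the consuming side, the pad value is typically used to drop the trailing filler rows from a batch of outputs. A minimal sketch of that idea in plain NumPy follows; the helper name trim_padding is hypothetical and this is not the actual Module code.

```python
import numpy as np

def trim_padding(outputs, pad):
    """Drop the last `pad` rows of a batch of outputs.

    The guard matters: outputs[:-0] would return an empty array,
    so a pad of 0 (or None) must return the batch unchanged.
    """
    if pad:
        return outputs[:len(outputs) - pad]
    return outputs

# A batch of 6 outputs where the last 2 rows come from padding.
batch_outputs = np.arange(12).reshape(6, 2)
valid_outputs = trim_padding(batch_outputs, pad=2)
print(valid_outputs.shape)  # (4, 2)
```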