Why bucketing (vs direct batching) in GluonNLP?


Hi, I saw that sorted/fixed bucketing are notable features of GluonNLP. It is not clear to me why those steps (putting similar-size sequences in the same batches) need to involve the concept of buckets. Couldn’t it be done directly as the dataset is shaped into batches?



The bucket abstraction provides a cleaner separation of concerns: it keeps the code that does the bucketing (the sampler) separate from the code that iterates through the data (the data iterator). Specifically, the bucketing ‘plugs’ on top of the iterator. There are also many bucketing strategies, so abstracting the concept makes it easier to keep the code clean.

To directly answer your question, the bucketing code does essentially what you describe, so in theory you could write it on top of the data iterator, but you would end up with something similar to the sampler if you followed the natural separation of concerns pattern.

There’s a good description of bucketing strategies here: https://gluon-nlp.mxnet.io/api/notes/data_api.html
It illustrates the kinds of things you can use buckets for, such as varying the batch size per bucket to make sure that you’re putting enough data ‘through’ the GPU for short sequence lengths.
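
To make the separation concrete, here is a rough pure-Python sketch of the idea. It is not GluonNLP's implementation: the function name `fixed_bucket_batches`, the `scale_batch` flag and the scaling rule are all made up for illustration. The sampler only looks at sequence lengths and returns batches of indices, and the code that reads the data just consumes those indices.

```python
import random


def fixed_bucket_batches(lengths, batch_size, num_buckets,
                         scale_batch=False, seed=0):
    """Hypothetical sampler: return lists of indices of similar-length items."""
    lo, hi = min(lengths), max(lengths)
    # Equal-width length ranges; ceil division so that `hi` fits in the last one.
    width = max(1, -(-(hi - lo + 1) // num_buckets))
    buckets = [[] for _ in range(num_buckets)]
    for idx, n in enumerate(lengths):
        buckets[(n - lo) // width].append(idx)

    rng = random.Random(seed)
    batches = []
    for b, members in enumerate(buckets):
        if not members:
            continue
        size = batch_size
        if scale_batch:
            # Illustrative rule (an assumption, not the library's formula):
            # buckets of shorter sequences get proportionally larger batches.
            upper = lo + (b + 1) * width - 1  # longest length this bucket holds
            size = max(batch_size, batch_size * hi // upper)
        rng.shuffle(members)
        for start in range(0, len(members), size):
            batches.append(members[start:start + size])
    rng.shuffle(batches)  # mix short and long batches over the epoch
    return batches


# The "data iterator" side never needs to know how the batches were formed.
dataset = [[0] * n for n in (3, 4, 20, 21, 5, 19, 4, 22)]
lengths = [len(seq) for seq in dataset]
for batch in fixed_bucket_batches(lengths, batch_size=2, num_buckets=2,
                                  scale_batch=True):
    samples = [dataset[i] for i in batch]
    pad_to = max(len(s) for s in samples)  # padding stays small within a bucket
```

Swapping in a different strategy (sorted bucketing, other bucket boundaries, another batch-size rule) only changes the sampler function; the loop that reads the data stays the same, which is the main reason for keeping the bucket concept separate.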



excellent as usual, thanks!