Gluon NLP Batchify


I have an imbalanced sentiment analysis fine-tuning experiment (Positive = 15% of the data) that I am trying to run with the BERTClassifier. However, the batching (batch_size = 8) ends up creating batches such that the model learns to classify everything as Negative: most batches contain only labels of 0. Is there a way to reflect the label distribution in the batches created, so that each batch contains at least one Positive record?

Once you create a dataset and pass it to a DataLoader with shuffle=True (or with a RandomSampler), you no longer have direct control over how the sampling works each epoch, and it is possible (in your case even likely, since your batch_size is only 8) that many batches will contain only negative examples.
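To see why: with 15% positives and uniform random sampling, the chance that a batch of 8 contains no positive at all is (1 - 0.15)^8, roughly 27%, so about a quarter of your batches teach the model nothing about the positive class. A quick check (the 15% and batch size 8 are the figures from your question):

```python
# Probability that a uniformly sampled batch contains zero positives,
# given the class prior and batch size from the question.
p_positive = 0.15
batch_size = 8

p_all_negative = (1 - p_positive) ** batch_size
print(f"P(batch has no positive) = {p_all_negative:.3f}")  # ~0.272
```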

One thing you can try is increasing your batch_size so that the probability of a positive example appearing in any given batch is higher. If you cannot increase the batch_size, you can oversample the positive class (or undersample the negative class) when creating the Dataset, before creating the DataLoader.

Alternatively, you can create your own Sampler class (extending mxnet.gluon.data.Sampler) that ensures you get random samples of both classes in every batch, and pass it to the DataLoader constructor via its sampler keyword argument.
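A minimal sketch of such a sampler, assuming your setup. The class name and the one-positive-per-batch guarantee are mine; gluon's Sampler interface only requires `__iter__` and `__len__`, so this plain-Python version mirrors it without the mxnet import. Because it yields whole batches rather than single indices, it would go to the DataLoader's batch_sampler argument rather than sampler:

```python
import random

class BalancedBatchSampler:
    """Yields batches of dataset indices that each contain at least
    `min_pos` positive examples. Mirrors gluon's batch-sampler interface
    (an iterable of index lists)."""

    def __init__(self, labels, batch_size=8, min_pos=1, seed=0):
        self.pos = [i for i, y in enumerate(labels) if y == 1]
        self.neg = [i for i, y in enumerate(labels) if y == 0]
        self.batch_size = batch_size
        self.min_pos = min_pos
        self.rng = random.Random(seed)

    def __iter__(self):
        neg = self.neg[:]
        self.rng.shuffle(neg)
        n_neg_per_batch = self.batch_size - self.min_pos
        for start in range(0, len(neg), n_neg_per_batch):
            batch = neg[start:start + n_neg_per_batch]
            # Draw positives with replacement so the minority class
            # never runs out mid-epoch.
            batch += [self.rng.choice(self.pos) for _ in range(self.min_pos)]
            self.rng.shuffle(batch)
            yield batch

    def __len__(self):
        n_neg_per_batch = self.batch_size - self.min_pos
        return -(-len(self.neg) // n_neg_per_batch)  # ceiling division
```

With batch_size=8 and min_pos=1, each batch holds seven negatives plus one positive (the final batch may be short on negatives). Sampling positives with replacement means some positives repeat within an epoch, which is effectively oversampling folded into the sampler.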