Non-deterministic results after running GluonNLP BERT example

I’ve followed the tutorial at https://gluon-nlp.mxnet.io/examples/sentence_embedding/bert.html, adapting the script to work with my dataset. The tutorial shows how to fine-tune BERT for sentence-pair classification. The main differences are that my script does single-sentence classification and that my problem is multi-class.
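For context, the model side of my adaptation looks roughly like this (intent_classes is the list of class labels in my dataset and ctx is my GPU context; apart from num_classes it matches the tutorial):

import mxnet as mx
import gluonnlp as nlp

# Pre-trained BERT base encoder, as in the tutorial
bert_base, vocabulary = nlp.model.get_model('bert_12_768_12',
                                            dataset_name='book_corpus_wiki_en_uncased',
                                            pretrained=True, ctx=ctx, use_pooler=True,
                                            use_decoder=False, use_classifier=False)

# Multi-class head: num_classes=len(intent_classes) instead of the tutorial's 2
bert_classifier = nlp.model.BERTClassifier(bert_base,
                                           num_classes=len(intent_classes),
                                           dropout=0.1)
bert_classifier.classifier.initialize(init=mx.init.Normal(0.02), ctx=ctx)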

To test the implementation, I tried a very small dataset (~80 sentences). I was not expecting good results from this; I am just trying to verify that my implementation works and that the network trains.

One thing missing from the tutorial is using the trained model to predict after the training phase. For that, I created the following function:

transform_predict = data.transform.BERTDatasetTransform(bert_tokenizer, max_len,
                                                        class_labels=intent_classes,
                                                        has_label=False,
                                                        pad=True,
                                                        pair=False)

def predict(sentence):
    test_raw = mx.gluon.data.SimpleDataset([[sentence]])
    test = test_raw.transform(transform_predict)

    test_dataloader = mx.gluon.data.DataLoader(test, batch_size=1)
    for batch_id, (token_ids, valid_length, segment_ids) in enumerate(test_dataloader):
        with mx.autograd.record():
            # Load the data to the GPU
            token_ids = token_ids.as_in_context(ctx)
            valid_length = valid_length.as_in_context(ctx)
            segment_ids = segment_ids.as_in_context(ctx)
            
            # Forward computation
            out = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
            return out.argmax(axis=1)

The problem is that calling this function twice on the same data may yield different results. This is probably accentuated by the very low accuracy of the network, but I was expecting additional forward passes run after the training phase to be deterministic.
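For example (the sentence here is just a placeholder; any input from my dataset shows the same behaviour):

sentence = 'Turn off the lights'
print(predict(sentence))
print(predict(sentence))  # may print a different class index than the first call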

Why am I getting different results? Is there some kind of randomness that is executed on each forward pass? Based on my (limited) knowledge of neural networks, I thought a forward pass was completely deterministic, so this is surprising.
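To make the question concrete, this is the kind of invariant I would have expected to hold (a sketch, reusing token_ids, segment_ids and valid_length as produced inside predict above):

# Two forward passes on identical inputs...
out1 = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))
out2 = bert_classifier(token_ids, segment_ids, valid_length.astype('float32'))

# ...should produce identical outputs if the forward pass is deterministic
print(mx.nd.abs(out1 - out2).max())  # I expected this to be exactly 0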

Thank you very much in advance