I have a question about the Clojure MXNet example that uses BERT for sentence-pair classification: https://github.com/apache/incubator-mxnet/tree/master/contrib/clojure-package/examples/bert
I'm not sure whether questions about a contrib package belong here, but I think this question is general enough that it applies to MXNet in any language.
In the example, the author uses the following code to build a classifier on top of BERT:
(defn fine-tune-model
  "msymbol: the pretrained network symbol
   num-classes: the number of classes for the fine-tune datasets
   dropout: the dropout rate amount"
  [msymbol {:keys [num-classes dropout]}]
  (as-> msymbol data
    (sym/dropout {:data data :p dropout})
    (sym/fully-connected "fc-finetune" {:data data :num-hidden num-classes})
    (sym/softmax-output "softmax" {:data data})))
Basically, starting from the BERT-base model (in msymbol), it attaches dropout, a fully-connected layer, and a softmax output on top of it (effectively a single linear classifier, I assume?).
My question is the following: usually, BERT produces an output embedding for the [CLS] token that is meant to be used for classification, but in this code I don't see where the author selects just that [CLS] position before the classifier. It looks like they are feeding the whole output of BERT into it. Other packages provide a built-in BertClassifier class of some kind that handles this for you.
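To make concrete what I mean by "using just the [CLS] token": BERT's encoder output has shape (batch, seq_len, hidden), and the usual recipe keeps only sequence position 0 before the classifier. A minimal NumPy sketch of that slicing (illustrative stand-in only; the shapes and the weight matrix are made up, and this is not MXNet code from the example):

```python
import numpy as np

# Toy stand-in for a BERT encoder output: (batch, seq_len, hidden).
# All shapes here are illustrative assumptions, not from the example.
batch, seq_len, hidden, num_classes = 2, 8, 4, 3
rng = np.random.default_rng(0)
encoded = rng.standard_normal((batch, seq_len, hidden))

# What I'd expect for classification: keep only the [CLS] position
# (index 0 along the sequence axis) before the fully-connected layer.
cls_embedding = encoded[:, 0, :]   # shape (batch, hidden)

# A single fully-connected layer applied to the [CLS] slice.
w = rng.standard_normal((hidden, num_classes))
logits = cls_embedding @ w         # shape (batch, num_classes)
```

The part I can't find in the Clojure example is anything corresponding to that `[:, 0, :]` selection.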
So, is there some magic going on in this example that I'm not seeing, or is the author really feeding BERT's entire output into the classifier (when typically you should use only the [CLS] token's embedding)?
Thank you very much