I’m really pleased to see how easy it is to do sentence embeddings with ELMo in GluonNLP: https://gluon-nlp.mxnet.io/examples/sentence_embedding/elmo_sentence_representation.html
However I don’t understand the output in the demo: what do the various dimensions (2,14,256) represent?
As far as I know, ELMo provides you contextualized embeddings for words of the sentence, not a whole sentence embedding. This explains the result you receive by running the example.
If you run the example, the output is a list of 3 NDArrays, each of shape (2, 14, 256):
[(2, 14, 256), (2, 14, 256), (2, 14, 256)].
- You get 3 NDArrays because there are 3 RNNs in the Encoder. This number can vary depending on which model architecture you select.
- Because the example uses a batch_size of 2, dimension 0 has size 2.
- 14 is the number of tokens (words) in your sample.
- The last dimension is the embedding size, which depends on the model. The example uses elmo_2x1024_128_2048cnn_1xhighway: the embedding size is 128 and, since the model is bidirectional, you get 256.
According to the paper, if you want to get a single sentence embedding from these vectors, you need to multiply each vector by a learnable scaling parameter and then sum them up. (For simplicity, you may want to avoid training these scaling parameters and just assume that the scale is always 1 for all layers.)
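As a rough illustration of the "scale and sum" step, here is a minimal NumPy sketch. The arrays are random stand-ins with the shapes from the example, not real ELMo output, and `combine_layers` is a hypothetical helper name, not part of GluonNLP. With all scales set to 1 it reduces to a plain sum of the layers.

```python
import numpy as np

# Random stand-ins for the three arrays returned by the example
# (assumption: shape is (batch, tokens, features) as discussed above).
rng = np.random.default_rng(0)
layer_outputs = [rng.standard_normal((2, 14, 256)) for _ in range(3)]

def combine_layers(layers, scales=None):
    """Weighted sum over layers; every scale defaults to 1."""
    if scales is None:
        scales = [1.0] * len(layers)
    stacked = np.stack(layers, axis=0)                  # (layers, batch, tokens, features)
    weights = np.asarray(scales).reshape(-1, 1, 1, 1)   # broadcast over the other axes
    return (weights * stacked).sum(axis=0)              # (batch, tokens, features)

token_vectors = combine_layers(layer_outputs)
print(token_vectors.shape)
```

Note that the result still holds one vector per token; to end up with one vector per sentence, you would additionally have to pool over the token axis.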
Alternatively, you can use BERT - it also comes pretrained with GluonNLP.
So how should one go from that output to a sentence embedding? I’m a bit confused by the documentation title, which is “Extract sentence features with pre-trained ELMo”.
Just to clarify a small thing: I think the first embedding here is the non-contextualized charcnn embedding, since there are only 2 RNNs in the example. The rest of the explanation is correct.
It is a sentence embedding because each word has embeddings that are contextualized in the sentence, plus one non-contextualized embedding that depends only on the word itself.
The simplest way is to sum these NDArrays up.
The number 14 is the length of the longest sequence in the first batch.
In the example they use TextLineDataset, which treats each line as an example. So, after tokenization and adding the bos and eos tokens, the first batch will consist of 2 records:
['<bos>', '<eos>'] (they have a line break in the code when defining the test string, so it is treated as an empty line)
['<bos>', 'Extensive', 'experiments', 'demonstrate', 'that', 'ELMo', 'representations', 'work', 'extremely', 'well', 'in', 'practice', '.', '<eos>']
The second example has 14 tokens. When the two are put together in a batch using a batchify_fn that has Pad for the first element,
dataset_batchify_fn = nlp.data.batchify.Tuple(nlp.data.batchify.Pad(), nlp.data.batchify.Stack()), the first example is also padded to a length of 14.
The next batch will have a different number of words, but its length will again equal the longest sequence in that particular batch.
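To make the padding behaviour concrete, here is a small pure-Python sketch of what padding to the longest sequence in a batch does. It is only an illustration: `pad_batch` and the `'<pad>'` marker are hypothetical, and the real pipeline pads numeric ids rather than token strings.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    lengths = [len(seq) for seq in sequences]           # original (valid) lengths
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

# The two records of the first batch, as listed above.
batch = [
    ['<bos>', '<eos>'],
    ['<bos>', 'Extensive', 'experiments', 'demonstrate', 'that', 'ELMo',
     'representations', 'work', 'extremely', 'well', 'in', 'practice', '.', '<eos>'],
]

padded, lengths = pad_batch(batch, pad_value='<pad>')   # '<pad>' is a placeholder
print([len(seq) for seq in padded])  # [14, 14]
print(lengths)                       # [2, 14]
```

Both records end up with length 14, which is why the token dimension of the output is 14 even though the first record only has 2 real tokens.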
Summing and normalizing by the number of words per sentence is one way, as Sergey mentioned, and you can try it to see if it works for your use case.
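A minimal NumPy sketch of that idea, assuming you have the per-token vectors of shape (batch, tokens, features) and the true length of each sentence (here hard-coded as 2 and 14, matching the first batch). `masked_mean` is a hypothetical helper, and the input is random stand-in data. Padded positions are masked out so they do not affect the average.

```python
import numpy as np

def masked_mean(token_vectors, valid_lengths):
    """Average token vectors per sentence, ignoring padded positions."""
    lengths = np.asarray(valid_lengths)
    max_len = token_vectors.shape[1]
    # mask[i, t] is True when position t is a real token of sentence i
    mask = np.arange(max_len)[None, :] < lengths[:, None]     # (batch, tokens)
    summed = (token_vectors * mask[:, :, None]).sum(axis=1)   # (batch, features)
    return summed / lengths[:, None]                          # assumes lengths > 0

rng = np.random.default_rng(0)
token_vectors = rng.standard_normal((2, 14, 256))  # stand-in for real embeddings
valid_lengths = [2, 14]                            # true lengths before padding

sentence_vectors = masked_mean(token_vectors, valid_lengths)
print(sentence_vectors.shape)  # (2, 256)
```

Whether to include the `<bos>` and `<eos>` positions in the average is a design choice; this sketch keeps them.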
If you use BERT you can try to use the embeddings of the [CLS] token as overall sentence embeddings.