Question about word embeddings in MXNet



I have images of handwritten lines and I need to recognize the text in them. For that, I am using 4 CNN layers followed by 2 bidirectional LSTM layers, trained with the CTC loss function.
I am using MXNet Gluon to do this.

I am doing word embedding on the labels using ‘mxnet.contrib.text.embedding’ and the pretrained ‘fastText word embedding’. For each label I get an array of shape (n, 300), where n is the number of words in that label (line) and 300 is the length of the embedding vector. I pad this to a fixed size (seq_len, 300) with seq_len = 100, so every label ends up with shape (100, 300).
But when I feed the labels to the model for training, I get an error saying “label array must be of rank 2 but got 3”. I then flattened the labels but got another error saying “number of labels should be <= sequence length”.

Is my approach correct? Please help me solve this issue.

Please find the attached screenshot for the code that I used to create word embeddings for labels.



I’m not sure what you mean by “I am doing word embedding on labels”. Shouldn’t the label be a sequence of integers corresponding to the correct words the network should have predicted? Embedding is used when the input to the network is words.

Could you please point to the paper (or another reference) you are trying to implement?


Thanks for the reply. I have images of handwritten lines from the IAM lines dataset. The label is the text in the image. Previously I was doing character-wise encoding, but now I want to use a word vector representation of the labels.



Thanks for the additional information. The standard solution in this case is to make the network predict a probability distribution over all words in the vocabulary. The label for a sentence is then a list of integers, with each integer representing a word.
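A minimal sketch of what such integer labels could look like, in plain Python rather than MXNet. The helper names `build_vocab` and `encode_labels` are hypothetical, and the padding value of -1 is an assumption based on my reading of the Gluon `CTCLoss` documentation, so please check it against the version you are using.

```python
# Hypothetical helpers for illustration; not part of MXNet.
PAD = -1  # assumed padding value for labels of unequal length


def build_vocab(lines):
    """Map each distinct word to an integer index, in first-seen order."""
    vocab = {}
    for line in lines:
        for word in line.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_labels(lines, vocab, max_len):
    """Return a rectangular (num_lines, max_len) list of word indices."""
    encoded = []
    for line in lines:
        ids = [vocab[word] for word in line.split()]
        if len(ids) > max_len:
            raise ValueError("label has more words than max_len")
        # Pad on the right so every row has the same length.
        encoded.append(ids + [PAD] * (max_len - len(ids)))
    return encoded


lines = ["hand to stop", "to stop"]
vocab = build_vocab(lines)
labels = encode_labels(lines, vocab, max_len=4)
print(labels)
```

The result is a rank-2 array of shape (batch, label length), with one integer per word instead of one 300-dimensional vector per word.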

I’ve never seen labels being converted to embeddings. How do you then compute the loss? Is there a paper you can point to that does something like this?


Thanks @indu
There is no specific paper for this. But with character encoding, I am currently getting 85% accuracy. These are the predictions I got from my model:

I thought that I could reduce some of the mispredictions by using a word vector representation of the labels (e.g. word2vec), so that the model can predict the words most likely to occur together. For example, in the image above, the label of the second line is “hand to stop…”, but the model predicted it as “hand to slop…”. I thought that with a word vector representation I could correct such mispredictions.
Please correct me if I am wrong.

Thanks in advance.


indu – What Harathi is talking about is not an integer or 1-of-N encoding of words. It’s a word2vec- or GloVe-type word encoding: a vector of 300 or 1000 numbers that encodes the meaning of a word. Training to word embeddings is actually a common thing to do in speech recognition. And it does a good job of predicting words that are not in the training set, because the word embeddings are organized around meaning.




Could you please point to a specific paper or sample code (even from a different framework) that uses word embedding for labels so that I can understand what you are describing?


Sure… a simple Google search for “word embedding labeling” gives lots of references to using it in parsing and POS tagging:

If you google “word embedding labeling speech”, you find papers like:

Some work learns the word embeddings from speech:

For super large vocabulary sizes (Finnish ~ 2.4 million) word2vec type methods are very useful:

And of course there’s tons of stuff on word2vec:




Thanks, I think we have a good understanding of what a word embedding is. I am still not sure I understand how you plan to use word embeddings (which, as you correctly say, are constructed to maximize contextual similarities between words) to improve the visual recognition of a word.

For example, in @harathi’s example, the words ‘slop’ and ‘stop’ are not going to be close in their embedding representations, since they have different meanings and are used in very different contexts.

One way to reduce the mispredictions would be to run a post-processing error-correction phase after your OCR phase. There are several techniques that combine statistical language modeling (SLM) with the visual distance between letters.
For example, see:
Look into ‘context-based error correction’. This Google search should give you a good list of papers to look into:
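To illustrate the idea, here is a toy sketch of context-based correction in plain Python. It is not taken from any of the papers; the bigram counts are invented for the example, the function names are hypothetical, and plain edit distance stands in for a real visual distance between letters.

```python
# Toy bigram counts, invented for illustration only.
BIGRAM_COUNTS = {
    ("hand", "to"): 30,
    ("to", "stop"): 50,
    ("to", "slop"): 1,
}


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def correct(prev_word, word, max_dist=1):
    """Pick the candidate close to `word` that most often follows `prev_word`."""
    candidates = {w2 for (_, w2) in BIGRAM_COUNTS
                  if edit_distance(word, w2) <= max_dist}
    candidates.add(word)  # always keep the original prediction as an option
    return max(sorted(candidates),
               key=lambda c: BIGRAM_COUNTS.get((prev_word, c), 0))


def correct_line(text):
    """Correct each word of a line using the previous (corrected) word."""
    words = text.split()
    out = words[:1]
    for word in words[1:]:
        out.append(correct(out[-1], word))
    return " ".join(out)


fixed = correct_line("hand to slop")
print(fixed)
```

With these toy counts, ‘slop’ is replaced by ‘stop’ because ‘to stop’ is far more frequent than ‘to slop’. A real system would use a language model trained on a large corpus and a distance that reflects which letters are easily confused in handwriting.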

edit: here is a paper that creates a shared embedding space for images and text, which might be closer to what you are looking for

I hope that helps,



We are not trying to recognize single words in isolation but words within full sentences and paragraphs, so recognizing whole words visually (or from speech) makes a lot of sense. In fact, many of the post-processing error-correction methods for OCR that you suggest are themselves based on word embeddings. If you’ve written one of those error-correction algorithms for OCR (I have), you know they don’t do a great job unless they take context into account. Taken together, it makes good sense to use word embeddings for handwriting, hand printing, and OCR.

We would like to run those experiments ourselves. Harathi has most of that work done but is running into technical difficulties setting up the model. We’d appreciate actual help with that model, not misdirection down paths we’ve already tried and know give only so-so results.


@Charles_Rentmeesters: There are two reasons why CTCLoss does not work with your intended use of word embedding as labels:

  1. CTCLoss requires the output layer to be able to have a “blank” output. A continuous embedding space cannot have a categorical “blank” output.
  2. CTCLoss internally uses maximum likelihood to calculate the loss. Maximum likelihood is only appropriate if your outputs are categorical probabilities, and a continuous embedding vector is not a categorical probability.

I highly recommend reviewing the original CTC paper, particularly section 3. This is a very well written paper and clearly explains the conditions and assumptions around CTC formulation.
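To make the two points above concrete, here is a minimal sketch of greedy (best-path) CTC decoding in plain Python. This is not MXNet's implementation; the function name, the toy probabilities, and the choice of the last index as the blank are assumptions made for illustration.

```python
BLANK = 3  # assumed: the last class index is reserved for the blank symbol


def ctc_greedy_decode(frame_probs, blank=BLANK):
    """Take the most likely class per frame, collapse repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded, prev = [], None
    for k in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if k != prev and k != blank:
            decoded.append(k)
        prev = k
    return decoded


# Each row is a categorical distribution over 3 symbols plus the blank.
frames = [
    [0.1, 0.7, 0.1, 0.1],  # symbol 1
    [0.1, 0.6, 0.1, 0.2],  # symbol 1 again (repeat, collapsed)
    [0.1, 0.1, 0.1, 0.7],  # blank, separates the two 1s
    [0.1, 0.6, 0.2, 0.1],  # symbol 1
    [0.6, 0.2, 0.1, 0.1],  # symbol 0
]
decoded = ctc_greedy_decode(frames)
print(decoded)
```

Every frame has to be a probability distribution over a finite set of classes that includes the blank, and the decoded label can never be longer than the number of frames. A flattened (100, 300) embedding has 30000 values, which may be why the second error in the question appears if the network's output sequence is shorter than that.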