Need help tokenizing words in the Flickr8 dataset

#1

I have been working on this for a couple of weeks now. I have accomplished a lot, but I'm stuck on how to process/tokenize the words. I have been following the Word Embedding tutorial: my vocab is good and idx_to_counts works, but I'm stuck on tokenizing. The most common error is: 'list' object has no attribute 'transform'.

#2

Can you provide a small reproducible example? Are you using a tokenizer provided by GluonNLP (https://gluon-nlp.mxnet.io/api/data.html#transforms)?

#3

This is straight out of the Word Embedding tutorial; the counter and vocab work fine. I'm trying to figure out how to transform the sentences into tokens. I was also thinking of trying the SpacyTokenizer. This is my first attempt at NLP; I'm comfortable with images and convolution-based nets, but NLP is much more difficult. My long-term goal is to reproduce "One neural network, many uses": https://towardsdatascience.com/one-neural-network-many-uses-image-captioning-image-search-similar-image-and-words-in-one-model-1e22080ce73d

counter = nlp.data.count_tokens(itertools.chain.from_iterable(descriptions))
vocab = nlp.Vocab(counter, unknown_token=None, padding_token=None, bos_token='', eos_token='', min_freq=5)
This returns almost exactly the same number of words as reported in the arXiv articles I've read.

idx_to_counts = [counter[w] for w in vocab.idx_to_token]
def code(sentence):
    return [vocab[token] for token in lines if token in vocab]

flickr8 = lines.transform(code, lazy=False)
This fails with the list error. I have close to 40 functions that convert the text in so many different ways that I'm stuck on which to use and in what order.

#4

The reason you get the error is that lines is a list of tokens. You have to call transform on an nlp.data object.
The following should work:

text8 = nlp.data.Text8()
def code(sentence):
   return [vocab[token] for token in sentence if token in vocab]
text8 = text8.transform(code, lazy=False)
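
If the captions are already held in a plain Python list, as lines appears to be, the same mapping can be applied without transform at all. The sketch below is library-free: the token_to_idx dict is only a stand-in for nlp.Vocab, and the sample captions and min_freq value are made up for illustration. (Wrapping the list in mxnet.gluon.data.SimpleDataset should also give it a transform method, but that is not shown here.)

```python
from collections import Counter
from itertools import chain

# Made-up, already tokenized captions standing in for the Flickr8 data.
lines = [
    ["a", "dog", "runs", "on", "grass"],
    ["a", "dog", "jumps"],
    ["a", "child", "runs"],
]

# Count tokens and keep those seen at least min_freq times,
# roughly the filtering that nlp.Vocab(counter, min_freq=...) applies.
counter = Counter(chain.from_iterable(lines))
min_freq = 2
kept = sorted(tok for tok, n in counter.items() if n >= min_freq)
token_to_idx = {tok: i for i, tok in enumerate(kept)}

def code(sentence):
    """Map one tokenized caption to indices, dropping out-of-vocabulary tokens."""
    return [token_to_idx[token] for token in sentence if token in token_to_idx]

# A list comprehension plays the role of dataset.transform(code, lazy=False).
encoded = [code(sentence) for sentence in lines]
print(encoded)
```

Note that code is called once per caption, so the comprehension inside it must iterate over its sentence argument rather than over the whole lines list.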

#5

There are no errors, but this doesn't tokenize the Flickr8 dataset; it returns a long string of mostly repeating numbers. I'm so close: I can retrieve an image/caption pair, normalize the image, and pass it through a dense layer. I'm going to use the merge method; I just need a way to tokenize the Flickr8 captions. If the above doesn't work, can you point me in the right direction?
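
One possible cause of short, repeating indices, offered only as a guess since the caption-loading code is not shown: if each caption is still a raw string, a loop like "for token in sentence" walks over characters rather than words. A minimal sketch of splitting captions into word tokens before counting, using only the standard library; the tokenize helper, its regular expression, and the sample captions are illustrative assumptions, not part of GluonNLP:

```python
import re

def tokenize(caption):
    """Lowercase a caption and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", caption.lower())

# Made-up raw captions standing in for Flickr8 text.
captions = [
    "A dog runs on the grass.",
    "Two children play in the snow!",
]

# Iterating over a string yields single characters, not words.
first_chars = list(captions[0])[:5]
print(first_chars)

# Tokenizing first gives one list of words per caption.
descriptions = [tokenize(c) for c in captions]
print(descriptions[0])
```

The resulting descriptions has the list-of-token-lists shape that itertools.chain.from_iterable(descriptions) in the counter line expects.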