GluonNLP: Numba Error with Word Embeddings Training

I’m following the word embedding training tutorial here: https://gluon-nlp.mxnet.io/examples/word_embedding/word_embedding_training.html, but using my own custom dataset (space-separated and cleaned).

The error I am getting is:
Beginnign epoch 1 and resampling data.
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
in <module>
----> 1 train_embedding(num_epochs=5)

<ipython-input-37-98c218d225ec> in train_embedding(num_epochs)
      8 
      9         print('Beginnign epoch %d and resampling data.' % epoch)
---> 10         for i, batch in enumerate(batches):
     11             batch = [array.as_in_context(context) for array in batch]
     12             with mx.autograd.record():

~/anaconda3/envs/mxnet/lib/python3.7/site-packages/gluonnlp/data/stream.py in _closure()
    120             istuple = isinstance(item, tuple)
    121             if istuple:
--> 122                 yield self._fn(*item)
    123                 while True:
    124                     try:

/Volumes/archive/deardenlab/guhlin/kmer_vec_embed/data.py in cbow_fasttext_batch(centers, contexts, num_tokens, subword_lookup, dtype, index_dtype)
    324     """Create a batch for CBOW training objective with subwords."""
    325     _, contexts_row, contexts_col = contexts
--> 326     data, row, col = subword_lookup(contexts_row, contexts_col)
    327     centers = mx.nd.array(centers, dtype=index_dtype)
    328     contexts = mx.nd.sparse.csr_matrix(

~/anaconda3/envs/mxnet/lib/python3.7/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    374                 e.patch_message(msg)
    375 
--> 376             error_rewrite(e, 'typing')
    377         except errors.UnsupportedError as e:
    378             # Something unsupported is present in the user code, add help info

~/anaconda3/envs/mxnet/lib/python3.7/site-packages/numba/dispatcher.py in error_rewrite(e, issue_type)
    341                 raise e
    342             else:
--> 343                 reraise(type(e), e, None)
    344 
    345         argtypes = []

~/anaconda3/envs/mxnet/lib/python3.7/site-packages/numba/six.py in reraise(tp, value, tb)
    656             value = tp()
    657         if value.__traceback__ is not tb:
--> 658             raise value.with_traceback(tb)
    659         raise value
    660 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<built-in function getitem>) with argument(s) of type(s): (array(int64, 1d, C), float64)
 * parameterized
In definition 0:
    All templates rejected with literals.
In definition 1:
    All templates rejected without literals.
In definition 2:
    All templates rejected with literals.
In definition 3:
    All templates rejected without literals.
In definition 4:
    All templates rejected with literals.
In definition 5:
    All templates rejected without literals.
In definition 6:
    All templates rejected with literals.
In definition 7:
    All templates rejected without literals.
In definition 8:
    All templates rejected with literals.
In definition 9:
    All templates rejected without literals.
In definition 10:
    TypeError: unsupported array index type float64 in [float64]
    raised from /Volumes/userdata/staff_users/josephguhlin/anaconda3/envs/mxnet/lib/python3.7/site-packages/numba/typing/arraydecl.py:71
In definition 11:
    TypeError: unsupported array index type float64 in [float64]
    raised from /Volumes/userdata/staff_users/josephguhlin/anaconda3/envs/mxnet/lib/python3.7/site-packages/numba/typing/arraydecl.py:71
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: typing of intrinsic-call at /Volumes/archive/deardenlab/guhlin/kmer_vec_embed/data.py (481)

File "data.py", line 481:
def cbow_lookup(context_row, context_col, subwordidxs, subwordidxsptr,
    <source elided>
    for i, idx in enumerate(context_col):
        start = subwordidxsptr[idx]
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/latest/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/latest/reference/numpysupported.html

For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile

If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new

(The “Beginnign” typo is inherited from the tutorial.)
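
As far as I can tell from the message, context_col is arriving as float64, so start = subwordidxsptr[idx] indexes an int64 array with a float index, which Numba’s nopython mode rejects. Here is a minimal sketch (my own reduction, not tutorial code) that reproduces the same TypingError outside GluonNLP:

import numba
import numpy as np

@numba.njit
def lookup(ptr, idx):
    # nopython mode only accepts integer (or integer-array) indices
    return ptr[idx]

ptr = np.arange(10, dtype=np.int64)
lookup(ptr, 3)    # fine: integer index
lookup(ptr, 3.0)  # TypingError: unsupported array index type float64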

My vocabulary has 28,852,611 tokens, and I’m only generating subwords of length 7 or 9 (or both). That still leaves a fairly large number of subwords (3,065,880).

The code works with the Text8 dataset, so I’m not sure where mine diverges, since most of it is copied from the tutorial.

Thanks,
–Joseph

The code I’m using to prepare the dataset is:

dataset = nlp.data.CorpusDataset("out.ftinput")
counter = nlp.data.count_tokens(itertools.chain.from_iterable(dataset))
vocab = nlp.Vocab(counter, unknown_token=None, padding_token=None,
                  bos_token=None, eos_token=None, min_freq=5)
idx_to_counts = [counter[w] for w in vocab.idx_to_token]

def code(sentence):
    return [vocab[token] for token in sentence if token in vocab]

dataset_t = dataset.transform(code)

What is the format of your dataset? Is it UTF-8 encoded? Can you share the data so that I can try to reproduce the problem?

It’s UTF-8 encoded. I can’t share it right now (maybe next week?), but I think I’ve found the problem. It looks to be an artifact of my data: some words are infrequent enough that min_freq pruning removes them, to the point that entire samples end up empty. When I change min_freq to 1 it seems to work.
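
In case it helps anyone else, here is a minimal sketch of the workaround I ended up with: keep min_freq=5 but drop any sample that becomes empty after out-of-vocabulary tokens are filtered out (the SimpleDataset wrapping is my own addition, not from the tutorial):

def code(sentence):
    return [vocab[token] for token in sentence if token in vocab]

# Samples whose tokens are all pruned by min_freq come out empty,
# and those empty samples are what tripped the Numba batching code,
# so filter them out before training.
dataset_t = nlp.data.SimpleDataset(
    [coded for coded in (code(sentence) for sentence in dataset) if coded])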

Sorry to change topic, but is there any way to remove the < and > from the beginning and end of words when generating subwords, without editing gluonnlp/vocab/subwords.py directly?
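
To illustrate what I’m after, here is a plain-Python sketch of marker-free subword extraction (a hypothetical helper of my own, not a gluonnlp API):

def char_ngrams(word, ngram_sizes=(7, 9)):
    """Character n-grams of `word` without fastText-style '<'/'>' markers."""
    return [word[i:i + n]
            for n in ngram_sizes
            for i in range(len(word) - n + 1)]

char_ngrams("ACGTACGTA")
# ['ACGTACG', 'CGTACGT', 'GTACGTA', 'ACGTACGTA']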