Language Model Data Sets

https://en.diveintodeeplearning.org/chapter_recurrent-neural-networks/lang-model-dataset.html

In class Vocab, tokens are sorted 1) in decreasing order of count and 2) in lexicographical order when several tokens have the same count, using these two lines:

self.token_freqs = sorted(counter.items(), key=lambda x: x[0])
self.token_freqs.sort(key=lambda x: x[1], reverse=True)

I understand why we need to sort tokens by count, but is there any specific reason to also sort them in lexicographical order?
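To make the behaviour of those two lines concrete, here is a minimal sketch on a toy token list. The `tokens` list and the helper name `sort_token_freqs` are made up for illustration and are not part of the book's code. It relies on Python's sort being stable, so the alphabetical order from the first sort survives among tokens with equal counts. One plausible benefit, which the chapter does not confirm, is that the resulting order no longer depends on the order in which tokens were first seen.

```python
from collections import Counter


def sort_token_freqs(tokens):
    """Hypothetical helper reproducing the two sorting lines from Vocab."""
    counter = Counter(tokens)
    # Pass 1: order the (token, count) pairs alphabetically by token.
    token_freqs = sorted(counter.items(), key=lambda x: x[0])
    # Pass 2: stable sort by count, largest first. Pairs with equal
    # counts keep the alphabetical order established in pass 1.
    token_freqs.sort(key=lambda x: x[1], reverse=True)
    return token_freqs


# Toy data, invented for this example.
tokens = ["the", "cat", "sat", "on", "the", "mat", "a", "cat"]

two_pass = sort_token_freqs(tokens)
print(two_pass)

# The same ordering expressed as a single sort with a compound key.
single_pass = sorted(Counter(tokens).items(), key=lambda x: (-x[1], x[0]))

# Feeding the tokens in a different order gives the same result.
reversed_input = sort_token_freqs(list(reversed(tokens)))
```

Here `two_pass`, `single_pass`, and `reversed_input` all hold the same list, with "cat" and "the" (count 2) first, followed by the count-1 tokens in alphabetical order.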


The sentence "The modification we did here is that corpus is a single list, not a list of token lists, since we do not the sequence information in the following models. " does not make sense, especially the “since we do not the” part.

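Whatever word is missing from the quoted sentence, the structural change it describes can be sketched as follows. This is only an illustration: the `lines` data is invented, and the book's own loading code may differ.

```python
# A list of token lists, one inner list per line of text (toy data).
lines = [["the", "time", "machine"], ["by", "h", "g", "wells"]]

# Flatten into a single list of tokens, discarding the line boundaries.
corpus = [token for line in lines for token in line]

print(corpus)
```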