As the title mentions, how could I train a model to classify whether sentences like the following are logical or illogical?
“He has two legs”–logical
“He has six legs”–illogical
Solutions I tried:
1: Train a classifier with a CNN
I have done this before, and it works very well when there is enough data. The problem is that I do not have a large data set with "logical"/"illogical" labels for this case.
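For reference, this is roughly what I mean by the CNN classifier (a minimal Gluon sketch; vocab_size and the other hyperparameters are placeholders, not my actual settings):

```python
import mxnet as mx
from mxnet.gluon import nn

def build_cnn_classifier(vocab_size=10000, embed_size=128, num_filters=100):
    # token ids (batch, seq_len) -> two logits: logical / illogical
    net = nn.HybridSequential()
    net.add(
        nn.Embedding(vocab_size, embed_size),
        # Conv1D expects (batch, channels, seq_len), so swap the last two axes
        nn.HybridLambda(lambda F, x: F.transpose(x, (0, 2, 1))),
        nn.Conv1D(num_filters, kernel_size=3, activation='relu'),
        nn.GlobalMaxPool1D(),
        nn.Dense(2),
    )
    net.initialize(mx.init.Xavier())
    return net

net = build_cnn_classifier()
print(net(mx.nd.ones((4, 10))).shape)  # dummy batch of 4 sentences, 10 token ids each -> (4, 2)
```

Training it is the standard routine (SoftmaxCrossEntropyLoss over labeled sentences); the blocker is only the lack of labeled data.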
2: Use a language model
Train one of the language models provided by gluonnlp on a data set such as wiki, then use it to estimate the probability of each sentence. If the probability of a sentence is high, mark it as logical, and vice versa. The problem is that the results are not good.
The way I estimate the probability:
```python
import mxnet as mx

def __predict(self):
    # Score every line of the input box, one probability per line
    lines = self.__text_edit_input.toPlainText().split("\n")
    result = ""
    for line in lines:
        result += str(self.__sentence_prob(line, 10)) + "\n"
    self.__text_edit_output.setPlainText(result)

def __prepare_sentence(self, tokens, max_len):
    # Map tokens to vocab ids, truncating or zero-padding to max_len
    result = mx.nd.zeros([max_len, 1], dtype='float32')
    n = min(len(tokens), max_len)
    for index in range(n):
        result[index] = self.__vocab[tokens[index]]
    return result

def __sentence_prob(self, text, max_len):
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    n = min(len(tokens), max_len)
    prob = 0.0
    for i in range(n - 1):
        # output[i] is the model's distribution over the token at position i + 1,
        # so score it against the id of the token that actually appears there
        next_prob = mx.nd.softmax(output[i])
        prob += next_prob[0][int(data[i + 1].asscalar())].asscalar()
    return prob / max(n - 1, 1)
```
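A variant I am also considering: averaging raw softmax probabilities lets frequent function words dominate the score. The more standard sentence score is the length-normalized sum of log-probabilities, which is exactly what perplexity exponentiates. A sketch under the same assumptions as above (same self.__model, self.__vocab, self.__tokenizer):

```python
import math
import mxnet as mx

def __sentence_logprob(self, text, max_len):
    # Average log P(token[i+1] | tokens[0..i]); closer to 0 = more plausible.
    # exp(-score) is the per-sentence perplexity.
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    n = min(len(tokens), max_len)
    score = 0.0
    for i in range(n - 1):
        probs = mx.nd.softmax(output[i])
        target = int(data[i + 1].asscalar())  # id of the actual next token
        score += math.log(probs[0][target].asscalar() + 1e-12)  # guard against log(0)
    return score / max(n - 1, 1)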
Possible issues with the language-model approach:
- Not segmenting the sentences correctly (I am using jieba to split the Chinese sentences; see the snippet after this list)
- Vocabulary size is too small/large (tested 10000, 15000 and 30000)
- Loss is still too high after 50 epochs (perplexity around 190, i.e. the model assigns on average about 1/190 probability to the correct next token)
- Sentence length should be larger/smaller (tried 10, 20, 35)
- The data I use does not meet my requirements (not every sentence in it is logical)
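For the first point, a quick sanity check (a hypothetical snippet, not code from my app; the toy vocab stands in for the one built from my training corpus) is to look at what jieba actually produces and which tokens fall out of the vocabulary:

```python
import jieba
import gluonnlp as nlp

# Toy stand-in for the Vocab built from the training corpus, so the snippet runs on its own
vocab = nlp.Vocab(nlp.data.count_tokens(["他", "有", "两", "条", "腿"]))

tokens = list(jieba.cut("他有六条腿"))  # "He has six legs"
print(tokens)  # does jieba's segmentation match what the model was trained on?

unk_id = vocab[vocab.unknown_token]
print([t for t in tokens if vocab[t] == unk_id])  # tokens the model only ever sees as <unk>
```

If many tokens map to the unknown id, the tokenizer and the training vocabulary do not match, which would explain poor probabilities regardless of the model.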