Can somebody explain why the concatenation of states[0] & states[-1] is used as input to the output layer? I understand why states[-1] would be used, but I'm not clear on why states[0] is useful. Is this because the model is bidirectional? For a unidirectional LSTM, would we just use states[-1]?
Also, the rnn.LSTM API indicates that it returns both the output and hidden state. Why not use output[-1] directly as input to the softmax?
Here, states is the output of the encoder; you may refer to http://d2l.ai/chapter_recurrent-neural-networks/encoder-decoder.html
states actually contains the outputs (i.e., the hidden states of the last LSTM layer) of the encoder for all time steps, so calling it outputs might be a better name. states[0] is the output at the first time step and states[-1] is the output at the last time step. We need states[0] as well because a biLSTM is used here: the backward direction's summary of the whole sequence lives at the first time step.
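To see why both ends of the sequence matter for a bidirectional model, here is a small NumPy sketch (not MXNet; the LSTM cell is replaced by a running sum purely to make the direction of information flow visible). In a biLSTM, the output at each time step concatenates the forward and backward states, so the last step holds the forward direction's full-sequence summary and the first step holds the backward direction's:

```python
import numpy as np

T, N, H = 5, 2, 3          # sequence length, batch size, hidden size
x = np.random.rand(T, N, H)

# Stand-in for a bidirectional recurrence: the forward "state" at step t
# summarizes inputs[0..t], the backward "state" summarizes inputs[t..T-1].
fwd = np.cumsum(x, axis=0)                    # left to right
bwd = np.cumsum(x[::-1], axis=0)[::-1]        # right to left

# Like a biLSTM, each time step's output concatenates both directions.
outputs = np.concatenate([fwd, bwd], axis=2)  # shape (T, N, 2*H)

# outputs[-1][:, :H] is the forward summary of the WHOLE sequence;
# outputs[0][:, H:] is the backward summary of the WHOLE sequence.
full = x.sum(axis=0)
assert np.allclose(outputs[-1][:, :H], full)
assert np.allclose(outputs[0][:, H:], full)

# Hence the model concatenates the first and last time steps:
encoding = np.concatenate([outputs[0], outputs[-1]], axis=1)
print(encoding.shape)  # (N, 4*H) -> (2, 12)
```

Using only outputs[-1] would keep the forward summary but pair it with a backward state that has seen just the final word; taking both ends keeps a full-sequence summary from each direction.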
Just to clarify, according to documentation for LSTM:
- `data`: input tensor with shape (sequence_length, batch_size, input_size) when layout is "TNC". For other layouts, dimensions are permuted accordingly using the transpose() operator, which adds performance overhead. Consider creating batches in TNC layout during the data batching step.
- `states`: a list of two initial recurrent state tensors. Each has shape (num_layers, batch_size, num_hidden). If bidirectional is True, the shape will instead be (2*num_layers, batch_size, num_hidden). If states is None, zeros will be used as default begin states.
- `out`: output tensor with shape (sequence_length, batch_size, num_hidden) when layout is "TNC". If bidirectional is True, the output shape will instead be (sequence_length, batch_size, 2*num_hidden).
- `out_states`: a list of two output recurrent state tensors with the same shape as `states`. If states is None, out_states will not be returned.
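The documented shapes can be written out numerically. A quick sanity sketch (the sizes here are hypothetical, chosen only for illustration):

```python
import numpy as np

seq_len, batch, input_size = 7, 4, 8
num_layers, num_hidden = 2, 16
D = 2  # direction multiplier: 2 because bidirectional=True, else 1

# data in "TNC" layout: (sequence_length, batch_size, input_size)
data = np.zeros((seq_len, batch, input_size))

# states: a list of two tensors (h and c for an LSTM), each
# (D * num_layers, batch_size, num_hidden)
states = [np.zeros((D * num_layers, batch, num_hidden)) for _ in range(2)]

# out: (sequence_length, batch_size, D * num_hidden)
out_shape = (seq_len, batch, D * num_hidden)
print(out_shape)          # (7, 4, 32)
print(states[0].shape)    # (4, 4, 16)
```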
```python
from mxnet import nd
from mxnet.gluon import nn, rnn

class BiRNN(nn.Block):
    def __init__(self, vocab, embed_size, num_hiddens, num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        # Set bidirectional to True to get a bidirectional recurrent neural
        # network
        self.encoder = rnn.LSTM(num_hiddens, num_layers=num_layers,
                                bidirectional=True, input_size=embed_size)
        self.decoder = nn.Dense(2)

    def forward(self, inputs):
        # The shape of inputs is (batch size, number of words). Because the
        # LSTM needs the sequence as the first dimension, the input is
        # transposed and the word features are then extracted. The output
        # shape is (number of words, batch size, word vector dimension).
        embeddings = self.embedding(inputs.T)
        # The shape of states is (number of words, batch size, 2 * number of
        # hidden units).
        states = self.encoder(embeddings)
        # Concatenate the hidden states of the initial time step and final
        # time step to use as the input of the fully connected layer. Its
        # shape is (batch size, 4 * number of hidden units).
        encoding = nd.concat(states[0], states[-1])
        outputs = self.decoder(encoding)
        return outputs
```
self.encoder(embeddings) passes in only the input embeddings, without states ("states is None" in the doc), so according to the documentation it returns only out (i.e., "If states is None, out_states will not be returned.").
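To make the shape bookkeeping in the model concrete, here is a NumPy trace of the tensors through forward, using random data and hypothetical sizes (this only mimics the shapes; the real values come from the embedding and LSTM layers):

```python
import numpy as np

batch_size, num_words = 4, 10
embed_size, num_hiddens = 8, 16

# inputs: (batch_size, num_words) token indices
inputs = np.zeros((batch_size, num_words), dtype=np.int64)

# self.embedding(inputs.T): (num_words, batch_size, embed_size)
embeddings = np.random.rand(num_words, batch_size, embed_size)

# self.encoder(embeddings) with bidirectional=True and no initial states:
# only the output tensor is returned, (num_words, batch_size, 2*num_hiddens)
states = np.random.rand(num_words, batch_size, 2 * num_hiddens)

# concat of first and last time steps: (batch_size, 4 * num_hiddens)
encoding = np.concatenate([states[0], states[-1]], axis=1)
print(encoding.shape)  # (4, 64)

# self.decoder is Dense(2), so the final outputs are (batch_size, 2)
```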
We have improved the clarity in the following PR:
Thanks for the detailed explanation.