Inconsistent shape following an LSTM fed with variable length inputs

I have the following Decoder which is fed with captions of various sizes (captions in a batch are padded to the same length).

The problem is that on the first forward pass, the linear layer locks in the dimensions of its input, which then don't match those of subsequent batches, throwing: `Error in operator dense1_fwd: Shape inconsistent, Provided = [9956,9728], inferred shape=(9956,9216)`.

What are the possible solutions I can employ to overcome this issue, preferably without changing much in the network design?

from mxnet.gluon import HybridBlock
from mxnet.gluon.nn import Dense, Embedding
from mxnet.gluon.rnn import LSTM


class DecoderRNN(HybridBlock):

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int, num_layers: int):
        super(DecoderRNN, self).__init__()
        self.embed = Embedding(input_dim=vocab_size, output_dim=embed_size)
        self.lstm = LSTM(hidden_size, num_layers, layout="NTC")
        self.linear = Dense(vocab_size, flatten=True)

    def hybrid_forward(self, F, features, captions, *args, **kwargs):
        # features: (N, C) image features; captions: (N, T) padded token ids
        embeddings = self.embed(captions)
        features_and_embeddings = F.concat(features.expand_dims(axis=1), embeddings, dim=1)
        output = self.lstm(features_and_embeddings)  # (N, T+1, hidden_size)
        result = self.linear(output)
        return result

I believe when `flatten` is set to `True` on a Dense block, the expectation is that the input will be of the form NC, where N can be variable and C must be fixed. From the error message `Shape inconsistent, Provided = [9956,9728], inferred shape=(9956,9216)` you posted, it looks like your batch size has remained constant but your channel size has changed.

So, to make sure that your channel size remains the same, you should use `flatten=False` in the Dense layer. That should solve your problem.
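A plain NumPy sketch (not actual Gluon code; the shapes and the `dense` helper are illustrative assumptions) of why `flatten=True` locks the input width while `flatten=False` does not: with `flatten=True`, an (N, T, C) input is collapsed to (N, T*C) before the affine transform, so the inferred weight width depends on the padded length T of the first batch seen.

```python
import numpy as np

N, C, vocab = 4, 512, 100  # hypothetical batch size, hidden size, vocab size

def dense(x, W, b, flatten):
    if flatten:
        x = x.reshape(x.shape[0], -1)  # (N, T*C): width varies with padded length T
    return x @ W.T + b                 # flatten=False: weights act on the last axis only

rng = np.random.default_rng(0)
W_flat = rng.standard_normal((vocab, 18 * C))  # width inferred from a first batch with T=18
W_last = rng.standard_normal((vocab, C))       # flatten=False weights: independent of T
b = np.zeros(vocab)

x18 = rng.standard_normal((N, 18, C))
x19 = rng.standard_normal((N, 19, C))

out_flat = dense(x18, W_flat, b, flatten=True)   # OK: 18 * 512 == 9216, shape (N, vocab)
# dense(x19, W_flat, b, flatten=True)            # fails: 19 * 512 == 9728 != 9216
out_last = dense(x19, W_last, b, flatten=False)  # OK for any T: shape (N, 19, vocab)
```

The second weight matrix works for every padded length, which is exactly what `flatten=False` gives you in the Dense layer.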


That worked, superb.

The suggestion of setting `flatten=False` will fix your scenario, but it's not the channel length that varies; it's the sequence length, T, that varies between batches. Looking at the two dimensions, 9728 and 9216, your hidden_size is likely their gcd, which is 512 in this case. That means one batch has a padded sequence length of 18 and the other a padded sequence length of 19. Setting `flatten=False` will allow for the varying padded sequence length, as @keerthanvasist mentions.
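The arithmetic above can be checked directly from the two widths in the error message:

```python
from math import gcd

provided, inferred = 9728, 9216   # the two widths from the error message
hidden = gcd(provided, inferred)  # the likely hidden_size
assert hidden == 512
assert provided // hidden == 19   # padded sequence length of one batch
assert inferred // hidden == 18   # padded sequence length of the other
```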

Trivia: there's an approximately 60.79% chance that the hidden_size is 512 given the information above. It's the probability that the two sequence lengths a and b are relatively prime, which is 6/π² (the probability that gcd(n, m) = 1). Cheers.
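A quick numerical check of that trivia (the range `n = 200` is an arbitrary choice for the empirical estimate): the density of coprime pairs among the integers approaches 6/π².

```python
import math
from math import gcd

print(6 / math.pi ** 2)  # ≈ 0.6079

# Empirical fraction of coprime pairs (a, b) with 1 <= a, b <= n:
n = 200
coprime = sum(gcd(a, b) == 1 for a in range(1, n + 1) for b in range(1, n + 1))
frac = coprime / n ** 2  # close to 6/π² already at this small range
```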