How to make LSTM handle images of different sizes?

Hi,

I implemented a CNN-LSTM model for text recognition in images. I extract image features with a CNN, and the extracted features are given to an LSTM layer. When I trained the model with images of the same size (128, 1600), it did well. But when I tried to train the model with images of a different size, I got the following error:

AssertionError: Expected shape (800, 4000) is incompatible with given shape (800, 16384).

I am getting this error at the LSTM. With an image of size (128, 1600), the shape of the CNN output is (Batch_size, 32, 64, 800). I flatten this, which gives (Batch_size, 1638400), and make 100 (sequence_length) splits along axis 1. The resulting ndarray of shape (100, Batch_size, 16384) is sent to the LSTM.

As the LSTM weights are initialized in the first forward pass, when the first image has size (128, 1600) the weights are initialized with shape (800, 16384), and when I then try to give an image of a different size, I get the above error.
Here 800 is: 2 (bidirectional) * 2 (Num LSTM layers) * 200 (LSTM Hidden Units)
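
For reference, this is roughly what the reshaping described above looks like (just a sketch; the batch size of 4 and the use of mx.nd.stack to build the TNC array are my assumptions):

    import mxnet as mx

    feat = mx.nd.zeros((4, 32, 64, 800))         # CNN output for a (128, 1600) image, batch size 4
    flat = feat.reshape((4, -1))                 # (4, 1638400)
    parts = flat.split(num_outputs=100, axis=1)  # 100 pieces, each (4, 16384)
    seq = mx.nd.stack(*parts, axis=0)            # (100, 4, 16384), fed to the LSTM as 'TNC'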

How can I resolve this issue and make the LSTM handle images of different sizes?

Any suggestions will be helpful.

Thanks in advance,
Harathi

I believe you’re using Gluon. When you create your gluon.rnn.LSTM layer, do you specify the layout as 'TNC'?
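
For context, the layout is set when the layer is constructed; a minimal sketch (the hidden size and layer count below come from the numbers you quoted, the rest is assumed):

    from mxnet.gluon import rnn

    # 'TNC' = (sequence length, batch size, features); it is also the default layout
    lstm = rnn.LSTM(hidden_size=200, num_layers=2, bidirectional=True, layout='TNC')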

Hi @safrooze,

Thanks for replying…

I didn’t specify the layout in the LSTM layer, but I think the default is ‘TNC’ according to the Gluon LSTM docs, and I am giving the input to the LSTM in that format (‘TNC’).

Thanks,
Harathi

@harathi You’re correct. I just read your question in detail and what you’re trying to do is invalid. Your network must have a fixed weight size. A few potential solutions:

  • Pad smaller images to a fixed large size
  • Scale images to a fixed size
  • Max-pool or avg-pool the feature vectors of each sequence element into a single element

Also, looking at how you split your array, I believe you won’t get what you expect. I think you’re trying to split the array such that each element of the sequence is 8 columns of the image (i.e. 32x64x8). If that’s what you want, the correct thing to do would be to split on the last axis first and then flatten the remaining axes. Alternatively, you can transpose from (N,C,H,W) to (N,W,H,C) and then flatten/split the same way you do right now.
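
Roughly like this, assuming the (N, 32, 64, 800) CNN output and 100 sequence steps from earlier in the thread (the batch size of 4 is a placeholder):

    import mxnet as mx

    x = mx.nd.zeros((4, 32, 64, 800))        # (N, C, H, W) from the CNN
    cols = x.split(num_outputs=100, axis=3)  # split on width first: 100 arrays, each (N, 32, 64, 8)
    seq = mx.nd.stack(*[c.reshape((c.shape[0], -1)) for c in cols], axis=0)
    # seq is (100, N, 32*64*8); each sequence element is now really 8 columns of the image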

Thanks @safrooze,

That means I need to split the array along the width of the image before flattening it. Am I getting that right?

If I have some very small images (30, 175) and some very large ones (150, 1500), will padding the smaller images up to the larger size affect how the model learns on the smaller images?

If you don’t mind, can you please explain this point…

Thanks,
Harathi

  1. Yes, split along the width before flattening.
  2. Learning is not impacted. Your network has to have the capacity to learn different feature sizes. Just make sure the image sizes you see during training are representative of the image sizes presented during inference.
  3. If each sequence element is, say, (batch x 32 x 64 x 8), you can flatten it to (batch x 32 x 512) and apply MaxPool1D or AvgPool1D to get a single (batch x 32) vector for that sample that is independent of the image dimensions.
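
A minimal sketch of point 3, assuming one sequence element of shape (batch, 32, 512) as in the example above:

    import mxnet as mx
    from mxnet.gluon import nn

    pool = nn.MaxPool1D(pool_size=512)   # or nn.AvgPool1D(pool_size=512)
    elem = mx.nd.zeros((4, 32, 512))     # one sequence element: (batch, channels, 64*8)
    vec = pool(elem).reshape((4, 32))    # (batch, 32), independent of the image width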

@safrooze, thanks a lot

I will try this and let you know if I get any errors.

Thanks,
Harathi

Hi @safrooze,

To get a (batch x 32) vector from (batch x 32 x 512), we need to set the stride to the width (here 512) in MaxPool1D. Please correct me if I am wrong…

Thanks,
Harathi

No, you’d want to set pool_size to 512 so that it would find the maximum of each channel within the 512 values.
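
A quick shape check of that (the zeros array is just a stand-in):

    import mxnet as mx

    pool = mx.gluon.nn.MaxPool1D(pool_size=512)    # strides defaults to pool_size
    print(pool(mx.nd.zeros((4, 32, 512))).shape)   # -> (4, 32, 1); reshape to (4, 32) afterwards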

Oh ok, got it…
Thanks @safrooze

Hi @safrooze,

Can I do the same with mx.nd.Pooling() instead of mx.gluon.nn.MaxPool1D?
I did it as follows, where x is the convolution output (BATCH_SIZE, Channels, H1, W1):

    seqs = x.split(num_outputs=SEQ_LEN, axis=3)  # SEQ_LEN arrays, each (N, CHANNELS, HEIGHT, WIDTH/SEQ_LEN)
    pooled_seqs = []
    for seq in seqs:
        # flatten HEIGHT and WIDTH/SEQ_LEN into one axis: (N, CHANNELS, HEIGHT * WIDTH/SEQ_LEN)
        seq = seq.reshape((seq.shape[0], seq.shape[1], seq.shape[2] * seq.shape[3]))
        # max-pool over the whole flattened axis; the pooling window is passed via `kernel`
        pool_seq = mx.nd.Pooling(seq, kernel=(seq.shape[2],), pool_type='max')  # (N, CHANNELS, 1)
        pooled_seqs.append(pool_seq)
    x = mx.nd.concat(*[elem.expand_dims(axis=0) for elem in pooled_seqs], dim=0)  # (SEQ_LEN, N, CHANNELS, 1)
    x = x.reshape((x.shape[0], x.shape[1], x.shape[2]))  # (SEQ_LEN, BATCH_SIZE, Channels)
    x = self.lstm(x)

I have a doubt here. I am getting 512 * 32 = 16384 features for every sequence element, and I am reducing that to 32 features. Will this not impact the model’s performance?

Sorry, I am asking too many questions.

Thanks,
Harathi