Getting error in Gluon model after training for 10 epochs


#1

Hi,

I am using the MXNet Gluon module to implement a seq2seq attention-based neural language correction model. The model trains for 10 epochs, and after that I get the following error:

**MXNetError: [19:12:27] include/mxnet/././tensor_blob.h:257: Check failed: this->shape_.Size() == shape.Size() (151 vs. 150) TBlob.get_with_shape: new and old shape do not match total elements**

The error is raised at the line that accumulates the loss:
l_sum += l.asscalar()

Full stack trace:

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x308362) [0x7efc4ffc0362]
[bt] (1) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x308938) [0x7efc4ffc0938]
[bt] (2) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x36ef49) [0x7efc50026f49]
[bt] (3) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x280babc) [0x7efc524c3abc]
[bt] (4) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x29c0926) [0x7efc52678926]
[bt] (5) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x293e123) [0x7efc525f6123]
[bt] (6) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2946524) [0x7efc525fe524]
[bt] (7) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x294a071) [0x7efc52602071]
[bt] (8) /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(+0x2946beb) [0x7efc525febeb]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7efcc834cc80]

I thought one of the reasons might be a memory issue. How can I resolve this error?

Model summary:
Encoder has 2 LSTM layers with hidden size 200
Attention mechanism
Decoder has 2 LSTM layers with hidden size 200
Maximum_Sequence_length = 150
Total training sentences: 3500

Any suggestions on how to resolve this error…

Thanks in advance,
Harathi


#2

Any suggestions to resolve this error…

Thanks,
Harathi


#3

Hi @harathi,

I don’t think your issue is actually with l_sum += l.asscalar(), but elsewhere. Operations in MXNet are asynchronous, and certain operations like asscalar (asnumpy too) are blocking, so an error from an earlier operation only surfaces when asscalar is called. That makes it look as though asscalar itself failed.
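One way to make the failure show up closer to the operation that caused it is to run MXNet with its synchronous engine while debugging. A minimal sketch, assuming the `MXNET_ENGINE_TYPE` environment variable is set before `mxnet` is imported (the import is left as a comment so the snippet runs on its own):

```python
import os

# Switch MXNet to its synchronous NaiveEngine for debugging.
# Set this before `import mxnet` so the engine picks it up when it starts.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

# import mxnet as mx  # import only after the variable is set
```

Training will be slower with the synchronous engine, so this is something to enable only while tracking down the error.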

Given your maximum sequence length is 150 and the error reports an array with 151 elements, could it be that certain sequences haven’t been clipped/padded correctly? Strange that it works for 10 epochs though. Are you performing validation for the first time after 10 epochs?
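A quick way to test the clipping/padding theory is to scan the data for sequences whose length is not exactly 150. The sketch below is plain Python and makes some assumptions: sequences are lists of token ids, `PAD_ID = 0` is the padding id, and the helper names `pad_or_clip` and `find_bad_lengths` are made up for illustration.

```python
MAX_SEQ_LEN = 150  # Maximum_Sequence_length from the model summary
PAD_ID = 0         # assumption: id used for padding in your vocabulary


def pad_or_clip(token_ids, max_len=MAX_SEQ_LEN, pad_id=PAD_ID):
    """Return a copy of token_ids that is exactly max_len long."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def find_bad_lengths(sequences, max_len=MAX_SEQ_LEN):
    """Return (index, length) for every sequence whose length is not max_len."""
    return [(i, len(s)) for i, s in enumerate(sequences) if len(s) != max_len]
```

Running `find_bad_lengths` on both the training and validation sets, after all preprocessing, should show whether any sequence ends up at 151. One possible cause worth checking (a guess, not something the error confirms) is a start or end token being appended after the sequence has already been clipped.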

One method of debugging in Gluon is to break on exception, and work your way up the stack trace to diagnose the issue. You can reference the shape and values of the arrays which is incredibly useful.
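As a rough illustration of that approach, the sketch below catches the exception, walks up the Python traceback, and prints the shape of every local variable that has one. The names `shapes_in_traceback` and `run_with_shape_report` are hypothetical helpers, and `step_fn` stands in for whatever function runs one training step. It only assumes that arrays expose a tuple-valued `.shape`.

```python
import pdb
import sys


def shapes_in_traceback(tb):
    """Collect (function, variable, shape) for locals with a tuple .shape,
    from the outermost frame to the frame where the exception was raised."""
    found = []
    while tb is not None:
        frame = tb.tb_frame
        for name, value in list(frame.f_locals.items()):
            shape = getattr(value, "shape", None)
            if isinstance(shape, tuple):
                found.append((frame.f_code.co_name, name, shape))
        tb = tb.tb_next
    return found


def run_with_shape_report(step_fn, debug=False):
    """Run step_fn; on failure print the shapes seen in each frame, then re-raise."""
    try:
        return step_fn()
    except Exception:
        tb = sys.exc_info()[2]
        for func_name, var_name, shape in shapes_in_traceback(tb):
            print(f"{func_name}: {var_name}.shape = {shape}")
        if debug:
            pdb.post_mortem(tb)  # interactive inspection of the failing frames
        raise
```

Because MXNet runs operations asynchronously, the frames shown are the ones around the blocking call (such as asscalar), so the report is most useful for checking the shapes of the batch that was being processed when the error surfaced.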