LSTM Split0 Operator Error


#1

Reposting from Github Issue Upon Request:

Description

Getting an error in the split0 operator when training an image captioning network in mxnet.

Package used (Python/R/Scala/Julia): I’m using Python

Error Message:

---------------INFO-----------------------
vocab_size:663
sentence_length:46
-----------------------------------------

Creating Iterators...
Initiating Training...
INFO:root:Epoch[0] Train-perplexity=655.513238
INFO:root:Epoch[0] Time cost=1.261
infer_shape error. Arguments:
  image_feature: (50, 1024)
  word_data: (50, 77)
  softmax_label: (50,)
Traceback (most recent call last):
  File "2_train_val.py", line 102, in <module>
    epoch_end_callback=mx.callback.do_checkpoint(checkpoints_prefix, period=10)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/base_module.py", line 528, in fit
    batch_end_callback=eval_batch_end_callback, epoch=epoch)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/base_module.py", line 244, in score
    self.forward(eval_batch, is_train=False)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/module.py", line 608, in forward
    self.reshape(new_dshape, new_lshape)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/module.py", line 470, in reshape
    self._exec_group.reshape(self._data_shapes, self._label_shapes)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 381, in reshape
    self.bind_exec(data_shapes, label_shapes, reshape=True)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/module/executor_group.py", line 357, in bind_exec
    allow_up_sizing=True, **dict(data_shapes_i + label_shapes_i))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/executor.py", line 402, in reshape
    arg_shapes, _, aux_shapes = self._symbol.infer_shape(**kwargs)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 989, in infer_shape
    res = self._infer_shape_impl(False, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1119, in _infer_shape_impl
    ctypes.byref(complete)))
  File "/home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator split0: [18:11:40] src/operator/./slice_channel-inl.h:208: Check failed: dshape[real_axis] % param_.num_outputs == 0U (31 vs. 0) You are trying to split the 1-th axis of input tensor with shape [50,78,256] into num_outputs=47 evenly sized chunks, but this is not possible because 47 does not evenly divide 78

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x276938) [0x7f446310f938]
[bt] (1) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x276d48) [0x7f446310fd48]
[bt] (2) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x286ccb7) [0x7f4465705cb7]
[bt] (3) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25eab07) [0x7f4465483b07]
[bt] (4) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x244274f) [0x7f44652db74f]
[bt] (5) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2445268) [0x7f44652de268]
[bt] (6) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(MXSymbolInferShape+0x1539) [0x7f4465260659]
[bt] (7) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f4487cd1ec0]
[bt] (8) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f4487cd187d]
[bt] (9) /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f4487ee6dee]

Minimum reproducible example

I’m using the code from the following repository: https://github.com/saicoco/mxnet_image_caption

Everything is identical, except I use my own dataset. I’ve preprocessed the data identically to what this implementation expects (it originally used the Flickr8k dataset).

What have you tried to solve it?

I basically need some help trying to understand where this error is coming from – in particular, why param_.num_outputs is set to 47.

  1. The error message is thrown here: https://github.com/apache/incubator-mxnet/blob/master/src/operator/slice_channel-inl.h#L208

  2. num_outputs seems to be set here: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/module/base_module.py#L385 … however, that line runs after self.forward is called, and the error seems to be thrown before num_outputs is set there
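
The numbers in the message can be checked by hand. Below is a minimal sketch in plain Python (not MXNet's actual code; split_remainder is a hypothetical helper) that mirrors the divisibility check quoted in the error. The 46 + 1 and 77 + 1 reading is an assumption based on the sentence_length and word_data values in the log, not something confirmed from the repository.

```python
def split_remainder(shape, axis, num_outputs):
    """Return what is left over when shape[axis] is cut into num_outputs equal chunks."""
    return shape[axis] % num_outputs

# Shape and num_outputs copied from the error message above.
remainder = split_remainder((50, 78, 256), axis=1, num_outputs=47)
print(remainder)  # 31, the left-hand value in "(31 vs. 0)"

# Assumption: the symbol was built for 46 words plus one extra step (46 + 1 = 47),
# while the failing batch carries 77 words plus one extra step (77 + 1 = 78).
print(split_remainder((50, 47, 256), axis=1, num_outputs=47))  # 0, splits cleanly
```

If that reading is right, 47 would be a parameter of the split operator fixed when the symbol was built, and the batch with word_data of shape (50, 77) simply has a different sequence length.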


#2

Hi @pn-train,

Given this error happens during the scoring of the model, I’d double-check your evaluation data if I were you.
As a simple test, you could call fit with eval_data set to your training data and see if that runs (obviously ignoring the metrics returned!). If it does, confirm that you’re applying the same pre-processing steps to your evaluation data. If it doesn’t, could you provide a few more details on the processing of your data, ideally including some samples.
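
A quick way to compare the two datasets is to look at their per-input shapes with the batch dimension ignored. Here is a rough sketch, assuming the training word_data has the sentence_length of 46 printed in the INFO block; find_shape_mismatches is a hypothetical helper, and the training shapes are guesses rather than values taken from your run.

```python
def find_shape_mismatches(train_shapes, eval_shapes):
    """Compare per-input shapes, ignoring the batch dimension (axis 0)."""
    mismatches = {}
    for name, train_shape in train_shapes.items():
        eval_shape = eval_shapes.get(name)
        # Flag inputs that are missing or differ in any non-batch dimension.
        if eval_shape is None or tuple(train_shape[1:]) != tuple(eval_shape[1:]):
            mismatches[name] = (train_shape, eval_shape)
    return mismatches

# Evaluation shapes copied from the infer_shape error in the log.
eval_shapes = {"image_feature": (50, 1024), "word_data": (50, 77)}
# Assumption: training batches use sentence_length=46 from the INFO block.
train_shapes = {"image_feature": (50, 1024), "word_data": (50, 46)}

mismatches = find_shape_mismatches(train_shapes, eval_shapes)
print(mismatches)
```

With real iterators, the shapes could be read from each iterator's provide_data instead of being typed in by hand. Any input reported here would point at a pre-processing difference between the two sets, such as a different padding length.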