If you use mxnet/gluon to implement a seq2seq model, how do you train on batches of sequences with different lengths? I have seen a lot of code that sets batch_size to 1. That avoids the problem of EOS falling at a different position in each sequence during decoding, but training is then very slow. How can I use a batch_size greater than 1?
Everything has to fit into an NDArray, so if the sequences have different lengths, you have to pad them all to the same length. If your output is a softmax, you'd append an EOS token to each target sequence, and you would also mask out the padding on every sequence by multiplying the loss at the padded positions by zero, so that the padding contributes no gradient.
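To make the padding and masking concrete, here is a minimal sketch in plain Python rather than MXNet. The token ids PAD = 0 and EOS = 1 and the helper names pad_batch and masked_cross_entropy are assumptions made up for this example, not part of any library. In a real model the same mask would be applied to the per-timestep loss of the NDArray batch.

```python
import math

PAD, EOS = 0, 1  # assumed token ids, chosen only for this example


def pad_batch(seqs, pad_id=PAD, eos_id=EOS):
    """Append EOS to each sequence, then right-pad to the longest length.

    Returns the padded targets and a mask that is 1.0 on real tokens
    (including EOS) and 0.0 on padding.
    """
    with_eos = [list(s) + [eos_id] for s in seqs]
    max_len = max(len(s) for s in with_eos)
    padded = [s + [pad_id] * (max_len - len(s)) for s in with_eos]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in with_eos]
    return padded, mask


def masked_cross_entropy(probs, targets, mask):
    """Average -log p(target) over unmasked positions only.

    probs[b][t] is the softmax output (a list over the vocabulary) for
    sequence b at timestep t. Padded positions are multiplied by zero,
    so they add nothing to the loss and therefore nothing to the gradient.
    """
    total, count = 0.0, 0.0
    for b in range(len(targets)):
        for t in range(len(targets[b])):
            total += mask[b][t] * -math.log(probs[b][t][targets[b][t]])
            count += mask[b][t]
    return total / count


# Two target sequences of different lengths become one rectangular batch.
targets, mask = pad_batch([[5, 6, 7], [8]])
```

Here targets is a 2 x 4 batch and mask marks the last two positions of the second sequence as padding. Dividing by the number of unmasked positions, rather than by the full batch size times the padded length, is one possible normalization choice among several.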
Thank you for your reply. I still have some questions. Please help me.
Q: If the decoder input is also padded so that all sequences have the same length, does that mean EOS no longer determines when each sequence stops during decoding? Does the number of decoding steps become the padded length? Will this add extra computational overhead?
I don’t quite understand your question. If you are asking whether the sequence-to-sequence network has to do extra computation during training because of the padding, the answer is yes. However, GPUs often have spare computing capacity and the computation runs in parallel, so the padding often causes no slowdown. During inference, however, you have the option of inferring one element at a time and stopping when EOS is observed.
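As an illustration of stopping on EOS at inference time, here is a rough sketch of greedy decoding that also works for a batch: it tracks which sequences have finished and stops as soon as all of them have produced EOS. This is one possible approach, not necessarily what the reply above had in mind. The names greedy_decode_batch, step_fn and toy_step are hypothetical, EOS = 1 is an assumed token id, and step_fn stands in for a single step of a real decoder.

```python
EOS = 1  # assumed end-of-sequence token id


def greedy_decode_batch(step_fn, batch_size, max_len, eos_id=EOS):
    """Decode a batch step by step; stop once every sequence has emitted EOS.

    step_fn(t, prev_tokens) is a hypothetical stand-in for one decoder step.
    It must return the next token id for each sequence in the batch.
    """
    outputs = [[] for _ in range(batch_size)]
    finished = [False] * batch_size
    prev = [None] * batch_size  # no previous token before the first step
    for t in range(max_len):
        next_tokens = step_fn(t, prev)
        for b, tok in enumerate(next_tokens):
            if not finished[b]:
                # Only unfinished sequences keep growing.
                outputs[b].append(tok)
                finished[b] = tok == eos_id
        prev = next_tokens
        if all(finished):
            break  # every sequence has seen EOS, so no further steps are run
    return outputs


def toy_step(t, prev_tokens):
    """Toy decoder: sequence b emits token 7 until step b + 1, then EOS."""
    return [EOS if t >= b + 1 else 7 for b in range(len(prev_tokens))]


decoded = greedy_decode_batch(toy_step, batch_size=3, max_len=10)
```

With the toy decoder, the three sequences end after 2, 3 and 4 tokens, and the loop runs 4 steps instead of max_len. Sequences that finish early are still passed through step_fn until the longest one ends, which is the same kind of extra computation that padding causes during training.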