Sequence-to-sequence model to invert a string

Hello folks,

I have modelled a small seq2seq network (without attention) to invert an input string of fixed length (code attached below). The strings are one-hot encoded. It seems to converge and correctly predict the output for small vocabularies (fewer than 15 symbols), but it totally fails to converge for larger vocabularies.
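
For context, my data preparation looks roughly like this (a minimal sketch; strings, char_to_index and the iterator wiring are placeholders for what I actually do, not the exact code):

import numpy as np
import mxnet as mx

# Integer-encode each fixed-length input string.
src = np.array([[char_to_index[c] for c in s] for s in strings])
# The task is to predict the reversed string.
tgt = src[:, ::-1].copy()

# One-hot encode the reversed strings as the regression label.
lab = np.zeros((len(tgt), max_string_len, vocab_size_label), dtype='float32')
for i, row in enumerate(tgt):
    lab[i, np.arange(max_string_len), row] = 1.0

train_iter = mx.io.NDArrayIter(
    data={'source': src, 'target': tgt},
    label={'softmax_label': lab},
    batch_size=batch_size
)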

If I try to use SoftmaxOutput instead, all the predicted one-hot vectors go to zero and nothing can be one-hot decoded.
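
For reference, the SoftmaxOutput variant I tried is roughly this (a sketch, assuming the label is reshaped to integer class indices per time step; this head would replace the regression head further down):

# Per-time-step softmax over the vocabulary; assumes the label holds
# integer class indices of shape (batch, max_string_len).
fc = mx.sym.FullyConnected(data=flat, num_hidden=max_string_len * vocab_size_label)
logits = mx.sym.Reshape(data=fc, shape=(-1, vocab_size_label))
flat_label = mx.sym.Reshape(data=label, shape=(-1,))
net = mx.sym.SoftmaxOutput(data=logits, label=flat_label, name='softmax')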

I’d like to know if I made some gross mistake, or if it’s just that the network is too simple to capture sequences drawn from vocabularies whose size is between 50 and 1000.

Many thanks, guys

import mxnet as mx

# Hyperparameters
num_hidden = 128
embed_size = 256
dataset_size = 5000
batch_size = 100

# Input symbols: source string, target string fed to the decoder, and the training label.
source = mx.sym.Variable('source')
target = mx.sym.Variable('target')
label = mx.sym.Variable('softmax_label')

# Embed source and target token indices into dense vectors.
source_embed = mx.sym.Embedding(
    data=source,
    input_dim=vocab_size_train,
    output_dim=embed_size
)
target_embed = mx.sym.Embedding(
    data=target,
    input_dim=vocab_size_label,
    output_dim=embed_size
)

# Bidirectional GRU encoder.
bi_cell = mx.rnn.BidirectionalCell(
    mx.rnn.GRUCell(num_hidden=num_hidden, prefix="gru1_"),
    mx.rnn.GRUCell(num_hidden=num_hidden, prefix="gru2_"),
    output_prefix="bi_"
)

encoder = bi_cell
        
# Unroll the encoder over the source sequence; we only need its final states.
_, encoder_state = encoder.unroll(
    length=max_string_len,
    inputs=source_embed,
    merge_outputs=False
)

# Concatenate the final hidden states of the forward and backward GRUs
# into a single (batch, num_hidden * 2) vector to initialise the decoder.
encoder_state = mx.sym.concat(encoder_state[0][0], encoder_state[1][0])

# GRU decoder, sized to match the concatenated bidirectional state.
decoder = mx.rnn.GRUCell(num_hidden=num_hidden * 2)

rnn_output, _ = decoder.unroll(
    length=max_string_len,
    begin_state=[encoder_state],  # unroll expects a list of state symbols
    inputs=target_embed,
    merge_outputs=True
)

# Flatten (batch, max_string_len, num_hidden * 2) to (batch, max_string_len * num_hidden * 2).
flat = mx.sym.Flatten(data=rnn_output)

# Project to one score per (position, vocabulary entry).
fc = mx.sym.FullyConnected(
    data=flat,
    num_hidden=max_string_len * vocab_size_label
)
act = mx.sym.Activation(data=fc, act_type='relu')


# Reshape back to (batch, max_string_len, vocab_size_label).
out = mx.sym.Reshape(data=act, shape=(0, max_string_len, vocab_size_label))

# Regress directly against the one-hot encoded target string.
net = mx.sym.LinearRegressionOutput(data=out, label=label)

# FIT THE MODEL
model = mx.module.Module(net, data_names=['source','target'], context=ctx)

model.fit(
    train_data=train_iter,
    eval_data=eval_iter,
    eval_metric='mse',
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.01, 'momentum': 0.9},
    initializer=mx.initializer.Xavier(),
    batch_end_callback=mx.callback.Speedometer(batch_size, 10),
    epoch_end_callback=epoch_end_callback,
    num_epoch=max_epoch
)
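
And this is roughly how I one-hot decode the predictions afterwards (a sketch; index_to_char is a placeholder mapping from indices back to characters):

# Take the argmax over the vocabulary axis at each position and map the
# resulting indices back to characters.
preds = model.predict(eval_iter).asnumpy()  # (batch, max_string_len, vocab_size_label)
indices = preds.argmax(axis=-1)             # (batch, max_string_len)
decoded = [''.join(index_to_char[i] for i in row) for row in indices]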