Training a grapheme-to-phoneme model with MXNet


I’m trying to train a Persian G2P (grapheme-to-phoneme) model.
I have a CSV file whose rows look like word,pos,phoneset.
I read it with Python’s csv module and try to encode each field as a one-hot representation,
but I get the error “ValueError: setting an array element with a sequence”.
I know my arrays have different sizes, but I’m not sure how to fix the error.
This is my preprocessing and training code:

import csv
from seq2seq import *
from scikit_learn import *
import numpy as np
import mxnet as mx
from mxnet import autograd, gluon, nd

print("preprocessing data...")

x, y = [], []

# Convert each character to its code point (via ord) and return an nd.array
def convert_ascii(t):
	return nd.array([ord(c) for c in t])

with open("l1.csv", "r") as f:
	s = 0
	r = csv.reader(f)
	for row in r:
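		a = [nd.one_hot(convert_ascii(row[0]), depth=32), nd.one_hot(convert_ascii(row[1]), depth=10)]
		b = nd.one_hot(convert_ascii(row[2]), depth=32)
		# collect the samples; rows differ in length, which is what later breaks np.array()
		x.append(a)
		y.append(b)
		s += 1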

x = np.array(x, dtype = np.float32)
y = np.array(y, dtype = np.float32)

net = seq2seq(s, x.size, 5000000, 5000000)

# train
print("training the data...")
clf = GluonClassifier(model=net, loss_function=gluon.loss.SoftmaxCrossEntropyLoss, init_function=mx.initializer.Xavier, batch_size=256, epochs=1000000, verbose=True).fit(x, y)

GluonClassifier is my scikit-learn-like wrapper around Gluon, and seq2seq is an LSTM sequence-to-sequence model.
Thanks in advance.


Hi @brightening-eyes
You can use padding so that all of your word representations have the same length.
Your depth value is wrong, since (extended) ASCII characters can take 256 values. You can either set depth to 256, or define an alphabet such as alphabet = "abcdef" and use each character's index in it to get a smaller input size.
Even better, I would suggest using an embedding layer rather than manually one-hot encoding your characters.
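
A minimal sketch of the padding and alphabet-index ideas above, assuming a toy alphabet and a made-up maximum length (alphabet, PAD, encode and max_len are illustrative names, not taken from your code). It produces a fixed-size integer array, which is the kind of input an embedding layer such as gluon.nn.Embedding expects:

```python
import numpy as np

# Hypothetical toy alphabet; index 0 is reserved for padding.
PAD = 0
alphabet = "abcdef"
char_to_index = {ch: i + 1 for i, ch in enumerate(alphabet)}

def encode(text, max_len):
    """Map characters to integer indices, truncate to max_len, right-pad with PAD."""
    ids = [char_to_index[ch] for ch in text][:max_len]
    return ids + [PAD] * (max_len - len(ids))

words = ["bad", "face", "a"]
# Every row now has the same length, so np.array() builds a regular 2-D array.
batch = np.array([encode(w, max_len=5) for w in words], dtype=np.int32)
print(batch.shape)
```

With this layout the embedding layer would need len(alphabet) + 1 input indices, because of the extra padding index.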

I am also not sure what you want to do with row[0] and row[1]. Both are part of the input: do you want to combine them into a single input, or keep them as two separate inputs, x1 and x2 for example?

If you want to combine them, one solution is to separate them with a special character, as has been done in several recent papers such as ULMFiT.

Here is an example implementation:

import csv
import numpy as np
import mxnet as mx
from mxnet import autograd, gluon, nd

print("preprocessing data...")

def convert_ascii(t):
    """Convert each character to its ASCII value and return an nd.array."""
    return nd.array([ord(c) for c in t])

# Count the lines first so the buffers can be allocated up front
with open("l1.csv", "r") as f:
    num_lines = sum(1 for _ in f)

max_len_x = 30
max_len_y = 20
ascii_num = 256

# Create the buffers
x = nd.zeros((num_lines, max_len_x, ascii_num))
y = nd.zeros((num_lines, max_len_y, ascii_num))

with open("l1.csv", "r") as f:
    r = csv.reader(f)
    for i, row in enumerate(r):
        # Combine word and POS into a single input, separated by the special character "|"
        x_ = nd.one_hot(convert_ascii(row[0] + "|" + row[1]), depth=ascii_num)
        y_ = nd.one_hot(convert_ascii(row[2]), depth=ascii_num)
        # Copy into the zero-initialized buffers; the unused tail of each row stays zero (padding)
        x[i, :x_.shape[0]] = x_
        y[i, :y_.shape[0]] = y_
print(x.shape, y.shape)
# Output for a 3-line file: (3, 30, 256) (3, 20, 256)


Thanks for your reply.
Persian characters are Unicode, not ASCII: for example, “آ” has code point 1570, which is greater than 255.
However, Persian has 32 letters, which is why I set depth to 32.
row[1] is the part of speech (POS) of the word. For example, in Persian “مرد” can be either a verb or a noun.
As a verb it is pronounced “m o r d”, while as a noun it is pronounced “m A r d”.
Consider “آن مرد مرد.”, which means “that man died”.
I want row[0] and row[1] to be independent of each other; this was one of my problems.
Thanks again.


The depth argument is the number of distinct values you are using. I would suggest defining an explicit alphabet and mapping each character to its index in it:

# A plain string (not a one-element list), so enumerate() yields one character at a time
alphabet = "۱۲۳۴۵۶۷۸۹،ـ؟ﺁﺋﺀﺍﺎﺏﺑﭖﭘﺕﺗﺙﺛﺝﺟﭺﭼﺡﺣﺥﺧﺩﺫﺭﺯﮊﺱﺳﺵﺷﺹﺻﺽﺿﻁ"
alphabet_to_index = {char: index for index, char in enumerate(alphabet)}

def convert_ascii(t):
    """Convert each character to its index in the alphabet and return an nd.array."""
    return nd.array([alphabet_to_index[c] for c in t])
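
If typing the alphabet by hand is error-prone, one alternative is to derive it from the data. The sketch below is only an illustration: build_vocab is a hypothetical helper and the two words stand in for the word column of the CSV. The resulting dictionary size is then the depth to pass to one_hot:

```python
def build_vocab(texts):
    """Map every distinct character in texts to an index, in code-point order."""
    symbols = sorted(set("".join(texts)))
    return {ch: i for i, ch in enumerate(symbols)}

words = ["مرد", "آن"]  # toy sample standing in for the CSV's word column
vocab = build_vocab(words)
depth = len(vocab)  # number of distinct symbols, usable as the one-hot depth
encoded = [[vocab[ch] for ch in w] for w in words]
print(depth, encoded)
```

Indices built this way are always smaller than depth, regardless of how large the underlying code points are.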

I think you have two options. You can combine them as I suggested: your seq2seq model, if it is attention-based for example, will learn to attend to the row[1] part of the input and combine it with the row[0] part to get the right phoneme. Alternatively, you can use conditioning: in the decoding part of your network you can, for example, concatenate the output of your encoder with the row[1] information and pass the result to your decoder. For that, your network should take two inputs. You might need to revisit your sklearn wrapper implementation so that it splits X when it is a tuple and passes the parts to your network.
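
To make the conditioning option concrete, here is a rough shape-level sketch in NumPy. All sizes and the random encoder_out array are made up for illustration; in a real model encoder_out would come from the encoder and the concatenated result would go to the decoder:

```python
import numpy as np

# Hypothetical sizes, for illustration only.
batch_size, num_hidden, num_pos = 4, 8, 10

# Stand-in for the encoder's final hidden state of each word in the batch.
encoder_out = np.random.rand(batch_size, num_hidden).astype(np.float32)

# POS tag ids per word (for example 0 = noun, 1 = verb), one-hot encoded.
pos_ids = np.array([0, 1, 1, 0])
pos_onehot = np.eye(num_pos, dtype=np.float32)[pos_ids]

# Conditioning: concatenate along the feature axis before decoding.
decoder_in = np.concatenate([encoder_out, pos_onehot], axis=1)
print(decoder_in.shape)
```

The decoder's input size then has to be num_hidden + num_pos rather than num_hidden.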


Hi again,
Thanks, my data preprocessing problem is fixed.
Now I get the following error from my model:

Traceback (most recent call last):
  File "", line 47, in <module>
    net = seq2seq(num_lines, a_size, max_x_length, max_y_length)
  File "E:\projects\python\tts\", line 18, in __init__
    self.decoder = nn.Dense(size, in_units = num_hidden, params = self.encoder.params)
  File "C:\python36\lib\site-packages\mxnet\gluon\nn\", line 193, in __init__
  File "C:\python36\lib\site-packages\mxnet\gluon\", line 611, in get
    name, k, str(v), str(getattr(param, k)))
AssertionError: Cannot retrieve Parameter 'seq2seq0_embedding0_weight' because desired attribute does not match with stored for attribute 'shape': desired '(18521, 255)' vs stored '(18521, 143)'.

I don’t know what to do, since my x and y have different shapes (because my alphabet and my phoneme set have different sizes).
This is the code for my seq2seq model:

import numpy as np
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn, rnn

class seq2seq(gluon.Block):
	"""A model with an encoder, recurrent layer, and a decoder."""

	def __init__(self, size, num_embed, num_hidden, num_layers, dropout=0.5, **kwargs):
		super(seq2seq, self).__init__(**kwargs)
		with self.name_scope():
			self.drop = nn.Dropout(dropout)
			self.encoder = nn.Embedding(size, num_embed, weight_initializer = mx.init.Uniform(0.1))
			self.rnn = rnn.LSTM(num_hidden, num_layers, dropout=dropout, input_size=num_embed)
			self.decoder = nn.Dense(size, in_units = num_hidden, params = self.encoder.params)
			self.num_hidden = num_hidden

	def forward(self, inputs, hidden):
		emb = self.drop(self.encoder(inputs))
		output, hidden = self.rnn(emb, hidden)
		output = self.drop(output)
		decoded = self.decoder(output.reshape((-1, self.num_hidden)))
		return decoded, hidden

	def begin_state(self, *args, **kwargs):
		return self.rnn.begin_state(*args, **kwargs)



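I don’t quite understand why you are sharing the params between the decoder and the encoder. If you do want to share them, the shapes must match: the embedding weight has shape (size, num_embed), here (18521, 143), while the Dense layer asks for (size, num_hidden), here (18521, 255). Sharing can therefore only work when num_embed equals num_hidden.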
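
As a small illustration of that shape rule, the sketch below only compares the two weight shapes as Gluon lays them out, (input_dim, output_dim) for Embedding and (units, in_units) for Dense. The helper names are hypothetical and the numbers are the ones from the traceback:

```python
def embedding_weight_shape(size, num_embed):
    """Shape of an Embedding weight: (input_dim, output_dim)."""
    return (size, num_embed)

def dense_weight_shape(size, num_hidden):
    """Shape of a Dense weight: (units, in_units)."""
    return (size, num_hidden)

def can_share(size, num_embed, num_hidden):
    """Weights can only be shared when both layers want the same shape."""
    return embedding_weight_shape(size, num_embed) == dense_weight_shape(size, num_hidden)

# Values from the traceback: size=18521, num_embed=143, num_hidden=255
print(can_share(18521, 143, 255))
print(can_share(18521, 255, 255))
```

The first call reports a mismatch, which is the situation the AssertionError describes; the second shows that the check passes once the two sizes agree.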


Hi again,
I didn’t know that.
What would you recommend instead of sharing the params?
Do you mean I should not pass the params argument to my decoder’s constructor?