Hello,
I’m new to MXNet and to the DL field in general. For a project of mine I’m trying to implement Tacotron in Python with MXNet.
For a newbie like me this is a difficult task, so it would be very useful to have some hints on the CBHG module (Convolutional 1-D filter Bank, Highway networks, bidirectional Gated recurrent unit).
Right now I’m trying to use the CBHG to predict linear-scale spectrograms from mel spectrograms (the last part of the Tacotron system). The same module is also used in the encoder to process embeddings of size 256.
My data shapes are like: (batch_size, num_bands, num_time_frames)
So linear spectrograms have num_bands = 1025 and mel spectrograms have num_bands = 80.
num_time_frames is fixed to the maximum audio file length among all my audio data
I’m stuck on making the shapes compatible across all steps of the CBHG module. Those steps are:

Create a bank of K stacked convolutional filters, where the k-th filter has a kernel width of k (1st filter: kernel width = 1, 2nd filter: kernel width = 2, …)

Maxpooling

Two more convolutions for projection: the 1st with kernel_size=3, num_filter=256; the 2nd with kernel_size=3, num_filter=80

Highway net: 4 fully connected layers, num_hidden=128

Bidirectional GRU, 128 cells
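From what I understand, the time axis should stay the same length through all of these steps if every convolution and the max-pool use "same" padding. A pure-Python walk-through of the shapes I expect (all values here are example assumptions: batch_size=2, num_bands=80, num_time_frames=100, K=8):

```python
# Expected CBHG shapes when "same" padding preserves the time axis T.
# All numbers are example assumptions, not taken from real data.
B, C, T = 2, 80, 100   # (batch_size, num_bands, num_time_frames)
K = 8                  # number of filters in the conv bank

bank = (B, K * 128, T)   # K conv outputs, 128 channels each, concat on channels
pooled = bank            # max-pool kernel=2, stride=1, "same" padding keeps T
proj1 = (B, 256, T)      # conv kernel=3, num_filter=256
proj2 = (B, 80, T)       # conv kernel=3, num_filter=80
residual = proj2         # proj2 + input only works because shapes match
print(residual)          # (2, 80, 100)
```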
This is my code:
Convolution bank of K filters, emb_size=256:
def conv1dBank(conv_input, K):
    conv = mx.sym.Convolution(data=conv_input, kernel=(1,), num_filter=emb_size//2)
    (conv, mean, var) = mx.sym.BatchNorm(data=conv, output_mean_var=True)
    conv = mx.sym.Activation(data=conv, act_type='relu')
    for k in range(2, K+1):
        convi = mx.sym.Convolution(data=conv_input, kernel=(k,), num_filter=emb_size//2)
        (convi, mean, var) = mx.sym.BatchNorm(data=convi, output_mean_var=True)
        convi = mx.sym.Activation(data=convi, act_type='relu')
        conv = mx.symbol.concat(conv, convi, dim=2)  # TODO: need to concat on the num_filter dimension!
    return conv
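I think the concat is the awkward part: without padding, each kernel width k produces a different time length, so the outputs can only be stacked on the time axis (dim=2), which is not what I want. A quick check with the standard convolution output-width formula (pure Python, the values are just examples):

```python
# Convolution output width for stride 1: W_out = W + 2*pad - k + 1
def conv_out_w(w, k, pad=0):
    return w + 2 * pad - k + 1

T = 100
# Without padding every kernel width gives a different time length:
print([conv_out_w(T, k) for k in range(1, 9)])  # [100, 99, 98, ..., 93]

# With pad = k // 2, odd widths keep T exactly; even widths overshoot by
# one frame and would need slicing back to T before a concat on dim=1:
print([conv_out_w(T, k, pad=k // 2) for k in range(1, 9)])
```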
highway
def highway_layer(data):
    H = mx.symbol.Activation(
        data=mx.symbol.FullyConnected(data=data, num_hidden=emb_size//2, name="highway_fcH"),
        act_type="relu"
    )
    T = mx.symbol.Activation(
        data=mx.symbol.FullyConnected(data=data, num_hidden=emb_size//2, bias=mx.sym.Variable('bias'), name="highway_fcT"),
        act_type="sigmoid"
    )
    return H * T + data * (1.0 - T)
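The highway gate is meant to blend the transform H with the identity path: y = H * T + x * (1 - T). A scalar sanity check of the two extremes:

```python
# Highway gating: y = H * t_gate + x * (1 - t_gate)
def highway(h, t_gate, x):
    return h * t_gate + x * (1.0 - t_gate)

print(highway(0.5, 1.0, 2.0))  # gate fully open: output is the transform, 0.5
print(highway(0.5, 0.0, 2.0))  # gate closed: output is the input, 2.0
```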
CBHG
def CBHG(data, K, proj1_size, proj2_size):
    bank = conv1dBank(data, K)
    poold_bank = mx.sym.Pooling(data=bank, pool_type='max', kernel=(2,), stride=(1,), name="CBHG_pool")
    proj1 = mx.sym.Convolution(data=poold_bank, kernel=(3,), num_filter=proj1_size, name='CBHG_conv1')
    (proj1, proj1_mean, proj1_var) = mx.sym.BatchNorm(data=proj1, output_mean_var=True, name='CBHG_batch1')
    proj1 = mx.sym.Activation(data=proj1, act_type='relu', name='CBHG_act1')
    proj2 = mx.sym.Convolution(data=proj1, kernel=(3,), num_filter=proj2_size, name='CBHG_conv2')
    (proj2, proj2_mean, proj2_var) = mx.sym.BatchNorm(data=proj2, output_mean_var=True, name='CBHG_batch2')
    residual = proj2 + data  # Error here: incompatible shapes
    for i in range(4):
        residual = highway_layer(residual)
    highway_pass = residual
    bidirectional_gru_cell = mx.rnn.BidirectionalCell(
        mx.rnn.GRUCell(num_hidden=emb_size//2, prefix='CBHG_gru1'),
        mx.rnn.GRUCell(num_hidden=emb_size//2, prefix='CBHG_gru2'),
        output_prefix='CBHG_bi_'
    )
    outputs, states = bidirectional_gru_cell.unroll(1, inputs=highway_pass, merge_outputs=True)
    return outputs
So, if I infer the shapes with a dummy input of shape (batch_size, num_bands, num_time_frames), I get an incompatible-shapes error at the residual sum:
in_cbhg = mx.sym.Variable("in_cbhg")
in_cbhg_shape = (2,80,100)
CBHG(in_cbhg,hp.decoder_num_banks,hp.embed_size,hp.n_mels).infer_shape(in_cbhg=in_cbhg_shape)
infer_shape error. Arguments:
in_cbhg: (2, 80, 100)
Incompatible attr in node _plus70 at 1th input: expected (2,80,767), got (2,80,100)
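Trying to trace where the 767 comes from (assuming hp.decoder_num_banks = 8, the post-processing-net value from the Tacotron paper), the widths line up if nothing is padded anywhere:

```python
T, K = 100, 8   # assumed values: input time frames and number of bank filters

# Concatenating on dim=2 sums the (different) output widths of the bank:
bank_w = sum(T - k + 1 for k in range(1, K + 1))   # 772
pool_w = bank_w - 2 + 1    # max-pool kernel=2, stride=1 -> 771
proj1_w = pool_w - 3 + 1   # conv kernel=3, no pad -> 769
proj2_w = proj1_w - 3 + 1  # conv kernel=3, no pad -> 767
print(proj2_w)             # 767, which cannot be added to an input of width 100
```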
How can I solve it? What’s wrong with my code? Is my input data shaped correctly?
Here are some CBHG implementations in TensorFlow: