Fine-tune a pre-trained language model with GluonNLP

I am trying to fine-tune a pre-trained awd_lstm_lm_1150 language model on a dataset of my own, building on this tutorial.
Here is what the original model looks like when loaded from the gluonnlp model zoo:

dataset_name = 'wikitext-2'
awd_model_name = 'awd_lstm_lm_1150'
awd_model, voc = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True)
print(awd_model)
print(voc)
>>> AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
>>> Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

As my own vocab has length 1031, I’d like to either

  1. edit the last Dense layer to output 1031 classes
  2. add an additional Dense layer on top, like the following: nn.Dense(in_units=33278, units=1031)

I cannot seem to figure out how to achieve #1.

As for #2 (the less optimal option by definition), I declared a Sequential model like this:

net = nn.Sequential()
net.add(awd_model)
net.add(nn.Dense(in_units=33278, units=1031))

but then the training process errors out, as the Sequential object lacks several attributes that the original gluonnlp.model.language_model.AWDRNN object provides.

For instance, running the train function from the tutorial linked above, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-4b4b6bfe8cc9> in <module>()
----> 1 train(net, farewell_train_data, epochs, lr=0.01)

<ipython-input-29-7991a5aa7704> in train(model, train_data, epochs, lr)
     22         start_log_interval_time = time.time()
     23         hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
---> 24                    for ctx in context]
     25         for i, (data, target) in enumerate(train_data):
     26             data_list = gluon.utils.split_and_load(data, context,

<ipython-input-29-7991a5aa7704> in <listcomp>(.0)
     22         start_log_interval_time = time.time()
     23         hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
---> 24                    for ctx in context]
     25         for i, (data, target) in enumerate(train_data):
     26             data_list = gluon.utils.split_and_load(data, context,

AttributeError: 'Sequential' object has no attribute 'begin_state'

Do you guys have any clue?

EDITED: Corrected Dense layer as suggested by @FraPochetti in later comments.

Hi @FraPochetti,

As you’ve already spotted, there’s a problem with the type of model you’ve created. AWDRNN is a recurrent network that is intended to be used one step at a time, but Sequential models don’t have the methods required to do this: begin_state being one of them.
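
For example (a minimal sketch with dummy inputs, assuming awd_model is loaded as in your snippet, not the tutorial's exact code), the hidden state is created once with begin_state and then threaded through every forward call:

import mxnet as mx

ctx = mx.cpu()
bptt, batch_size = 35, 20

hidden = awd_model.begin_state(batch_size, func=mx.nd.zeros, ctx=ctx)
X = mx.nd.ones((bptt, batch_size), ctx=ctx)  # (bptt, batch) of token indices
output, hidden = awd_model(X, hidden)        # hidden is carried over to the next batch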

One great thing about GluonNLP models is their structure. You can see from the printout that this model has been designed with 3 components: embedding, encoder and decoder. They are just attributes of the AWDRNN Block, so you should be able to get and set them.

awd_model.decoder

# HybridSequential(
#   (0): Dense(400 -> 33278, linear)
# )

Since you’re interested in replacing the last layer, you can create your own HybridSequential Block containing a Dense Block, and then set the decoder to use this instead.

new_decoder = mx.gluon.nn.HybridSequential()
new_decoder.add(mx.gluon.nn.Dense(units=1031, flatten=False))
new_decoder.initialize()

awd_model.decoder = new_decoder

You can then create a Trainer object in such a way that only the new_decoder gets trained and all of the other pre-trained weights (from the embedding and encoder) remain fixed.
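
For instance (a minimal sketch, not the tutorial's exact train function), you could pass only the decoder's parameters to the Trainer, or freeze the pre-trained blocks by setting their grad_req to 'null':

from mxnet import gluon

# Only the new decoder's parameters get updated.
trainer = gluon.Trainer(new_decoder.collect_params(), 'sgd', {'learning_rate': 0.01})

# Alternative: keep a Trainer over all parameters but freeze the pre-trained blocks.
# awd_model.embedding.collect_params().setattr('grad_req', 'null')
# awd_model.encoder.collect_params().setattr('grad_req', 'null')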

Thanks @thomelane! This makes perfect sense. I was trying to change the output of the layer and did not think to replace it altogether.

I am actually facing another problem now. When I feed my text data into the newly defined network (i.e. after replacing the decoder) the decoder drops the batch dimension during the forward pass (i.e. the model output is an array of 2 dimensions (bptt, vocab) instead of 3 (bptt, batch, vocab)).

Here is what I mean.


bptt = 35
batch_size = 20

This is what happens when I feed my data into the ORIGINAL NETWORK

dataset_name = 'wikitext-2'
awd_model_name = 'awd_lstm_lm_1150'
awd_model, voc = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True)
print(awd_model)
print(voc)
>>> AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
>>> Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
model = awd_model
hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx) for ctx in context]

data, target = next(iter(farewell_train_data))

data_list = gluon.utils.split_and_load(data, context, batch_axis=1, even_split=True)
target_list = gluon.utils.split_and_load(target, context, batch_axis=1, even_split=True)
hiddens = detach(hiddens)

X, y, h = data_list[0], target_list[0], hiddens[0]

output, h = model(X, h)

print(data.shape, target.shape)
print(X.shape, y.shape)
print(output.shape)
>>> (35, 20) (35, 20)
>>> (35, 20) (35, 20)
>>> (35, 20, 33278)

This is what happens when I feed my data into the NEW NETWORK, i.e. after replacing the decoder

new_decoder = mx.gluon.nn.HybridSequential()
new_decoder.add(mx.gluon.nn.Dense(units=1031))
new_decoder.initialize()
awd_model.decoder = new_decoder
print(awd_model)
>>> AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(None -> 1031, linear)
  )
)
model = awd_model
hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx) for ctx in context]

data, target = next(iter(farewell_train_data))

data_list = gluon.utils.split_and_load(data, context, batch_axis=1, even_split=True)
target_list = gluon.utils.split_and_load(target, context, batch_axis=1, even_split=True)
hiddens = detach(hiddens)

X, y, h = data_list[0], target_list[0], hiddens[0]

output, h = model(X, h)

print(data.shape, target.shape)
print(X.shape, y.shape)
print(output.shape)
>>> (35, 20) (35, 20)
>>> (35, 20) (35, 20)
>>> (35, 1031)

I checked step by step and it is indeed the new decoder that is messing up the shapes.
Up to the encoder, the two networks produce exactly the same arrays.
@thomelane do you have any idea what might be happening?
Thanks a ton in advance, man!

@thomelane, I figured it out.

The new_decoder should not flatten the output. If I use the decoder below, the batch dimension reappears:

new_decoder.add(mx.gluon.nn.Dense(units=1031, flatten=False))
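
For completeness, a standalone shape check with dummy data (not from the thread above) shows the difference: with the default flatten=True, Dense collapses every axis after the first, while flatten=False applies the projection to the last axis only.

import mxnet as mx
from mxnet.gluon import nn

x = mx.nd.ones((35, 20, 400))      # (bptt, batch, hidden), as in the thread

dense_flat = nn.Dense(units=1031)  # flatten=True by default
dense_flat.initialize()
print(dense_flat(x).shape)         # (35, 1031): batch axis folded away

dense_seq = nn.Dense(units=1031, flatten=False)
dense_seq.initialize()
print(dense_seq(x).shape)          # (35, 20, 1031): batch axis preserved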

Could you maybe edit your reply accordingly, please?

Many thanks for the update @FraPochetti, I’ve updated my answer above. And now I know the purpose of your language model :smile: Good luck in your new role!
