Fine-tune a pre-trained language model with GluonNLP

I am trying to fine-tune a pre-trained awd_lstm_lm_1150 language model on a dataset of my own, building on this tutorial.
Here is what the original model looks like when loaded from the gluonnlp model zoo:

dataset_name = 'wikitext-2'
awd_model_name = 'awd_lstm_lm_1150'
awd_model, voc = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True)
print(awd_model)
print(voc)
>>> AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
>>> Vocab(size=33278, unk="<unk>", reserved="['<eos>']")

As my own vocab has length 1031, I’d like to either

  1. edit the last Dense layer to output 1031 classes
  2. add an additional Dense layer on top, like the following: nn.Dense(in_units=33278, units=1031)

I cannot seem to figure out how to achieve #1.

As for #2 (the less optimal option by definition), I declared a Sequential model like this:

net = nn.Sequential()
net.add(awd_model)
net.add(nn.Dense(in_units=33278, units=1031))

but then the training process errors out, as the Sequential object lacks several attributes that the original gluonnlp.model.language_model.AWDRNN object provides.

For instance, running the train function from the tutorial linked above, I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-4b4b6bfe8cc9> in <module>()
----> 1 train(net, farewell_train_data, epochs, lr=0.01)

<ipython-input-29-7991a5aa7704> in train(model, train_data, epochs, lr)
     22         start_log_interval_time = time.time()
     23         hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
---> 24                    for ctx in context]
     25         for i, (data, target) in enumerate(train_data):
     26             data_list = gluon.utils.split_and_load(data, context,

<ipython-input-29-7991a5aa7704> in <listcomp>(.0)
     22         start_log_interval_time = time.time()
     23         hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx)
---> 24                    for ctx in context]
     25         for i, (data, target) in enumerate(train_data):
     26             data_list = gluon.utils.split_and_load(data, context,

AttributeError: 'Sequential' object has no attribute 'begin_state'

Do you guys have any clue?

EDITED: Corrected Dense layer as suggested by @FraPochetti in later comments.

Hi @FraPochetti,

As you’ve already spotted, there’s a problem with the type of model you’ve created. AWDRNN is a recurrent network that is intended to be used one step at a time, but Sequential models don’t have the methods required to do this: begin_state being one of them.
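
For example (a minimal sketch with dummy inputs, assuming awd_model is loaded as in your snippet, not the tutorial's exact code), the hidden state is created once with begin_state and then threaded through every forward call:

import mxnet as mx

ctx = mx.cpu()
bptt, batch_size = 35, 20

hidden = awd_model.begin_state(batch_size, func=mx.nd.zeros, ctx=ctx)
X = mx.nd.ones((bptt, batch_size), ctx=ctx)  # (bptt, batch) of token indices
output, hidden = awd_model(X, hidden)        # hidden is carried over to the next batch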

One great thing about GluonNLP models is their structure. You can see from the printout that this model has been designed with 3 components: embedding, encoder and decoder. They are just attributes of the AWDRNN Block, so you should be able to get and set them.

awd_model.decoder

# HybridSequential(
#   (0): Dense(400 -> 33278, linear)
# )

Since you’re interested in replacing the last layer, you can create your own HybridSequential Block containing a Dense Block, and then set the decoder to use this instead.

new_decoder = mx.gluon.nn.HybridSequential()
new_decoder.add(mx.gluon.nn.Dense(units=1031, flatten=False))
new_decoder.initialize()

awd_model.decoder = new_decoder

You can then create a Trainer object in such a way that only the new_decoder gets trained and all of the other pre-trained weights (from the embedding and encoder) remain fixed.
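
For instance (a minimal sketch, not the tutorial's exact train function), you could pass only the decoder's parameters to the Trainer, or freeze the pre-trained blocks by setting their grad_req to 'null':

from mxnet import gluon

# Only the new decoder's parameters get updated.
trainer = gluon.Trainer(new_decoder.collect_params(), 'sgd', {'learning_rate': 0.01})

# Alternative: keep a Trainer over all parameters but freeze the pre-trained blocks.
# awd_model.embedding.collect_params().setattr('grad_req', 'null')
# awd_model.encoder.collect_params().setattr('grad_req', 'null')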

Thanks @thomelane! This makes perfect sense. I was trying to change the output of the layer and did not think to replace it altogether.

I am actually facing another problem now. When I feed my text data into the newly defined network (i.e. after replacing the decoder) the decoder drops the batch dimension during the forward pass (i.e. the model output is an array of 2 dimensions (bptt, vocab) instead of 3 (bptt, batch, vocab)).

Here is what I mean.


bptt = 35
batch_size = 20

This is what happens when I feed my data into the ORIGINAL NETWORK

dataset_name = 'wikitext-2'
awd_model_name = 'awd_lstm_lm_1150'
awd_model, voc = nlp.model.get_model(
    awd_model_name,
    vocab=vocab,
    dataset_name=dataset_name,
    pretrained=True)
print(awd_model)
print(voc)
>>> AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(400 -> 33278, linear)
  )
)
>>> Vocab(size=33278, unk="<unk>", reserved="['<eos>']")
model = awd_model
hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx) for ctx in context]

data, target = next(iter(farewell_train_data))

data_list = gluon.utils.split_and_load(data, context, batch_axis=1, even_split=True)
target_list = gluon.utils.split_and_load(target, context, batch_axis=1, even_split=True)
hiddens = detach(hiddens)

X, y, h = data_list[0], target_list[0], hiddens[0]

output, h = model(X, h)

print(data.shape, target.shape)
print(X.shape, y.shape)
print(output.shape)
>>> (35, 20) (35, 20)
>>> (35, 20) (35, 20)
>>> (35, 20, 33278)

This is what happens when I feed my data into the NEW NETWORK, i.e. after replacing the decoder

new_decoder = mx.gluon.nn.HybridSequential()
new_decoder.add(mx.gluon.nn.Dense(units=1031))
new_decoder.initialize()
awd_model.decoder = new_decoder
print(awd_model)
>>> AWDRNN(
  (embedding): HybridSequential(
    (0): Embedding(33278 -> 400, float32)
    (1): Dropout(p = 0.65, axes=(0,))
  )
  (encoder): Sequential(
    (0): LSTM(400 -> 1150, TNC)
    (1): LSTM(1150 -> 1150, TNC)
    (2): LSTM(1150 -> 400, TNC)
  )
  (decoder): HybridSequential(
    (0): Dense(None -> 1031, linear)
  )
)
model = awd_model
hiddens = [model.begin_state(batch_size//len(context), func=mx.nd.zeros, ctx=ctx) for ctx in context]

data, target = next(iter(farewell_train_data))

data_list = gluon.utils.split_and_load(data, context, batch_axis=1, even_split=True)
target_list = gluon.utils.split_and_load(target, context, batch_axis=1, even_split=True)
hiddens = detach(hiddens)

X, y, h = data_list[0], target_list[0], hiddens[0]

output, h = model(X, h)

print(data.shape, target.shape)
print(X.shape, y.shape)
print(output.shape)
>>> (35, 20) (35, 20)
>>> (35, 20) (35, 20)
>>> (35, 1031)

I checked step by step and it is indeed the new decoder that is messing up the shapes.
Up to the encoder, the two networks produce exactly the same arrays.
@thomelane do you have any idea what might be happening?
Thanks a ton in advance, man!

@thomelane, I figured it out.

The new_decoder should not flatten the output. If I use the decoder below, the batch dimension reappears:

new_decoder.add(mx.gluon.nn.Dense(units=1031, flatten=False))
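
For completeness, a standalone shape check with dummy data (not from the thread above) shows the difference: with the default flatten=True, Dense collapses every axis after the first, while flatten=False applies the projection to the last axis only.

import mxnet as mx
from mxnet.gluon import nn

x = mx.nd.ones((35, 20, 400))      # (bptt, batch, hidden), as in the thread

dense_flat = nn.Dense(units=1031)  # flatten=True by default
dense_flat.initialize()
print(dense_flat(x).shape)         # (35, 1031): batch axis folded away

dense_seq = nn.Dense(units=1031, flatten=False)
dense_seq.initialize()
print(dense_seq(x).shape)          # (35, 20, 1031): batch axis preserved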

Could you maybe edit your reply accordingly, please?

Many thanks for the update @FraPochetti, I’ve updated my answer above. And now I know the purpose of your language model :smile: Good luck in your new role!
