https://en.diveintodeeplearning.org/chapter_recurrent-neural-networks/rnn-scratch.html

Although it may be obvious to the observant reader, some people may wonder why we need a separate `init_rnn_state()` function.

Let us assume that the first input to the RNN is called X_1 and the first output created by the RNN is called y_1.

Consider the general equation:

\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h)

For the first time step, t=1, so we have:

\mathbf{H}_1 = \phi(\mathbf{X}_1 \mathbf{W}_{xh} + \mathbf{H}_{0} \mathbf{W}_{hh} + \mathbf{b}_h).

Now, what is H_0? It is a state that MUST exist before the RNN starts processing the first input, so the code cannot work unless we initialise it. Even if we choose to initialise it to 0, we have to make sure its shape is such that the above equation is well-defined.

If you look closely, you will see that the shape of every H_t is \text{batch_size} \times \text{num_hidden_units}.

Since batch_size is not fixed when the network is created, we need to initialise the H_0 matrix at runtime, hence the need for a separate function.
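To make the shape argument concrete, here is a minimal sketch in plain NumPy rather than the book's framework code. The helper `rnn_step`, the tanh activation, and the sizes (`vocab_size = 28`, `num_hiddens = 16`) are my own assumptions for illustration; only the name `init_rnn_state` comes from the chapter, and its signature here is simplified.

```python
import numpy as np

def init_rnn_state(batch_size, num_hiddens):
    """Return H_0 as a one-element tuple of zeros, shaped (batch_size, num_hiddens)."""
    return (np.zeros((batch_size, num_hiddens)),)

def rnn_step(X, H, W_xh, W_hh, b_h):
    """One step of H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h), with phi = tanh (assumed)."""
    return np.tanh(X @ W_xh + H @ W_hh + b_h)

rng = np.random.default_rng(0)
vocab_size, num_hiddens = 28, 16  # hypothetical sizes
W_xh = rng.normal(scale=0.01, size=(vocab_size, num_hiddens))
W_hh = rng.normal(scale=0.01, size=(num_hiddens, num_hiddens))
b_h = np.zeros(num_hiddens)

# The same parameters serve any batch size; only H_0 has to be rebuilt.
for batch_size in (1, 32):
    (H0,) = init_rnn_state(batch_size, num_hiddens)
    X1 = rng.normal(size=(batch_size, vocab_size))
    H1 = rnn_step(X1, H0, W_xh, W_hh, b_h)
    print(batch_size, H0.shape, H1.shape)
```

The weights never mention batch_size, which is why they can be created once, while H_0 has to be created per call.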

param.grad[:] *= theta / norm

When I tried to run the original version, I received an error about being unable to “subscript a function”; perhaps it should be `param.grad()[:]`.
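For context, the quoted line is the in-place rescaling step of gradient clipping. Below is a minimal NumPy sketch of that step, assuming the gradients are plain arrays held in a list (the `grads` list and this `grad_clipping` signature are illustrative, not the book's exact code). Whether you write `param.grad[:]` or `param.grad()[:]` in the real code depends on whether `grad` is an attribute or a method of your parameter object, which would explain the error above.

```python
import numpy as np

def grad_clipping(grads, theta):
    """Rescale all gradients in place so their global L2 norm is at most theta."""
    norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if norm > theta:
        for g in grads:
            g[:] *= theta / norm  # in-place update, mirrors the quoted line
    return norm  # norm before clipping

# Hypothetical gradients with global norm sqrt(3^2 + 4^2) = 5
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
norm_before = grad_clipping(grads, theta=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
print(norm_before, norm_after)
```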

I wonder if there is a convenient way to initialize the hidden state after training is complete, before making predictions. If our model is trained and fine-tuned and we wish to obtain **multiple** predictions starting from the beginning-of-sentence token `<bos>`, we expect them to be different. As far as I can tell, if the hidden state is initialized with zeros (the default) in `predict_ch8`, the model will predict the same sequence on every run.
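One possible way to get varied outputs, sketched here under the assumption that the prediction loop picks the highest-scoring token at each step: with a fixed initial state and a greedy pick, the result is deterministic, whereas sampling from the softmax distribution can differ between runs. The helper names and the `logits` values below are hypothetical and are not part of `predict_ch8`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of scores."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def pick_greedy(logits):
    """Deterministic: always the index of the largest score."""
    return int(np.argmax(logits))

def pick_sampled(logits, rng):
    """Stochastic: draw an index with probability given by softmax(logits)."""
    return int(rng.choice(len(logits), p=softmax(logits)))

logits = np.array([2.0, 1.5, 0.5, 0.1])  # hypothetical scores for 4 tokens
rng = np.random.default_rng()            # unseeded, so runs may differ

print([pick_greedy(logits) for _ in range(5)])        # the same index every time
print([pick_sampled(logits, rng) for _ in range(5)])  # may vary between runs
```

Drawing a small random H_0 instead of zeros would be another option, though the network was presumably trained with zero initial states, so sampling the output seems the less invasive change.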