I tried using only one Trainer, set up with all model parameters. To optimize two objectives alternately, I alternately computed two losses that depended on different subsets of parameters. I expected that only the parameters actually used would have non-zero gradients w.r.t. either loss (after calling loss.backward()). However, with deferred initialization, Trainer.step() forced me to run full forward passes that used all parameters in advance (besides, all the gradients must also have been updated since the last step), which is very inconvenient. Could someone explain what motivated the design of deferred initialization? What are the difference and the relationship between the ‘initialization’ in ‘deferred initialization’ and the ‘initialization’ in
Here’s what happened. When you wrote your network, you didn’t actually specify the full shape information needed to initialize your parameter ndarrays.
For example, perhaps you had a Dense layer and you specified how many output units it should generate, but you didn’t specify how many input units there were. Without the number of input units, it’s impossible to know what size the weight matrix should be.
So what’s a deep learning library to do when someone doesn’t specify all the information? Well, there are two options.
- Throw up its hands, burn the house down, and yell at the user for not specifying every detail.
- Wait for the user to push data through the network to figure out on its own what the missing input shape information should have been, and then initialize.
Deferred initialization is the much friendlier latter option. You didn’t give it all the information and MXNet said “that’s okay, I’ll wait (defer initialization) until you push data through to figure out the rest.”
But here’s the kicker. In your case, you didn’t push all the possible input data through. Consequently, MXNet took a look at the parameters that had never seen data and balked: a training update was being requested, yet their shapes had never been resolved, so they had never been initialized!
So you have at least two options.
- Specify all shape information up front so that deferred initialization is not required. (E.g., specify the in units size to the
- Before you begin training, push a random batch of each kind of data through your network so that it can figure out the input sizes itself. Then start training.
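To make the failure and the two fixes above concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not MXNet code: the `LazyDense` class, the `step` function, and the branch names are all hypothetical and only mimic the behaviour described in this thread. (In Gluon itself, as far as I know, the first fix corresponds to passing `in_units` when constructing `nn.Dense`.)

```python
import random


class LazyDense:
    """Toy dense layer with deferred initialization (hypothetical, not MXNet's)."""

    def __init__(self, units, in_units=None):
        self.units = units
        self.in_units = in_units
        self.weight = None
        if in_units is not None:
            self._allocate()  # shape fully known up front: initialize now

    def _allocate(self):
        # weight matrix of shape (units, in_units)
        self.weight = [[random.gauss(0.0, 0.01) for _ in range(self.in_units)]
                       for _ in range(self.units)]

    def __call__(self, batch):
        if self.weight is None:
            # deferred initialization: infer the input size from the first batch
            self.in_units = len(batch[0])
            self._allocate()
        return [[sum(w * x for w, x in zip(row, sample)) for row in self.weight]
                for sample in batch]


def step(layers):
    """Toy stand-in for a trainer update: refuses unresolved parameters."""
    for name, layer in layers.items():
        if layer.weight is None:
            raise RuntimeError(f"parameter of '{name}' has not been initialized")
    # a real trainer would apply the gradient updates here


# The problem: only one branch has seen data before the update is requested.
layers = {"branch_a": LazyDense(4), "branch_b": LazyDense(4)}
layers["branch_a"]([[1.0, 2.0, 3.0]])
try:
    step(layers)
    failed = False
except RuntimeError:
    failed = True

# Option 1: give every layer its input size up front.
fix1 = {"branch_a": LazyDense(4, in_units=3), "branch_b": LazyDense(4, in_units=5)}
step(fix1)

# Option 2: push one random batch of each kind of data through before training.
fix2 = {"branch_a": LazyDense(4), "branch_b": LazyDense(4)}
fix2["branch_a"]([[random.random() for _ in range(3)]])
fix2["branch_b"]([[random.random() for _ in range(5)]])
step(fix2)
```

Both options leave every weight matrix allocated before the first update, which is all the trainer in this sketch checks for.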
Thank you for your answer. I fully understand why deferred initialization should be there now.
Just to add to @jmacglashan’s response, deferred initialization with shape inference is there by design as a convenience to the user. Many of the shape dimensions are dependent on other shape dimensions, so you don’t have to specify them all. For example, given the kernel size, stride and padding type in addition to the input shape, the output shape is completely determined, so working it out by hand is an unnecessary computation. With shape inference, you can specify a minimal number of dimensions and let the library do that tedious arithmetic for you.
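As an illustration of the kind of arithmetic shape inference saves you from, here is a small sketch of the standard output-size formula for one spatial dimension of a convolution. The function name `conv_output_size` is made up for this example, and it assumes an explicit integer padding on each side and no dilation, rather than a named padding type.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length of one spatial dimension of a convolution.

    Assumes `padding` elements are added on each side and dilation is 1.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


# A 3-wide kernel with stride 1 and padding 1 keeps the size unchanged.
same_size = conv_output_size(32, kernel_size=3, stride=1, padding=1)

# A 7-wide kernel with stride 2 and padding 3 halves a 224-wide input.
halved = conv_output_size(224, kernel_size=7, stride=2, padding=3)
```

Chaining this by hand through every layer of a deep network is exactly the bookkeeping that shape inference does for you.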