Memory allocation of Parameters


#1

I am trying to train a network that doesn’t store its parameters but generates them on the fly. I would like to understand how MXNet allocates memory for parameters. For example, if I use
W1 = mx.gluon.Parameter('W1', shape=(num_filter_conv_layer1, 1, 3, 3)) without initialization, does it allocate memory for it?


#2

Hi @momran,

If you don’t intend to store parameters (i.e. retain after forward pass) and want to generate them ‘on the fly’ (i.e. during the forward pass) then I don’t think you need to be using Parameter here. Parameter is used to hold parameters whose state will be maintained and updated during training. Sounds like you could just operate on the input to the Block to generate this (i.e. ‘on the fly’). Unless I have misunderstood what you’re trying to do!

With regard to memory allocation for the parameters (if you do keep them as a Parameter): if the shape is known when defining the Parameter, then the memory will be allocated when you call initialize on the Block. If the shape depends on the input, then the initialization will be deferred and the memory will be allocated the first time data is passed to the Block.

Check out this tutorial for more information.


#3

@thomelane What I need to do is this: at the beginning of training, I initialize the parameters with a random function that takes a seed and generates the value of each parameter on the fly. Then I calculate the gradient and save only the accumulated gradient of some parameters, which I add to the values I regenerate in the next training step, so I don’t have to allocate memory for the parameters. What is stopping me, though, is how to build the network layers without assigning allocated parameters to them. Can I do that?


#4

Would you be able to explain what you’re trying to achieve at a high level? And could you share an example of your implementation of your Block so far so I can get a better understanding, thanks!

With Parameter, the memory for the values and gradients is allocated when you call initialize (if the shape is known) or, when the shape has to be inferred, when you call forward on the Block for the first time. You might not want to use Parameter for this, though, but instead manage the NDArrays yourself and call attach_grad as required.


#5

@thomelane
What I want to do at a high level is the following.
Training step 1:

  1. Do the forward pass by generating the weights randomly on the fly, using a random function with a seed for each of them, so that I can regenerate them and don’t need to store them.
  2. During backprop, calculate the gradient, store only part of it, and discard the rest.

Training step 2:

  1. During the forward pass, recompute the random weights from the same seeds and add the stored gradients to the weights that need updating.
  2. Same as in step 1.

This way I only need to keep the accumulated gradients in memory.
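The two training steps above can be sketched without any framework, just to make the bookkeeping concrete. This is a minimal illustration using only the Python standard library; the names `generate_weights`, `rebuild_weights`, and `train_step`, the toy loss, and the choice of tracked indices are all hypothetical and not part of MXNet.

```python
import random

def generate_weights(seed, n):
    """Regenerate the same n pseudo-random weights from a seed; nothing is stored."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in range(n)]

def rebuild_weights(seed, n, accumulated, lr):
    """Regenerated base weights plus corrections from the stored gradient sums.

    `accumulated` maps a weight index to its accumulated gradient and only
    contains the tracked subset of indices.
    """
    weights = generate_weights(seed, n)
    for idx, grad_sum in accumulated.items():
        weights[idx] -= lr * grad_sum
    return weights

def train_step(seed, n, accumulated, lr):
    weights = rebuild_weights(seed, n, accumulated, lr)
    # Toy loss sum((1 - w)^2), so dL/dw = -2 * (1 - w).
    grads = [-2.0 * (1.0 - w) for w in weights]
    # Keep the gradient only for tracked indices and discard the rest.
    for idx in accumulated:
        accumulated[idx] += grads[idx]
    return weights

# Track only the first two of four weights; this dict is the only persistent state.
accumulated = {0: 0.0, 1: 0.0}
for _ in range(3):
    weights = train_step(seed=42, n=4, accumulated=accumulated, lr=0.1)
```

The untracked weights come out identical on every step because they are rebuilt from the seed alone, while the tracked ones drift according to the stored sums.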

I can see that Parameter allocates memory when initialized, but if I use NDArrays as my parameters I still need to allocate space for them, and the Trainer needs parameters as input.
Do you suggest that I change the inputs to the Trainer so that I pass only the gradients to it?

Hope I explained it properly, let me know if you didn’t get what I am trying to do.


#6

Okay, based on what you’ve said, how does the following look? I think it follows the flow you outlined.

import mxnet as mx

def generate_weights(seed):
    mx.random.seed(seed)
    return mx.nd.random.uniform(high=1, shape=(2,2))

def step(cumulative_gradient):
    weights = generate_weights(42)
    print("\n weights before update:", weights)
    learning_rate = 0.1
    # update step for subset of weights (standard sgd)
    weights[0,:] -= learning_rate * cumulative_gradient
    print("\n weights after update:", weights)
    weights.attach_grad()
    with mx.autograd.record():
        # dummy loss function to make all weights close to 1
        loss = mx.nd.sum(mx.nd.square(mx.nd.ones(shape=(2,2)) - weights))
    loss.backward()
    # accumulate a subset of the weights' gradient, where the subset is just the first row of weights
    cumulative_gradient += weights.grad[0,:]
    print("\n cumulative_gradient:", cumulative_gradient)
    return cumulative_gradient

cumulative_gradient = mx.nd.zeros(shape=(2,))
for i in range(10):
    print("\n ##### step", i)
    cumulative_gradient = step(cumulative_gradient)

#7

@thomelane Yes that’s very helpful, thank you.