How to optimize GPU memory usage for a deep neural network?



I have a neural network model with two hidden layers. The input dimension is 240k, and the hidden layers have 1024 and 512 nodes respectively. The output dimension is also 240k. It’s a fully connected network, the optimizer is Adam, and the batch size is 128. The model is implemented in Python with MXNet 0.11.0.

At runtime the GPU memory consumption is about 6G, which is much larger than my theoretical calculation. Suppose each coefficient uses the float data type and occupies 4 bytes. The input layer needs 1024 * 240k * 4 = 0.9G of memory, and the output layer needs about 0.45G. Even adding each batch of data, the total memory consumption should not exceed about 1.5G. What am I missing in this calculation?

Are there any other suggestions for improving GPU memory usage so I can accommodate an even larger model? Really appreciate it.


You need to account for the weight matrices, which are 240k * 1024 for the input layer and 512 * 240k for the output layer. Also, when using Adam there are two extra states for each weight matrix, and each state costs as much as the weights do.

If you use SGD without momentum there is no state cost.
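To make this accounting concrete, here is a back-of-envelope sketch in Python for the model described in the question (240k → 1024 → 512 → 240k, float32, Adam); the layer sizes are taken from the thread:

```python
# Rough memory accounting for the weights and Adam optimizer states.
BYTES = 4          # float32
GIB = 1024 ** 3

# Weight matrices (biases are negligible at these sizes)
w_in = 240_000 * 1024    # input -> hidden1
w_h = 1024 * 512         # hidden1 -> hidden2
w_out = 512 * 240_000    # hidden2 -> output
weights = (w_in + w_h + w_out) * BYTES / GIB

# Adam keeps two extra states (first and second moment) per weight,
# so the optimizer triples the weight memory.
adam_total = weights * 3
print(f"weights: {weights:.2f} GiB, weights + Adam states: {adam_total:.2f} GiB")
```

With plain SGD (no momentum), only the `weights` term would remain.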


Many thanks for the reply.

The weight matrix for the input layer costs 1024 * 240k * 4 = 1G of memory;
The weight matrix for the output layer costs 512 * 240k * 4 = 0.5G;
The Adam optimizer triples the memory usage: 1.5G * 3 = 4.5G. Is that correct?

Then there is still about 1.5G of memory unaccounted for. The input and output data together are 128 * 2 * 240k * 4 = 0.25G per batch. Any other hints?

Also, does the sparse feature of MXNet help here if the input is indeed sparse?
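If the input really is sparse, the storage savings can be large. As a rough illustration (not MXNet-specific — MXNet's sparse `CSRNDArray` support landed after 0.11, so an upgrade would be needed, and the 0.1% density below is purely an assumption), here is the dense-vs-CSR storage arithmetic for one batch:

```python
# Storage comparison for one 128 x 240k float32 batch, dense vs CSR.
rows, cols = 128, 240_000
density = 0.001                # assumption: ~0.1% of inputs are non-zero
nnz = int(rows * cols * density)

dense_bytes = rows * cols * 4  # every element stored as float32
# CSR keeps the non-zero values (float32), their column indices (int32),
# and one row pointer per row plus one (int32).
csr_bytes = nnz * 4 + nnz * 4 + (rows + 1) * 4

print(f"dense: {dense_bytes / 1e6:.1f} MB, CSR: {csr_bytes / 1e6:.2f} MB")
```

Note this only shrinks the input batch, not the dense weight matrices, so by itself it won't recover the missing 1.5G.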


Is the observed memory usage at training time? If so, there’s additional storage for gradients etc.


Yes, this is during training; the GPU memory remains constant across all batches and epochs.


You will also need space to cache the activation values from the forward pass; they are used during the backward pass to compute gradients.
It is possible to reduce memory usage at the cost of additional forward passes (to recompute those activation values). Mu has a slide about this here: (# 14)
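Putting the pieces from this thread together (weights, Adam states, gradients, cached activations), a rough sketch lands close to the observed 6G; the remainder is plausibly workspace buffers and the CUDA context itself. Layer sizes and batch size are taken from the question:

```python
# Rough full training-time footprint for 240k -> 1024 -> 512 -> 240k, batch 128.
GIB = 1024 ** 3
BYTES = 4  # float32

dims = [240_000, 1024, 512, 240_000]
batch = 128

# Weights (biases negligible), one gradient per weight, two Adam states per weight.
weights = sum(a * b for a, b in zip(dims, dims[1:])) * BYTES / GIB
grads = weights
adam_states = 2 * weights

# Activations cached for the backward pass: one batch-sized buffer per layer.
activations = batch * sum(dims) * BYTES / GIB

total = weights + grads + adam_states + activations
print(f"weights={weights:.2f} grads={grads:.2f} adam={adam_states:.2f} "
      f"activations={activations:.2f} total={total:.2f} GiB")
```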


Here’s what you might want to do instead (as @piiswrong already mentioned, the memory footprint is dominated by the embedding matrices): use an explicit embedding only for the head of the distribution and decode the tail as you go. If you have a language dataset, you might be able to handle the tail with a character-LSTM. That said, this is a much more complex model.