How to optimize the GPU memory usage for deep neural network?

mg0880gm · October 3, 2017, 4:28pm

Hi,

I have a neural network model with two hidden layers. The input has a dimension of 240k, and the hidden layers have 1024 and 512 nodes respectively. The output dimension is also 240k. It’s a fully connected neural network and the optimizer is Adam. The batch size is 128. The logic is implemented in Python with mxnet 0.11.0.

At runtime the GPU memory consumption is about 6G, which is much larger than my theoretical calculation. Suppose each coefficient uses the float data type and occupies 4 bytes. The input layer needs 1024 * 237k * 4 = 0.9G of memory. The output layer needs about 0.45G. Even adding each batch of data, the total memory consumption should not exceed about 1.5G. May I ask what I am missing in the calculation?

Are there any other suggestions for reducing GPU memory usage so that an even larger model can fit? I would really appreciate it.

piiswrong · October 3, 2017, 7:43pm

You need to account for the weight matrix, which is 240k*1024 for input/output each. Also, when using Adam there are two states for each weight matrix. Each state costs as much memory as the weight itself.

If you use SGD without momentum there won’t be any state cost.
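
The arithmetic in this reply can be sketched in a few lines of Python. This is only a rough estimate: it takes the layer sizes from the original post (using 512 as the width feeding the output layer), treats 240k as 240,000, ignores biases, assumes float32 everywhere, and counts 1G as 10^9 bytes. The helper names are made up for illustration and are not part of MXNet.

```python
# Rough float32 memory estimate for the dense layers described in this thread.
BYTES_PER_FLOAT = 4
LAYER_SIZES = [240_000, 1024, 512, 240_000]  # input, hidden 1, hidden 2, output


def weight_counts(sizes):
    """Number of weights (biases ignored) in each fully connected layer."""
    return [n_in * n_out for n_in, n_out in zip(sizes, sizes[1:])]


def memory_gb(n_values, copies=1):
    """Memory in GB (10^9 bytes) for `copies` float32 buffers of n_values each."""
    return n_values * BYTES_PER_FLOAT * copies / 1e9


counts = weight_counts(LAYER_SIZES)
total_weights = sum(counts)

weights_gb = memory_gb(total_weights)        # the weights alone
adam_gb = memory_gb(total_weights, 1 + 2)    # weights plus two Adam states
sgd_gb = memory_gb(total_weights, 1)         # plain SGD keeps no extra state

for n, (n_in, n_out) in zip(counts, zip(LAYER_SIZES, LAYER_SIZES[1:])):
    print(f"{n_in} x {n_out}: {memory_gb(n):.2f} GB")
print(f"weights only:         {weights_gb:.2f} GB")
print(f"weights + Adam state: {adam_gb:.2f} GB")
print(f"weights + plain SGD:  {sgd_gb:.2f} GB")
```

Under these assumptions the weights take roughly 1.5G and the two Adam states bring that to roughly 4.4G, before gradients or activations are counted.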

mg0880gm · October 3, 2017, 8:06pm

Thanks a lot for the reply.

The weight matrix for the input layer costs 1024 * 240k * 4 = 1G of memory;
The weight matrix for the output layer costs 512 * 240k * 4 = 0.5G of memory;
The Adam optimizer triples the memory usage to 1.5G * 3 = 4.5G, is that correct?

Then there is still about 1.5G of memory unaccounted for. The input and output data in total are 128 * 2 * 240k values, which is about 0.25G. Any other hints, please?

Also, does the sparse feature of mxnet help here if the input is indeed sparse?

simonco · October 3, 2017, 9:15pm

Is the observed memory usage at training time? If so, there’s additional storage for gradients etc.
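
As a hedged back-of-the-envelope check of this point: if training keeps one gradient buffer per weight in addition to the weight itself and the two Adam states, each parameter costs four float32 values. The sketch below assumes exactly that (the buffer names are illustrative, biases are ignored, 240k is taken as 240,000, and 1G means 10^9 bytes); what MXNet actually allocates may differ.

```python
BYTES_PER_FLOAT = 4

# Weights of the three fully connected layers from the original post, biases ignored.
n_params = 240_000 * 1024 + 1024 * 512 + 512 * 240_000

# Assumed float32 buffers kept per parameter while training with Adam.
copies_per_param = {
    "weight": 1,
    "gradient": 1,
    "adam_mean": 1,
    "adam_variance": 1,
}

training_bytes = n_params * BYTES_PER_FLOAT * sum(copies_per_param.values())
training_gb = training_bytes / 1e9
print(f"parameters: {n_params:,}")
print(f"estimated training footprint: {training_gb:.2f} GB")
```

With these assumptions the estimate comes to about 5.9G, which is close to the 6G observed at runtime.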

mg0880gm · October 3, 2017, 9:34pm

Yes, the GPU memory remains constant during all batches and epochs.

madjam · October 3, 2017, 11:18pm

You will also need space to cache the activation values from the forward pass, which are used during the backward pass to compute gradients.
It is possible to reduce memory usage at the cost of doing additional forward passes (to recompute those activation values). Mu has a slide about this here: https://mli.github.io/cvpr17/gluon_part2.pdf (slide 14)
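
To get a feel for how much the cached activations matter for this particular network, here is a rough sketch. It assumes one float32 activation buffer per layer for a batch of 128, with the layer widths from the original post (240k taken as 240,000, 1G as 10^9 bytes); the function name is made up for illustration.

```python
BYTES_PER_FLOAT = 4
BATCH_SIZE = 128
LAYER_SIZES = [240_000, 1024, 512, 240_000]  # input, hidden 1, hidden 2, output


def activation_gb(batch, width):
    """Memory in GB for one float32 activation buffer of shape (batch, width)."""
    return batch * width * BYTES_PER_FLOAT / 1e9


per_layer_gb = [activation_gb(BATCH_SIZE, w) for w in LAYER_SIZES]
all_cached_gb = sum(per_layer_gb)

# Only the hidden activations could be dropped and recomputed in the backward
# pass; the input and output batches have to exist in any case.
hidden_gb = sum(per_layer_gb[1:-1])

for width, gb in zip(LAYER_SIZES, per_layer_gb):
    print(f"width {width}: {gb:.4f} GB")
print(f"all activations:   {all_cached_gb:.4f} GB")
print(f"hidden layers only: {hidden_gb:.4f} GB")
```

Under these assumptions the activations add up to roughly 0.25G, almost all of it in the input and output batches, so recomputation would probably save little for a network this shallow; it matters more for deep networks with many large intermediate layers. MXNet's documentation lists an environment variable, `MXNET_BACKWARD_DO_MIRROR`, for this kind of trade-off, though whether it helps in this case is untested here.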

smolix · October 4, 2017, 7:30pm

Here’s what you might want to do instead (as @piiswrong already mentioned, the memory footprint is due to the embedding matrix): use an explicit embedding only for the head of the distribution and decode the tail as you go. At least, if you have a language dataset you might be able to address this by using a character-LSTM for the tail. That said, this is a much more complex model.
