Hi,

I have a fully connected neural network with two hidden layers. The input dimension is about 240k, the hidden layers have 1024 and 512 nodes respectively, and the output dimension is also about 240k. The optimizer is Adam and the batch size is 128. The model is implemented in Python with MXNet 0.11.0.

At runtime, GPU memory consumption is about 6G, which is much larger than my theoretical estimate. Suppose each coefficient is a float and occupies 4 bytes. The weights of the input layer then need 1024 * 237k * 4 bytes = 0.9G, and the weights of the output layer (512 * 237k * 4 bytes) need about 0.45G. Even with a batch of data added, the total memory consumption should not exceed about 1.5G. What am I missing in this calculation?
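
For reference, here is the estimate above written out as a small Python sketch. It is only my back-of-the-envelope arithmetic, not a measurement: it counts the weight matrices and one input batch and nothing else. The helper name `dense_weight_bytes` is made up for this post, and I am assuming "237k" means 237000 and that "G" means GiB (1024^3 bytes).

```python
BYTES_PER_FLOAT = 4      # float32
GIB = 1024 ** 3          # assuming "G" means GiB

def dense_weight_bytes(n_in, n_out, bytes_per_value=BYTES_PER_FLOAT):
    """Bytes needed for the weight matrix of one fully connected layer."""
    return n_in * n_out * bytes_per_value

io_dim = 237000          # "237k", used for both input and output dimension
h1, h2 = 1024, 512       # hidden layer sizes
batch_size = 128

first = dense_weight_bytes(io_dim, h1)    # input -> hidden 1
hidden = dense_weight_bytes(h1, h2)       # hidden 1 -> hidden 2
last = dense_weight_bytes(h2, io_dim)     # hidden 2 -> output
batch_bytes = batch_size * io_dim * BYTES_PER_FLOAT  # one input batch

total = first + hidden + last + batch_bytes

for name, size in [("first layer", first), ("hidden layer", hidden),
                   ("output layer", last), ("input batch", batch_bytes),
                   ("total", total)]:
    print("%-13s %.2f GiB" % (name, size / float(GIB)))
```

With these assumptions the total stays just under 1.5G, which is the figure I am comparing against the roughly 6G I actually observe.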

Are there any suggestions for reducing GPU memory usage so that I can fit an even larger model? I would really appreciate any help.