Is it possible to reuse GPU's memory when training a network?


#1

Is it possible to reuse GPU’s memory when training a network?
I am following the official instructions to build an SSD (https://gluon-cv.mxnet.io/build/examples_detection/train_ssd_voc.html#sphx-glr-build-examples-detection-train-ssd-voc-py)
When I try to train on a GPU, I find that the batch size is limited by the video memory. There are guidelines for using multiple GPUs (http://zh.gluon.ai.s3-website-us-west-2.amazonaws.com/chapter_computational-performance/multiple-gpus.html). Obviously, with enough money I could buy many GPUs, but with a cheap GPU that has little memory I can never use a big batch size. The problem with a small batch is that training may never converge. Note that the parameters of a neural network are not all used at the same time, so we could move the parameters currently in use onto the GPU and move the others out. This idea is common: games reuse GPU memory the same way, and no game loads all of its assets onto the GPU at once. I expect this strategy would slow the GPU down, but it should still be faster than using the CPU alone, and it would allow a big batch size. A forward-only sketch of the idea is below.
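For what it's worth, here is a minimal, forward-only sketch of that swapping idea in Gluon. The toy network, shapes, and batch size are invented for the example; the point is just that reset_ctx can move one layer's parameters onto the GPU right before it runs and back to host memory afterwards. The backward pass and optimizer state would need the same treatment, so this is an illustration of the concept, not a working training setup:

import mxnet as mx
from mxnet.gluon import nn

# Toy network; in practice this would be the SSD from the tutorial.
net = nn.Sequential()
net.add(nn.Dense(1024, activation='relu'),
        nn.Dense(1024, activation='relu'),
        nn.Dense(10))
net.initialize(ctx=mx.cpu())        # keep every parameter in host memory

gpu = mx.gpu(0)
x = mx.nd.random.uniform(shape=(32, 512)).as_in_context(gpu)  # dummy batch

# Forward pass that keeps only one layer's parameters on the GPU at a time.
out = x
for i in range(len(net)):
    layer = net[i]
    layer.collect_params().reset_ctx(gpu)       # move this layer's weights in
    out = layer(out)
    layer.collect_params().reset_ctx(mx.cpu())  # move them back out
print(out.shape)

The extra host-to-device copies make every batch slower, which matches the expectation above that the approach trades speed for memory.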


#2

There used to be a memonger in symbolic mode, afaik, which can be activated through MXNET_BACKWARD_DO_MIRROR=1, but I have never used it!

Please let us know if it works for you.
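For anyone who wants to try it, a minimal sketch in symbolic mode might look like the following. The tiny network and shapes are made up for the example; the only important part is that the variable is set early, before mxnet is imported, so it is in the process environment when the backend looks for it:

import os
os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'   # enable mirroring before importing mxnet

import mxnet as mx

# Throwaway symbolic network, just to exercise the executor.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=512, name='fc1')
net = mx.sym.Activation(net, act_type='relu', name='relu1')
net = mx.sym.FullyConnected(net, num_hidden=10, name='fc2')
net = mx.sym.SoftmaxOutput(net, name='softmax')

# With mirroring enabled, the executor should trade some recomputation in the
# backward pass for lower activation memory when it binds on the GPU.
exe = net.simple_bind(ctx=mx.gpu(0), data=(32, 1024))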


#3

I am sorry, but the suggested approach does not work: memory usage is the same as before. My test platform is a Windows-based p2.xlarge virtual machine. I tested two ways of setting the environment variable: one is the code below, and the other is adding the variable manually in the system settings. I am sure that I set the numerical value 1 rather than some other string. The test code is collected at https://github.com/BlueBirdHouse/MxNetUdacity. Another professor has suggested a strategy at https://stackoverflow.com/questions/51718438/is-it-possible-to-reuse-gpus-memory-when-training-a-network-with-mxnet.

import os
# os.environ only accepts strings, so '1' is how the value 1 has to be passed.
os.environ['MXNET_BACKWARD_DO_MIRROR'] = '1'

#4

I am not sure you should be worried about using small minibatches. Pure SGD updates with a single example at a time and still converges. Parallelization is worse with this approach (e.g. more wall-clock time is needed to run all the epochs), but I would not worry about convergence. A toy sketch is below.
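As a toy illustration (the data and model here are invented for the example), plain SGD with a batch size of exactly 1 in Gluon still drives the loss down; it just needs more updates per epoch:

import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon import nn

# Toy regression problem: learn y = 2x + 1 from single-example updates.
X = mx.nd.random.uniform(shape=(200, 1))
Y = 2 * X + 1 + 0.01 * mx.nd.random.normal(shape=(200, 1))

net = nn.Dense(1)
net.initialize()
loss_fn = gluon.loss.L2Loss()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

for epoch in range(5):
    for i in range(X.shape[0]):
        x, y = X[i:i + 1], Y[i:i + 1]   # a minibatch of one example
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()
        trainer.step(batch_size=1)
    print(epoch, loss.mean().asscalar())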

Take a look here - https://stats.stackexchange.com/questions/316464/how-does-batch-size-affect-convergence-of-sgd-and-why - I find the answer quite interesting; I hope you enjoy reading it as well.