GPU memory garbage collection


#1

I train different networks sequentially, and I like to rely on the GPU free memory value to dynamically compute some heuristics, such as the batch size. However, the GPU memory is not released (even after delete, gc.collect(), nvidia-smi, pycuda.tools.clear_context_caches()…); it is kept and re-used by mxnet for efficiency, which prevents measuring the real free memory.

Is there a known way to explicitly reclaim unused GPU memory?
Or any alternate idea?

This need has come up a few times in the past, but I haven’t found any solution since (https://github.com/apache/incubator-mxnet/issues/1946, https://github.com/apache/incubator-mxnet/issues/2827).

Many thanks,
AL


#2

Hi AL, I’m not really sure how you would go about doing this because, like you said, mxnet GPU memory deallocation is asynchronous. It looks like this merged PR, https://github.com/apache/incubator-mxnet/pull/2927/files, attempted to address some of those issues. You can try playing around with some of the environment variables listed here: https://github.com/apache/incubator-mxnet/blob/master/docs/faq/env_var.md#memory-options, particularly MXNET_GPU_MEM_POOL_RESERVE.
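
If you want to experiment with that variable, here is a minimal sketch of one way to do it, not a confirmed fix. As I read the linked page, MXNET_GPU_MEM_POOL_RESERVE is a percentage of GPU memory that the memory pool leaves unused. The sketch assumes the variable has to be in the environment before mxnet starts, so it builds the environment for a child process instead of changing the current one. The helper names and the `train.py` script in the usage note are hypothetical.

```python
import os
import subprocess


def mxnet_env(reserve_percent, base_env=None):
    """Return a copy of the environment with MXNET_GPU_MEM_POOL_RESERVE set.

    reserve_percent is a whole-number percentage (0-100); the value to use
    is something to tune by experiment, not a recommendation.
    """
    if not 0 <= reserve_percent <= 100:
        raise ValueError("reserve_percent must be between 0 and 100")
    # Copy so the parent process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_GPU_MEM_POOL_RESERVE"] = str(int(reserve_percent))
    return env


def run_with_reserve(cmd, reserve_percent):
    """Run cmd (a list of arguments) in a child process with the variable set.

    Returns the exit code of the child process.
    """
    return subprocess.run(cmd, env=mxnet_env(reserve_percent)).returncode
```

For example, `run_with_reserve([sys.executable, "train.py"], 20)` would start a hypothetical training script with 20% of GPU memory kept out of the pool. Running each network in its own process has a side benefit for your use case: the driver frees a process's GPU memory when that process exits, so a free-memory reading taken between runs is not skewed by the previous network.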