Memory usage grows over time and causes cudaMalloc errors

We have some CNNs deployed with MXNet Model Server, with 3 to 4 workers, each holding an instance of the network. The network takes a uniform-sized input with batch size 1.

We are using CUDA 9.2 with mxnet-cu92==1.5.0.

The behavior we see is:

  1. On the first calls, cuDNN autotune runs; memory spikes and then settles at a roughly predictable level (75% in this example).
  2. As the network is used, memory usage grows a little bit at a time.
  3. Eventually memory hits 100% and we start seeing cudaMalloc errors.
  4. The memory is not released and the server needs to be rebooted.


The image above shows some GPU memory metrics from a recent test run. Green and blue show the model's memory usage growing over time. Red is the same model with MXNET_GPU_MEM_POOL_RESERVE set to 25. (Orange can be ignored; it was a server I took down shortly after it came up.)

We can work around this by:

  • Lowering the number of workers
  • Setting MXNET_GPU_MEM_POOL_RESERVE to a high enough buffer that 100% is never reached (sketched after this list).
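For reference, a minimal sketch of how the reserve can be applied, assuming it is set in Python before mxnet is imported (exporting the variable in the environment that launches the workers works the same way):

```python
import os

# Keep 25% of GPU memory out of MXNet's memory pool, matching the red run
# in the chart above. Must be set before mxnet is imported in each worker.
os.environ["MXNET_GPU_MEM_POOL_RESERVE"] = "25"

import mxnet as mx  # import only after the variable is set
```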

1. What causes the memory to grow over time? My team expected memory usage to be constant once autotuning finishes, especially with a single input size.

2. How should we go about recovering from cudaMalloc errors?

I have seen similar behaviour when I used images of different dimensions with the same model. For example, if you feed a 512x512 image and then a 256x256 one, it forces autotune to run again. I remember that passing images of the same size helped me. I am not sure if it is the same in your case, though…

You can also try switching off autotune; it requires a significant amount of GPU memory to complete. You can set MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to do that.
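A minimal sketch of what I mean, assuming the variable is set in Python before mxnet is imported (exporting it in the shell that launches the workers works just as well):

```python
import os

# Disable cuDNN autotune so it never runs (and never grabs the extra
# workspace memory it needs), at the cost of possibly slower convolutions.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

import mxnet as mx  # import only after the variable is set
```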

Thanks for the reply, Sergey.

We are feeding only a single size into the net, so autotune only happens on the first pass through.

We’ve tried turning autotune off before, but it results in much longer processing times.

Well, I can only guess and offer some tips that may or may not work:

Do you hybridize the model? If so, try setting net.hybridize(static_alloc=True, static_shape=True), as it changes the way MXNet handles memory.
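A rough sketch of what I mean, using a model-zoo network purely as a stand-in for your CNN:

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0)
net = vision.resnet18_v1(pretrained=True, ctx=ctx)  # stand-in for your CNN

# static_alloc reuses one fixed set of internal buffers across calls;
# static_shape additionally assumes the input shape never changes,
# which matches your fixed input size and batch size 1.
net.hybridize(static_alloc=True, static_shape=True)

x = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=ctx)
out = net(x)
out.wait_to_read()  # force the forward pass to actually run
```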

It could be that you are feeding data into the model faster than it can be processed, so the queued inputs alone take up too much memory. You can limit the number of items processed simultaneously.
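For example (purely illustrative; the `predict` wrapper and the limit of 2 are made up), you could guard the forward pass so only a couple of requests hold GPU buffers at any time:

```python
import threading

MAX_IN_FLIGHT = 2  # made-up limit; tune for your GPU
_gpu_slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def predict(net, x):
    # Only MAX_IN_FLIGHT requests may allocate GPU buffers at the same time.
    with _gpu_slots:
        out = net(x)
        # asnumpy() blocks until the forward pass has finished and copies the
        # result off the GPU, so its buffers can be reused by the next caller.
        return out.asnumpy()
```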

And, generally speaking, it is not recommended to share a GPU between multiple concurrent processes, as in your case, where you have a few workers with separate model instances. If possible, instead of having multiple instances, have one, but do batched data processing: combine multiple images into one batch and run the forward pass once. It won’t be as close to real time as the multiple-instance option, but it certainly would be more stable.
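A sketch of the batched path (assuming the images are already preprocessed to the same CxHxW shape; `predict_batch` is just an illustrative name):

```python
import mxnet as mx

def predict_batch(net, images, ctx=mx.gpu(0)):
    # images: list of preprocessed CxHxW NDArrays of identical shape
    batch = mx.nd.stack(*images).as_in_context(ctx)  # shape (N, C, H, W)
    out = net(batch)
    return out.asnumpy()  # one result row per input image
```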

Do you see anything suspicious happening on the CPU/RAM side? Memory leaks are often correlated between CPU and GPU: the CPU requests a lot of RAM and then transfers that data to the GPU. MXNet is compiled with tcmalloc support, so it can help you detect problems on the RAM side, which hopefully will be related to the problems on the GPU side. You can use tcmalloc from gperftools: https://github.com/gperftools/gperftools

Thanks for the suggestions, I will test out static_alloc and static_shape. We won’t be able to switch to a single instance in the short term.

I have not seen anything suspicious on the CPU/RAM side, but I will investigate that as well.

With static_alloc and static_shape set to True, memory still generally increases over time under load.

I see. Does it still fail with cudaMalloc at some point, or is the memory collected after some time?

Not in this particular case, but some of our models/configurations that start out with high memory usage (80 to 90%) grow to 100%; then the cudaMalloc error crops up and the memory is never collected.

Note that with a high MXNET_GPU_MEM_POOL_RESERVE value, the memory is collected if usage goes past the reserved amount.

I ran a test with the memory pool turned off (MXNET_GPU_MEM_POOL_TYPE = Unpooled). Memory goes up and down, but the baseline is stable. The highest peaks show a 10% increase.

[image: gpumem_unpooled]