We have several CNNs deployed with MXNet Model Server, with 3 to 4 workers each holding its own instance of the network. The network takes a fixed-size input with a batch size of 1.
We are using CUDA 9.2 with mxnet-cu92==1.5.0.
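For concreteness, here is a minimal sketch of what each worker effectively does (the checkpoint prefix and input shape are placeholders; the real code runs inside MXNet Model Server's handler):

```python
import mxnet as mx

ctx = mx.gpu(0)

# Load a checkpointed symbol + params ("model" / epoch 0 are placeholder names).
sym, arg_params, aux_params = mx.model.load_checkpoint('model', 0)

# Bind once with a fixed input shape and batch size 1, matching our deployment.
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
mod.set_params(arg_params, aux_params, allow_missing=True)

def infer(img_nd):
    # img_nd: an mx.nd.NDArray of shape (1, 3, 224, 224)
    mod.forward(mx.io.DataBatch([img_nd.as_in_context(ctx)]), is_train=False)
    return mod.get_outputs()[0].asnumpy()
```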
The behavior we see is:
- On the first calls, autotune runs, memory spikes, and usage then settles at a roughly predictable level (75% in this example).
- As the network is used, memory usage grows a little at a time.
- Eventually memory usage hits 100% and we start seeing cudaMalloc errors.
- Memory is not released, and the server needs to be rebooted. (A sketch of sampling this usage number follows this list.)
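The percentages above come from external GPU metrics; for reference, roughly the same number can be sampled from inside a worker, assuming mx.context.gpu_memory_info is available in this MXNet build:

```python
import mxnet as mx

def gpu_used_fraction(device_id=0):
    # gpu_memory_info returns (free, total) bytes for the given GPU.
    free, total = mx.context.gpu_memory_info(device_id)
    return 1.0 - float(free) / float(total)

print('GPU 0 used: {:.1%}'.format(gpu_used_fraction(0)))
```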
The image above shows GPU memory metrics from a recent test run. Green and blue show the model's memory usage growing over time. Red is the same model with MXNET_GPU_MEM_POOL_RESERVE set to 25. (Orange can be ignored; it was a server I took down shortly after it came up.)
We can work around this by:
- Lowering the number of workers
- Setting MXNET_GPU_MEM_POOL_RESERVE to a large enough buffer that 100% is never reached (sketched below)
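A minimal sketch of the second work-around, assuming the variable is set before MXNet allocates any GPU memory (in practice we set it in the environment that launches MXNet Model Server):

```python
import os

# Keep 25% of GPU memory out of MXNet's pooled allocator so usage never
# reaches 100% (25 is the value used in the red run above).
os.environ['MXNET_GPU_MEM_POOL_RESERVE'] = '25'

import mxnet as mx  # imported after setting the variable so the pool picks it up
```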
1. What causes the memory to grow over time? My team expected memory usage to stay constant after autotuning finishes, especially with fixed-size inputs.
2. How should we go about recovering from cudaMalloc errors?