Debug fine, but run OOM...HELP!


#1

Hi! I am using MXNet 1.3 with CUDA 9.2 on Ubuntu 16.04. When I debug my code in the PyCharm IDE, the training part goes well, but when I run the code directly, it reports OOM when calling xxx.asnumpy() in one of my callback functions. My questions are: 1. Why does the code behave differently in the two cases? 2. If OOM does occur, shouldn’t the error be raised when calling the forward and backward functions?


#2

PyCharm is likely using a lot more memory to display your variables. PyCharm adds variable values in the source window. I’m not surprised about this.


#4

So if debugging leads to more memory consumption, why doesn’t OOM occur in debugging mode? Moreover, I believe that debugging in PyCharm consumes memory, which has little to do with exhausting the GPU’s resources…


#5

@Mooonside - I think that the fact that xxx.asnumpy() is causing the problem should be a pretty good clue that you’re having a problem on the CPU side (after all, PyCharm doesn’t run on the GPU). Can you monitor memory in parallel, e.g. using top in a terminal to watch consumption from PyCharm and from python proper?
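As a complement to watching top, host memory can also be logged from inside the training script. The following is a minimal sketch using Python's standard `resource` module (Unix only). The helper names `peak_rss_mib` and `log_host_memory` are made up for illustration and are not part of MXNet.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / float(divisor)


def log_host_memory(tag):
    """Hypothetical helper: print the peak host memory with a label."""
    print("[%s] peak host RSS: %.1f MiB" % (tag, peak_rss_mib()))


log_host_memory("startup")
```

Calling `log_host_memory("before asnumpy")` and `log_host_memory("after asnumpy")` around the callback would show whether host memory grows there, both under the debugger and in a direct run.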


#6

Thank you for your reply! So the exact error message is:

mxnet.base.MXNetError: [12:02:14] src/storage/./pooled_storage_manager.h:119: cudaMalloc failed: out of memory

I checked what you suggested, and the CPU’s memory state is:


So I think it’s a GPU issue. I also found that even in debug mode it reports OOM occasionally, but not every time. What confuses me most now is whether xxx.asnumpy() occupies any GPU memory, and why OOM doesn’t occur when calling forward&backward().
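One possible explanation for the second question, not confirmed for the code in this thread: MXNet queues operations on an asynchronous engine, and a call such as asnumpy() blocks until the result is ready, so a failure in earlier queued work can be reported at that blocking call rather than at the call that queued it. The sketch below is a pure-Python analogy of that behavior using a thread pool. It is not MXNet code, and `forward`, `fetch`, and the 1024-byte limit are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for an asynchronous execution engine (assumption, not MXNet's engine).
_engine = ThreadPoolExecutor(max_workers=1)


def _allocate(n_bytes, limit=1024):
    # Simulated device allocation that fails past a made-up limit.
    if n_bytes > limit:
        raise MemoryError("simulated out of memory")
    return bytearray(n_bytes)


def forward(n_bytes):
    """Queue the work and return immediately, like an asynchronous call."""
    return _engine.submit(_allocate, n_bytes)


def fetch(handle):
    """Block until the queued work finishes, like a synchronizing read."""
    return handle.result()


handle = forward(4096)  # no exception here; the work is only queued
try:
    fetch(handle)       # the queued failure surfaces at the blocking call
    surfaced_at_fetch = False
except MemoryError:
    surfaced_at_fetch = True

print("error surfaced at fetch:", surfaced_at_fetch)
```

If the same mechanism applies here, the OOM would originate in forward or backward and only become visible at the first blocking read in the callback.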


#7

One more thing: if I remove all the callback functions (i.e. all the xxx.asnumpy() calls), the code runs without OOM. If I add one, running fails and debugging fails occasionally. If I add two, then both running and debugging fail…
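To compare these cases (zero, one, or two callbacks), it may help to record GPU memory at fixed points in the script. Here is a minimal sketch that shells out to `nvidia-smi` and parses its CSV output. The helper names `parse_gpu_memory` and `gpu_memory` are hypothetical, and the sketch assumes `nvidia-smi` is on the PATH and reports plain integer values in MiB.

```python
import subprocess

# nvidia-smi query for used and total memory, one CSV line per GPU, in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(text):
    """Parse lines like '1234, 11178' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Return per-GPU (used, total) in MiB, or [] if the query fails."""
    try:
        out = subprocess.check_output(QUERY).decode("utf-8")
        return parse_gpu_memory(out)
    except (OSError, subprocess.CalledProcessError, ValueError):
        # nvidia-smi missing, returned an error, or printed non-numeric fields.
        return []


print(gpu_memory())
```

Printing `gpu_memory()` before training and at the start of each callback would show how close the run is to the limit in each configuration.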