Calling asnumpy() on model output is really slow and consumes much more GPU memory than calling it on a newly created ndarray. Is there any way to speed up asnumpy(), or to detach the output from the graph, to get better performance, as in PyTorch?
MXNet is asynchronous, and calling asnumpy() acts as a synchronization point: the call blocks until all pending GPU computations have completed.
In other words, while you are waiting for asnumpy() to give you a NumPy array, what is actually happening in the background is that MXNet is finishing all the queued computations. That is why you get the impression that asnumpy() is slow, while it is actually just waiting for the earlier work to complete.
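To make the effect concrete, here is a small analogy in plain Python. It is not MXNet code: a standard-library thread pool stands in for the asynchronous engine, the hypothetical `slow_op` stands in for a GPU operator, and `future.result()` plays the role of asnumpy(). The names `slow_op` and `timed_enqueue_and_fetch` are made up for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_op(x, seconds=0.2):
    """Hypothetical stand-in for a GPU operator: sleeps, then returns a result."""
    time.sleep(seconds)
    return x * 2


def timed_enqueue_and_fetch(x, seconds=0.2):
    """Return (enqueue_time, fetch_time, value) for one asynchronous call."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        # Submitting only queues the work and returns right away,
        # much like calling an operator in an asynchronous engine.
        future = pool.submit(slow_op, x, seconds)
        t1 = time.perf_counter()
        # Fetching the value blocks until the queued work is done,
        # which is the role asnumpy() plays in the discussion above.
        value = future.result()
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1, value


enqueue_time, fetch_time, value = timed_enqueue_and_fetch(21)
print(f"enqueue: {enqueue_time:.4f}s, fetch: {fetch_time:.4f}s, value: {value}")
```

Almost all of the measured time lands on the fetch step, even though the real work was done by `slow_op`. In MXNet itself, you can force the wait to happen earlier with `mx.nd.waitall()` or `wait_to_read()` on the output array; if you call one of these before starting the timer around asnumpy(), the time you measure should reflect mostly the copy rather than the computation.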
To see which operators are taking a long time to execute, you can get a graph from the profiler: https://mxnet.incubator.apache.org/versions/master/tutorials/python/profiler.html