Dealing with memory leaks (NDArray)

At Magnet we’re trying out MXNet via the Clojure bindings. You can read more in: Clojure MXNet for musculoskeletal disease diagnosis.

One technical issue we have found is that our training run reports memory leaks. The leaked objects are created inside the Scala API, which we don’t control, so we can’t call dispose on them.

Here is an example of a leak trace:

WARN  org.apache.mxnet.WarnIfNotDisposed: LEAK: An instance of class org.apache.mxnet.NDArray was not disposed. Creation point of this resource was:
	java.lang.Thread.getStackTrace(Thread.java:1559)
	org.apache.mxnet.WarnIfNotDisposed$class.$init$(WarnIfNotDisposed.scala:52)
	org.apache.mxnet.NDArray.<init>(NDArray.scala:549)
	org.apache.mxnet.NDArray$$anonfun$genericNDArrayFunctionInvoke$4$$anonfun$6.apply(NDArray.scala:100)
	org.apache.mxnet.NDArray$$anonfun$genericNDArrayFunctionInvoke$4$$anonfun$6.apply(NDArray.scala:100)
	scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	scala.collection.AbstractTraversable.map(Traversable.scala:104)
	org.apache.mxnet.NDArray$$anonfun$genericNDArrayFunctionInvoke$4.apply(NDArray.scala:100)
	org.apache.mxnet.NDArray$$anonfun$genericNDArrayFunctionInvoke$4.apply(NDArray.scala:99)
	scala.Option.getOrElse(Option.scala:121)
	org.apache.mxnet.NDArray$.genericNDArrayFunctionInvoke(NDArray.scala:99)
	org.apache.mxnet.NDArray$.crop(NDArray.scala:33)
	org.apache.mxnet.module.DataParallelExecutorGroup$$anonfun$loadGeneralMulti$2$$anonfun$apply$2.apply(DataParallelExecutorGroup.scala:52)
	org.apache.mxnet.module.DataParallelExecutorGroup$$anonfun$loadGeneralMulti$2$$anonfun$apply$2.apply(DataParallelExecutorGroup.scala:35)
	scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	org.apache.mxnet.module.DataParallelExecutorGroup$$anonfun$loadGeneralMulti$2.apply(DataParallelExecutorGroup.scala:35)
	org.apache.mxnet.module.DataParallelExecutorGroup$$anonfun$loadGeneralMulti$2.apply(DataParallelExecutorGroup.scala:34)
	scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	scala.collection.Iterator$class.foreach(Iterator.scala:893)
	scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
	org.apache.mxnet.module.DataParallelExecutorGroup$.loadGeneralMulti(DataParallelExecutorGroup.scala:34)
	org.apache.mxnet.module.DataParallelExecutorGroup$.org$apache$mxnet$module$DataParallelExecutorGroup$$loadData(DataParallelExecutorGroup.scala:72)
	org.apache.mxnet.module.DataParallelExecutorGroup.forward(DataParallelExecutorGroup.scala:486)
	org.apache.mxnet.module.Module.forward(Module.scala:447)
	sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	java.lang.reflect.Method.invoke(Method.java:498)
	clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:93)
	clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
	org.apache.clojure_mxnet.module$forward.invokeStatic(module.clj:186)
	org.apache.clojure_mxnet.module$forward.invoke(module.clj:179)
	org.apache.clojure_mxnet.module$forward.invokeStatic(module.clj:192)
	org.apache.clojure_mxnet.module$forward.invoke(module.clj:179)
	org.apache.clojure_mxnet.module$fit$fn__1538.invoke(module.clj:567)
	org.apache.clojure_mxnet.module$fit.invokeStatic(module.clj:564)
	org.apache.clojure_mxnet.module$fit.invoke(module.clj:535)
	xenon.train$fit_epoch.invokeStatic(train.clj:60)
	xenon.train$fit_epoch.invoke(train.clj:56)
	xenon.train$fit$iter__2325__2329$fn__2330$fn__2331.invoke(train.clj:71)
	xenon.train$fit$iter__2325__2329$fn__2330.invoke(train.clj:71)
	clojure.lang.LazySeq.sval(LazySeq.java:40)
	clojure.lang.LazySeq.seq(LazySeq.java:49)
	clojure.lang.RT.seq(RT.java:528)
	clojure.core$seq__5124.invokeStatic(core.clj:137)
	clojure.core$dorun.invokeStatic(core.clj:3125)
	clojure.core$dorun.invoke(core.clj:3125)
	xenon.train$fit.invokeStatic(train.clj:71)
	xenon.train$fit.invoke(train.clj:69)
	xenon.train$fine_tune_BANG_.invokeStatic(train.clj:80)
	xenon.train$fine_tune_BANG_.invoke(train.clj:73)
	xenon.train$fine_tune_BANG_.invokeStatic(train.clj:75)
	xenon.train$fine_tune_BANG_.invoke(train.clj:73)
	xenon.core$_main.invokeStatic(core.clj:6)
	xenon.core$_main.doInvoke(core.clj:5)

It seems that the NDArray leak also shows up in other code (Clojure or Scala). See the presentation from the virtual meetup, where the example code leaked some NDArray objects as well.

Is this related to the issues with FeedForward.scala, and will it be solved by the new auto collector?

Any hints/pointers on this topic are highly appreciated.

Iván

After looking at the issue tracker, I found this issue: [MXNET-731] Memory Management in Scala binding - ASF JIRA

Success criteria: Scala API users should not have to take special care about disposing objects created by MXNet and hence should not see memory leaks with their implementation of MXNet using Scala API binding.

Sounds promising

There is research going on on this front on the Scala side, which the Clojure package can leverage. I’m not sure of the time frame for that, though. If you need something more immediate, I’d be happy to work with you on a solution. Possible paths would be creating a macro for NDArray management in the spirit of with-open, or leveraging NDArrayCollector.
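To make the "scope" idea concrete, here is a minimal sketch in plain Java of what a with-open-style or collector-style helper does: every resource created inside a scope is tracked and disposed when the scope exits, except the value the scope returns. This is not the MXNet API. NativeBuffer, ResourceScope and withScope are hypothetical names made up for this illustration, and the real NDArrayCollector may well behave differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for an object that owns native memory (NOT the MXNet NDArray). */
final class NativeBuffer implements AutoCloseable {
    private final String name;
    private boolean disposed = false;

    NativeBuffer(String name) {
        this.name = name;
        ResourceScope.register(this); // does nothing when no scope is active
    }

    String name() { return name; }

    boolean isDisposed() { return disposed; }

    @Override
    public void close() {
        disposed = true; // a real binding would free the native handle here
    }
}

/** Hypothetical collector: tracks buffers created inside withScope and disposes them on exit. */
final class ResourceScope {
    // One stack of open scopes per thread; each entry lists the buffers created in that scope.
    private static final ThreadLocal<Deque<List<NativeBuffer>>> SCOPES =
            ThreadLocal.withInitial(ArrayDeque::new);

    private ResourceScope() {}

    static void register(NativeBuffer buffer) {
        List<NativeBuffer> current = SCOPES.get().peek();
        if (current != null) {
            current.add(buffer);
        }
    }

    /** Runs body, then disposes everything it created except the returned value. */
    static <T> T withScope(Supplier<T> body) {
        Deque<List<NativeBuffer>> stack = SCOPES.get();
        List<NativeBuffer> created = new ArrayList<>();
        stack.push(created);
        Object kept = null; // stays null if body throws, so everything is disposed
        try {
            T result = body.get();
            kept = result;
            return result;
        } finally {
            stack.pop();
            for (NativeBuffer buffer : created) {
                if (buffer != kept) {
                    buffer.close();
                }
            }
        }
    }
}

final class ScopeDemo {
    public static void main(String[] args) {
        List<NativeBuffer> temporaries = new ArrayList<>();
        NativeBuffer result = ResourceScope.withScope(() -> {
            temporaries.add(new NativeBuffer("slice-0"));
            temporaries.add(new NativeBuffer("slice-1"));
            return new NativeBuffer("output");
        });
        System.out.println(result.name() + " disposed: " + result.isDisposed());
        for (NativeBuffer t : temporaries) {
            System.out.println(t.name() + " disposed: " + t.isDisposed());
        }
    }
}
```

The point of the design is that the caller never has to hold references to the temporaries, which matters in a case like the trace above, where the leaked objects are created deep inside library code that the user cannot reach to call dispose.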

Please feel free to open a specific GitHub issue for it and we can discuss possible solutions.

I have a PR open here to add a with-resources macro that will help with management. Please take a look and let me know if you have any feedback: https://github.com/apache/incubator-mxnet/pull/12447

Looks good :+1: My comments are in the PR: https://github.com/apache/incubator-mxnet/pull/12447#issuecomment-418231600

I had pretty much the same stack trace as the top post, which made my computer run out of memory very soon after starting training. My fix to this problem was to modify the Scala source code, specifically the scala-package/core/src/main/scala/org/apache/mxnet/module/DataParallelExecutorGroup.scala with the following modifications:

This probably does not deserve a pull request because it might hide deeper issues with the NDArray management, but it made it possible for me to move forward.

Thank you all for the hard work!

@nswamy @yizhiliu could you take a look at whether the code in DataParallelExecutorGroup should be changed to have the collector scope enabled by default?

Thanks @jonasseglare for posting your workaround. I am not too familiar with the Scala bindings, but I believe collector.withScope was introduced recently, so this could be a leftover that should be fixed in the package.