MXNet crashing, likely memory corruption


#1

We’re running Scala MXNet (v0.11.0) on Spark+EMR. Training completes successfully, including a smaller-scale prediction stage (over the test set). When doing large-scale prediction, it breaks down.

Memory appears to get corrupted somehow: MXNet starts failing simple checks such as the device-type check, and we also see other checks fail non-deterministically.

For example:

    17/10/16 15:07:23 WARN TaskSetManager: Lost task 13.0 in stage 26.0 (TID 5335, ip-10-0-142-94.ec2.internal, executor 15): ml.dmlc.mxnet.MXNetError: [15:07:23] /opt/<mxnet-0.11.0>/AL2012/generic-flavor/src/include/mxnet/././tensor_blob.h:272: Check failed: Device::kDevMask == this->dev_mask() TBlob.get: device type do not match specified type
 
    Stack trace returned 8 entries:
    [bt] (0) /mnt/yarn/usercache/hadoop/appcache/application_1508165168179_0001/container_1508165168179_0001_01_000016/tmp/mxnet3990381783401778979/mxnet-scala(_ZN4dmlc15LogMessageFatalD2Ev+0x26) [0x7f839bce0c76]
    [bt] (1) /home/hadoop/additional-libraries/libmxnet.so(_ZNK5mxnet5TBlob14get_with_shapeIN7mshadow3cpuELi1EfEENS2_6TensorIT_XT0_ET1_EERKNS2_5ShapeIXT0_EEEPNS2_6StreamIS5_EE+0x80) [0x7f83857e78b0]
    [bt] (2) /home/hadoop/additional-libraries/libmxnet.so(_ZNK5mxnet5TBlob8FlatTo1DIN7mshadow3cpuEfEENS2_6TensorIT_Li1ET0_EEPNS2_6StreamIS5_EE+0x18d) [0x7f8385b04d5d]
    [bt] (3) /home/hadoop/additional-libraries/libmxnet.so(_ZN5mxnet7ndarray4CopyIN7mshadow3cpuES3_EEvRKNS_5TBlobEPS4_NS_7ContextES8_NS_10RunContextE+0x38f2) [0x7f8386169032]
    [bt] (4) /home/hadoop/additional-libraries/libmxnet.so(_ZNK5mxnet7NDArray13SyncCopyToCPUEPvm+0x5ea) [0x7f838618c95a]
    [bt] (5) /home/hadoop/additional-libraries/libmxnet.so(MXNDArraySyncCopyToCPU+0xa) [0x7f838611901a]
    [bt] (6) /mnt/yarn/usercache/hadoop/appcache/application_1508165168179_0001/container_1508165168179_0001_01_000016/tmp/mxnet3990381783401778979/mxnet-scala(Java_ml_dmlc_mxnet_LibInfo_mxNDArraySyncCopyToCPU+0x33) [0x7f839bcd77b3]
    [bt] (7) [0x7f8c1f00f08a]
 
                at ml.dmlc.mxnet.Base$.checkCall(Base.scala:131)
                at ml.dmlc.mxnet.NDArray.internal(NDArray.scala:940)
                at ml.dmlc.mxnet.NDArray.toArray(NDArray.scala:933)
                at com.example.project.mxnet.ModuleUtils$$anonfun$predict$4.apply(ModuleUtils.scala:31)
                at com.example.project.mxnet.ModuleUtils$$anonfun$predict$4.apply(ModuleUtils.scala:17)
                at com.example.project.mxnet.MXNetThread$$anon$4.call(MXNetThread.scala:46)
                at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                at java.lang.Thread.run(Thread.java:748)

This is something we previously observed when we were calling MXNet from multiple threads. Fixing the threading issues (by only calling MXNet from a single thread) got rid of the issue when running with a simple architecture, but the issue returns when experimenting with deeper networks.

Any ideas? (We’ll continue to investigate.)

Simon


Fixing thread safety issues in Scala library
#2

Do you have sufficient memory on this instance?
How big (in terms of memory footprint) is the network you are running predictions with?
Did you say this particular model was trained successfully on this machine?


#3

> Do you have sufficient memory on this instance?
> How big (in terms of memory footprint) is the network you are running predictions with?

Yes, we never exceed 50% of the available memory on the host. Here’s a graph of the memory usage. First error occurs at the blue line.

> Did you say this particular model was trained successfully on this machine?

Yes, the model was trained on the same hardware. It also successfully ran predictions over the test set. The main difference during inference is the stress we’re putting on it (many more inference batches, and we’re running a number of these hosts in parallel with their own copy of MXNet).


#4

You say “This is something we previously observed when we were calling MXNet from multiple threads. Fixing the threading issues (by only calling MXNet from a single thread) got rid of the issue when running with a simple architecture”. Given that, I’d suggest the following.

Could you check the behavior with this environment variable set: MXNET_ENGINE_TYPE=NaiveEngine?

See also: https://github.com/apache/incubator-mxnet/pull/8371
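
On Spark, the variable has to reach the executor JVMs, not only the driver. One way this might be done is Spark's `spark.executorEnv.MXNET_ENGINE_TYPE=NaiveEngine` property. The sketch below is a small check that could be called from inside a task to confirm what the process actually sees; `EngineTypeCheck` and `describe` are hypothetical names, not part of MXNet or Spark.

```scala
// Hypothetical helper (not part of MXNet): reports which engine type
// the environment asks for, so it can be logged from inside an executor.
object EngineTypeCheck {
  // Name of the environment variable suggested above.
  val EngineVar = "MXNET_ENGINE_TYPE"

  // `env` defaults to the current process environment; passing a map
  // explicitly makes the function easy to exercise without a cluster.
  def describe(env: Map[String, String] = sys.env): String =
    env.get(EngineVar) match {
      case Some(value) => s"$EngineVar=$value"
      case None        => s"$EngineVar is not set (MXNet will use its default engine)"
    }
}
```

Logging `EngineTypeCheck.describe()` at the start of a prediction task shows whether the setting was propagated to that executor.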


#5

We’ve since managed to find the root cause of these issues. The problem is with MXNet doing cleanup in a way that’s not thread-safe.

Quoting @calum:

Since Scala/Java finalizers always run in a separate thread (the runtime has a dedicated thread for this), the Scala library’s use of finalize() to clean up leaked NDArrays and similar objects is unsafe. It mostly doesn’t matter when running at smaller scale or with simpler models, but once we start running prediction at scale, we have problems.

We’re working on a patch to remove the finalizers and log where the item was leaked from instead. Some of the leaks may be coming from the Scala library’s own use of these objects. It might make sense for something like our dispatcher thread* to form part of the MXNet library. Perhaps the advice to use the NaiveEngine will be enough, but it seems as though this might not be ideal for performance.

* The single thread we are using to dispatch calls to MXNet, which is necessary to prevent threading issues. This was mentioned in passing in the original post.
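
The idea of logging where an item was leaked from, instead of freeing it in a finalizer, can be illustrated with a small registry. This is only a sketch under assumptions and is not the patch discussed in this thread: `LeakTracker`, `register`, `disposed` and `leaked` are hypothetical names, and a real native wrapper would call them from its constructor and its explicit dispose method.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

// Hypothetical registry, not part of MXNet: remembers where each native
// handle was created so leaks can be reported without using finalize().
object LeakTracker {
  private val nextId = new AtomicLong(0L)
  private val live = mutable.Map.empty[Long, Throwable]

  // Call when a native object is allocated. The Throwable is never thrown;
  // it only captures the creation stack trace for later reporting.
  def register(): Long = {
    val id = nextId.incrementAndGet()
    live.synchronized { live.put(id, new Throwable(s"handle $id allocated here")) }
    id
  }

  // Call from the wrapper's explicit dispose().
  def disposed(id: Long): Unit = live.synchronized { live.remove(id); () }

  // Snapshot of handles that were registered but never disposed.
  def leaked: Map[Long, Throwable] = live.synchronized { live.toMap }
}
```

At the end of a job, iterating over `LeakTracker.leaked` and logging each stack trace shows where undisposed objects were created, and no native memory is touched from the finalizer thread.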


#6

Does the NaiveEngine run an engine per-thread, or a single (single-threaded) engine per-process? The documentation makes it seem as though it’s the latter, which would be very detrimental to performance. Is that correct?


#7

The latter: NaiveEngine is a single-threaded engine per process. This suggestion was meant as a diagnostic to understand where the problem lies, as mentioned here:

http://mxnet.incubator.apache.org/how_to/env_var.html#engine-type


#8

Ah, thanks. I misunderstood the documentation update you linked (the pull request).


#9

If we suspect memory corruption, I would recommend using a debug build of libmxnet.so, if that has not been tried already. A debug build turns on the C++ runtime library’s extensive checks, and I have found that those checks tend to fail close to where the memory corruption occurs.
Also, I will look into this as soon as I get a chance.


#10

We tracked the cause of this down to unsafe concurrency in the MXNet Scala library (specifically, Java’s finalizers).

The fix is discussed in this thread. It was eventually also merged into MXNet.

Note that this doesn’t make concurrent use of the library safe; it only gets rid of the unsafe concurrency inside the library. We are still using a dispatcher object for thread-safe access to the library. We have been running stably (even with more complicated models) at scale since this was fixed.
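
A dispatcher object of this kind can be sketched as follows. This is a minimal illustration built on the JDK's single-thread executor, not the code used in this thread and not part of MXNet; `SingleThreadDispatcher` and the thread name are hypothetical.

```scala
import java.util.concurrent.{Callable, ExecutorService, Executors, ThreadFactory}

// Hypothetical helper: funnels every call into a non-thread-safe library
// through one dedicated thread.
final class SingleThreadDispatcher(threadName: String) {
  private val executor: ExecutorService =
    Executors.newSingleThreadExecutor(new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, threadName)
        t.setDaemon(true) // do not keep the JVM alive on shutdown
        t
      }
    })

  // Runs `body` on the dispatcher thread and blocks until it finishes.
  // Exceptions thrown by `body` surface wrapped in an ExecutionException.
  // Calling run from inside another run on the same dispatcher would
  // deadlock, because the only worker thread is already busy.
  def run[A](body: => A): A =
    executor.submit(new Callable[A] { override def call(): A = body }).get()

  def shutdown(): Unit = executor.shutdown()
}
```

Callers on any thread would wrap library calls as `dispatcher.run { ... }`, so that everything, including disposal of native objects, happens on the same thread.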