Fixing thread safety issues in Scala library

calum · November 7, 2017, 1:49pm

As mentioned in a previous discussion, we are users of the MXNet Scala library, and we discovered that there’s a couple of threading-related issues in the MXNet Scala library.

We plan to fix this within the Scala library and contribute this fix back to the project. There’s a few ways we could fix this, some more invasive than the others. We’re looking for opinions on the best path forward, and were told this site was the best place to solicit such feedback.

Issues

The issue is that the non-naive MXNet engines are unsafe to call from more than one thread for the duration of program execution, rather than concurrently. This means that once MXNet is used from a given thread, it must never be called from any other thread. Calling from any other thread causes memory corruption which causes various types of failure when running at scale.

There’s two primary issues here:

Issue 1: Usability

Since many JVM environments are multi-threaded, it’s difficult to use MXNet safely. The usual approaches for taming a non-threadsafe library, such as thread-local instances or global mutexes, are not effective as MXNet requires all calls to the engine to be run (for the duration of process lifetime) from the same thread.

This means that any user wanting to use the Scala library within a “normal” JVM context, such as a server environment or within a processing system like Spark, needs to do extra work to ensure that only a single thread interacts with the MXNet library. The ways in which MXNet fails are surprising (and severe) when these issues occur, and can take significant work to diagnose.

Issue 2: Finalization of native-backed resources

Like other libraries which manage native-backed resources, MXNet uses the Java finalize() method (finalizers), which are run when an object is garbage-collected, to clean up these resources for items which are not cleaned up explicitly by the library user. An example of this is NDArray objects where dispose() is not explicitly called, in which case the finalizer disposes the native array on the coder’s behalf.

Although this is convenient, this also causes the memory corruption issue mentioned above, because finalizers always run on the special “Finalizer” thread. This is never the thread on which other MXNet work is performed, and as such always has the potential to corrupt memory. Our experiments indicate that running at reasonable scale this will happen with certainty.

As such the way in which finalizers are used in MXNet is unsafe. It, at least, needs to change.

Proposed fixes

We have two main proposals. The first is uninvasive but less convenient, the second will make the library easier to use but will lead to a lack of fine control by library users.

We also note a more radical change, for discussion’s sake.

Fix 1: Change finalizers to log leaks but not clear them up

This is effectively a minimal fix which will make the library usable. We propose that we:

Remove the “dispose” functionality from the finalizers in the library
Replace it with functionality which will log a warning that a leak has occurred
Add a system property mxnet.traceLeakedResources which will trace the creation of all “leakable” resources, and add this to the warning message (this is controlled by a switch as the runtime cost of collecting this information may be high)

This allows users to understand where leaks originate and fix them, but doesn’t change the execution model of the library at all.

Since several of the leaks we’ve identified occur in the Scala library itself, we would use this functionality in order to trace and fix leaks in code we currently use (we have already done this well enough that we can run predictions at scale).

Fix 2: Implement a “dispatcher” thread in the Scala library

In our own use of MXNet, we’ve found it’s easiest to ensure that all use of MXNet occurs on one thread by mediating all use of MXNet through a singleton “dispatcher” thread in our codebase. This involves a library which starts a daemon thread, and then adding an object with an interface like the following:

object MXNetThread {
  /**
    * Runs a task on the MXNet thread.
    *
    * If the MXNet thread is the current thread, task is run immediately. Otherwise the task is passed to the MXNet thread and the current thread
    * blocks until its return.
    */
  def run[T]( task: => T ): T = ???
}

This allows one to dispatch MXNet work like this:

val result = MXNetThread.run {
  // Do some work using MXNet
}

def someMethod(someArgs: String): String = MXNetThread.run {
  // Do some work with MXNet
}

This proposal would be to push this work into the Scala library itself, and force all uses of the MXNet native library to go through this thread. This would make the library safe to use in all configurations without additional work from the users of the library. This would also allow us to rewrite the finalizers to dispose objects safely, as they can defer this disposal to the MXNet thread.

The downsides are:

Forces all users of the library to use the thread. This shouldn’t be necessary when using the NaiveEngine, and others might want to implement their own ways of handling the threading issues with MXNet which fit better into their containers and so on.
Causes possibly-surprising blocking when waiting for availability of the thread. A simple implementation won’t make this easily measurable.
Very invasive to the library in general, as this needs to be implemented everywhere that the MXNet native library is used.

Sidenote: The “Nuclear” option

There’s a final more dramatic option, where the Scala library is pared down to only support prediction (developments like Gluon indicate that Scala isn’t likely to support the state-of-the-art in model construction and training in the future anyway). This would allow Fix 2 to be implemented more easily, and provide a “leaner” Scala library that could be used just to load in models created externally.

Request for comment

We’re definitely going to at least implement the first fix, but we’d like to know whether or not there’s any appetite for the second (or possibly even the third). We’d also be really interested to hear if anyone has any other ideas about how to solve this for users of the Scala library.

Thanks all!

eric-haibin-lin · November 17, 2017, 11:28pm

Is the fix only for scala? How does the fix impact the cpp backend, or other language bindings?

calum · November 20, 2017, 8:26am

Hi!

The suggestions here are to do nothing with the other language APIs. It may be that the way the C++ API behaves is more idiomatic in C++, so is fine. Similarly it’s a little less common to have a multi-threaded Python setup (though I doubt it’s uncommon). When we checked the comments in the MXNet GitHub, it appeared that the development team were happy with MXNet’s execution model, so we plan to just make that existing model more usable from Scala.

The approach of the dispatcher thread should work equally well in Python, if one was in a multi-threaded environment. To the best of my knowledge the Python API does not share the issue with finalizers but I don’t have a lot of experience with it.

Does that all make sense?

yizhiliu · November 22, 2017, 2:19am

Hi Calum,

I’m excited to see the discussion. You find the very critical issue. I think we shall no longer rely on the finalizer since it is not guaranteed to be run. Moreover, when it runs, usually it has been too late. Take NDArrays as an example, they take little memory in JVM, GC probably will not touch them for a long time, while the off-heap memory increases.

I think the ‘Fix 1’ makes sense. For the reason I listed, I don’t like ‘Fix 2’.

And I do want to keep Scala package focusing on inference/prediction since it meets most cases.

calum · November 22, 2017, 8:29am

Thanks, it’s really great to have your feedback!

Moreover, when it runs, usually it has been too late. Take NDArrays as an example, they take little memory in JVM, GC probably will not touch them for a long time, while the off-heap memory increases.

I agree. We have found this is the case with other frameworks which rely on GC for managing native-backed resources. With them you usually end up having to force GC with System.gc() anyway.

For the reason I listed, I don’t like ‘Fix 2’.

I’m not sure what reason you mean here, could you elaborate? I do think it’s a significant change to the library, so did anticipate this being less palatable.

Thanks again!

yizhiliu · November 23, 2017, 9:38pm

I think if we decide not to rely on finlaizer, there’s no need to force users to run it in the ‘dispatcher’ thread?

calum · November 24, 2017, 8:37am

It’s true that we do not need to force users to use a dispatcher thread, but in effect unless their code is naturally single-threaded, they will be forced to do so. While we’ve focused on the finalizer here, the bigger issue is that MXNet is not safe to run in more than one thread for the duration of program execution (regardless of where this happens). It corrupts its memory and fails. This is more than just needing to surround usages with “synchronized” or similar, since those will allow the code to run on different threads, just not at the same time. This isn’t enough for MXNet.

The reason the finalizers are particularly problematic is that the caller doesn’t get to control what thread they run on. Once they’re gone it is possible (but difficult) to use MXNet in a safe way from multi-threaded code (I believe the easiest thing to do is to implement a dispatcher thread, similar to the one we’ve described, and carefully make sure all uses of MXNet in the calling code are dispatched from there).

You’re right that a dispatcher thread is not required for naturally single-threaded calling code (which never spawns or uses other threads which might refer to MXNet), I think my argument is that this is relatively rare in the JVM world. I appreciate that by implementing a dispatcher in the library, we might be preventing someone with a special case from implementing their case though, and we might be putting a (small) performance overhead on users who don’t need the thread.

Does that all make sense?

yizhiliu · November 24, 2017, 8:04pm

Well, I agree that it’ll be useful to provide a dispatcher, it prevents users from running into an unexpected situation. While I think, once we remove dispose from finalizer, this fix becomes less important. It is more like a tool than a fix.

I also agree that we should dive in to see how we can fix the multi-thread thing in the backend engine.

calum · November 30, 2017, 10:01am

We have submitted a version of “Fix 1” as a pull request.

nswamy · January 8, 2018, 6:39pm

Hi Calum,

I am wondering how did you find that “MXNet requires all calls to the engine to be run (for the duration of process lifetime) from the same thread.”

This would be very weird and I am not aware that MXNet has this requirement, that would have been quite a big restriction.
Just wondering if this could just be specific to how Scala Interface is developed or if you are hitting some bug thats causing this behavior.

Do you have some code that with which we can reproduce this behavior ?

I am guessing the problem could be with concurrency i.e., when you run the same executor from multiple threads at the same time

Thanks, Naveen

calum · January 9, 2018, 8:38am

I am wondering how did you find that “MXNet requires all calls to the engine to be run (for the duration of process lifetime) from the same thread.”

We observed the library corrupting memory and after some investigation decided this was the cause. We confirmed this with MXNet developers. The best public reference for this is an issue raised by another group where the response was:

The engine is not thread safe so there is no way to use multiple threads for pushing computation to the engine.

This confirmed our suspicion. When we moved on to preventing the system from calling from multiple threads this improved our reliability, and when we removed the finalizers it removed all of our crashes.

Do you have some code that with which we can reproduce this behavior ?

It’s quite unreliable due to the nature of the bug. What we were doing was running a very large number of predictions (tens of millions) on multiple threads using a relatively complex model (a more complex model seems to increase the likelihood of problems). We also saw problems with a feedforward network but less frequently. If you do not attempt to single-thread predictions, you see these issues almost every time you call at scale, though. We were only using CPU prediction and we do not know if GPU is affected.

Since all of this, a new CPUShared device type was added which we think might be intended to make these calls safer, but we haven’t bottomed that out.

I am guessing the problem could be with concurrency i.e., when you run the same executor from multiple threads at the same time

I’m afraid we don’t know enough about the internal structure of the C++ library is to know what the cause might be, but the problem was reproducible when every thread has its own Module instance.

Just wondering if this could just be specific to how Scala Interface is developed or if you are hitting some bug thats causing this behavior.

There is a possibility that there’s something that the Scala library is doing, but we didn’t see anything specific that jumped out at us. We do lack deep understanding of the underlying library and what guarantees it provides, so we might be missing something. The fact that the issue we referenced earlier was actually about the Python interface not being thread-safe due to the underlying library made this seem less likely to us, however.

Hope this helps!

nswamy · January 9, 2018, 11:31pm

I was referring to executor as a module. I re-opened the issue to ascertain if MXNet interaction should always happen from the same thread. I understand that you cannot call the same module concurrently from different threads but i am not sure if MXNet interaction has to happen always from the same thread.

calum · January 10, 2018, 8:42am

We were using a separate copy of the Module per-thread for a while and it did not help, but it’s possible there have been changes since we did this which make this work better.

Also it’s worth noting that the use from different threads does not need to be concurrent. Strictly non-overlapping calls from different threads also cause problems.

Topic		Replies	Views
Inference on Scala for Models trained on v1.0 is very slow Discussion	1	536	June 21, 2018
MXNet crashing, likely memory corruption Performance	9	2983	January 10, 2018
Thread-safety in C++ API Discussion cpp , performance	2	744	March 15, 2019
Error memory leak in multi thread use Threading library in python, when predict from same a model for multi thread Discussion	2	1159	September 5, 2019
Multi-Threaded Inference Question	1	1011	July 4, 2019