Huristic/io

View Original

Fun Java Deadlock Puzzle

A fun Java concurrency problem has been seen around some corners of the internet. I'm not sure of the true origin of this problem, but I was introduced to it from a friend of a friend who saw a poster somewhere around an Amazon stand at some software engineering conference. It centers around a simple Java class ThreadNinja which uses reckless synchronization around methods and is designed to generate a deadlock. The goal is to describe a scenario in which a deadlock would occur. The puzzle part is not too puzzling, but not obvious at first glance, so I thought I'd throw in my two cents and maybe show some code that could be used to prove (and I use the term lightly) the outcome. I'll try to keep this post short but sweet.

Also, just going to reiterate, I did not come up with this problem myself, and I would love to give credit to the author. Additionally, this puzzle is just that; it's a toy program designed to demonstrate Java synchronization methods and danger of deadlocks when synchronization is not used properly. Don't yell at me for writing bad code, it's meant to be bad.

The problem

Imagine you have a Java class, defined below, which is guaranteed to result in a deadlock in certain scenarios. The original problem added a small comment block pointing out some nuances of JVM synchronization mechanism. 

For the uninitiated: Every object instance in Java has one mutex associated with it. Before a function with the synchronized keyword can be executed, the mutex must be acquired. So every synchronized function in Java is like a critical section, protected by the object instance's mutex.

See this content in the original post

The goal 

Describe a scenario in which this code has a deadlock.

Solution

The way this class is written is confusing, and it's done so on purpose to mislead you and hide the offending resource lock contention code. Thus, I'm not going to attempt to describe what this code does using an even more ambiguous language (English). The best way to understand it is to just read the code and trace it; take a moment for the brain to stop hurting. Now that we have read and understood the logic, how do we attempt to solve the puzzle?  Well, if you think about it, this is a classic example of a deadlock. Check out the Wikipedia definition of a deadlock, it's pretty straight forward. There are several conditions defined as required to generate a deadlock:

A deadlock situation on a resource can arise if and only if all of the following conditions hold simultaneously in a system:[5]

  1. Mutual exclusion: At least one resource must be held in a non-shareable mode. otherwise, the processes would not be prevented from using the resource when necessary. Only one process can use the resource at any given instant of time.[6]
  2. Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
  3. No preemption: a resource can be released only voluntarily by the process holding it.
  4. Circular wait: each process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource. In general, there is a set of waiting processes, P = {P1, P2, …, PN}, such that P1 is waiting for a resource held by P2, P2 is waiting for a resource held by P3 and so on until PN is waiting for a resource held by P1.[3]

One side note is, it's important to recognize that when a method is synchronized, the synchronization is being done on the intrinsic lock of the specific instance of the class not on the class itself. If these methods were static, then the synchronization would be on the class object thus any thread calling the static method would be locking the method application wide. In our case, if we have n ThreadNinja instances we can have n threads calling the fight() method at the same time. 

So what this boils down to is: we need at least 2 threads/processes and we need at least two contended resources. In the ThreadNinja class we actually have 3 synchronized blocks, and I'll argue that the getRank() method in Sword class is the most important piece to this puzzle. This class and instance is private to an instance of ThreadNinja so at first glance it may appear that no external thread can get or would attempt to get a lock on this object; this is a wrong observation. I will admit, on first look I thought exactly the same thing, but when tracing the execution path it became clear this is the key to the whole puzzle. The problem is the order of operations, let me explain. We have two locks: one upon entering the fight() method and one upon entering the canDefendAgainst() method. The key here is the fight method is calling the canDefentAgainst() method on the opponent (a different instance of ThreadNinja). That's where the deadlock can occur. You have two mutually exclusive resources via the two independent instances of ThreadNinja. Referencing the requirements for a theoretical deadlock, that's the hold and wait, in other words blocking, synchronization operation being performed on the two resources. Java synchronized blocks are re-entrant, however, we have two mutually exclusive objects so a thread calling the fight() method on instance 1 and then calling canDefendAgainst() on instance 2 (opponent) needs to get two locks on each instance respectively. 

We also meet the no preemption requirement because the lock can only be released by the lock holding thread voluntarily via exiting the synchronized scope. While the thread remains in the synchronized block no other thread can release it's lock.

Finally, the circular wait requirement is also met. Assume for a moment we have two instances of ThreadNinja n1 and n2 and we have two threads t1 and t2. If within t1 we call n1.fight(n2) and within t2 we call n2.fight(n1) we create the circular wait. Since t1 can get into n1.fight() it will attempt to call n2.canDefendAgainst(n1) and t2 can get into n2.fight and will attempt to call n1.canDefentAgainst(n2). So t1 has lock on n1.fight() and needs a lock on n2.canDefendAgainst() while (here's the key) also grabs a lock on n1.sword.getRank(). While t2 has a lock on n2.fight() and needs a lock on n1.canDefendAgainst() but can't get it because it's locked by t2. But t2 has lock on n2.sword.getRank() so t1 can't get lock on  n2.canDefendAgainst(). This is confusing, and hard to wrap one's brain around.. I love it!!

Proof?

I'm not a mathematician so I'm not even going to attempt any kind of formal mathematical proof here. What I am is a software engineer so my proof consist of writing a program which can detect deadlocks, run it and see a deadlock manifest itself. First, we need a driver program to create the deadlock. So given my confusing explanation above it's as simple as t1 -> n1.fight(n2) and t2 -> n2.fight(n1)

See this content in the original post

Since the deadlock depends on correct timing, we need to somehow ensure that the threads get to the correct point at the exactly the right time. I could use some sort of CountDownLatch or CyclicBarrier to lineup the threads in the perfect way. However, I believe that simpler is better, so I'm going to rely on the power of entropy. My hypothesis is running the fight() sequence in two threads in a while loop should eventually cause the threads to call just the right method at just the right time to cause the deadlock, due to the unpredictable nature of the JVM and underlying OS scheduling mechanisms. This could take a long time, or it could happen very quickly. I believe that because of way the ThreadNinja class is written it's very prone to generating a deadlock very quickly. It turns out I'm right, running this code, in a matter of seconds the program stops responding. Though it's hard to tell a deadlock has occurred, because I'm not outputting anything. I did that on purpose because adding System.out calls into the threads can introduce even more entropy,  causing the deadlock to take longer to manifest. I need a way to detect a deadlock programmatically. Java's ThreadMXBean to the rescue. Checkout this nice post which very nicely explains how the ThreadMXBean works and how to use it correctly. Here's my quick implementation:

See this content in the original post

What I'm doing here is scheduling a new task to execute every 5 seconds and use ThreadMXBean's method findDeadlockedThreads() to look for deadlocked threads. Once found it invoke's a handler function which outputs the results. Here's how it's used within the driver program from above:

See this content in the original post

The deadlock handler is just a lambda function which gets a collection of threads which have been detected to be deadlocked and prints out the stacktrace frame to show where the deadlock occured. It's simple and elegant, slow (don't do this in production) and correct. 

Output

When I run this program on my Core i7 4 core CPU machine I get the following output in about 5 seconds

See this content in the original post

Thoughts:

This is as comprehensive of a proof as I'm going to come up with. It shows that the deadlock lock does occur and it shows it occurs in exactly the place I hypothesised it would occur in. If anyone thinks I totally got this wrong and completely blundered the solution I'm very eager to hear from you. In the meantime I'm going to call this one solved. 

See the all the code and a running version of the driver program here: https://github.com/dkhanaferov/thread-ninja-deadlock