Users expect applications to respond instantly. Deadlocks in Java applications cause ‘application hang’ situations that leave systems unresponsive and degrade the user experience.
This blog post explains what deadlocks are, what their consequences are, and what options exist to diagnose them.
In a subsequent blog post, we’ll explore how the eG Java Monitor helps in pinpointing deadlock root causes down to the code level.
A typical production scenario
What are Java application deadlocks?
A deadlock occurs when two or more threads form a cyclic dependency on each other as shown below.
In this illustration ‘thread 2’ is in a wait state waiting on Resource A owned by ‘thread 1’, while ‘thread 1’ is in a wait state waiting on Resource B owned by ‘thread 2’.
In this condition, neither thread can ever proceed: both ‘hang’ indefinitely without making any further progress.
This results in an “application hang” where the process is still present but the system does not respond to user requests.
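The cyclic wait described above can be reproduced in a few lines. The following is a minimal sketch, not production code: the class name `DeadlockDemo`, the lock objects, and the helper methods are all hypothetical, and the threads are marked as daemons only so that the demo JVM can still exit. Two threads take the same two monitors in opposite order, and the standard `ThreadMXBean` is then polled to confirm that the JVM reports them as deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class DeadlockDemo {

    /** Starts two daemon threads that take the same two locks in opposite order. */
    static void startDeadlockedThreads() throws InterruptedException {
        Object resourceA = new Object();
        Object resourceB = new Object();
        // Opens once each thread holds its first lock, so the cycle is guaranteed.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread thread1 = new Thread(
                () -> lockBoth(resourceA, resourceB, bothHoldFirstLock), "thread 1");
        Thread thread2 = new Thread(
                () -> lockBoth(resourceB, resourceA, bothHoldFirstLock), "thread 2");
        thread1.setDaemon(true); // daemon threads do not keep the JVM alive
        thread2.setDaemon(true);
        thread1.start();
        thread2.start();
        bothHoldFirstLock.await();
    }

    private static void lockBoth(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread owns its first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) { // blocks forever: the other thread owns 'second'
                System.out.println(Thread.currentThread().getName() + " acquired both locks");
            }
        }
    }

    /** Polls the JVM until it reports monitor-deadlocked threads or the timeout expires. */
    static long[] waitForDeadlock(long timeoutMillis) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            long[] ids = threads.findMonitorDeadlockedThreads(); // null when none found
            if (ids != null) {
                return ids;
            }
            Thread.sleep(50);
        }
        return new long[0];
    }

    public static void main(String[] args) throws InterruptedException {
        startDeadlockedThreads();
        long[] ids = waitForDeadlock(5_000);
        System.out.println("Deadlocked threads reported by the JVM: " + ids.length);
    }
}
```

In a real application the two threads would typically be request-handling threads and the locks would be buried in application or library code, which is what makes the cycle hard to spot by inspection.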
Consequences of deadlocks
1. Poor user experience
When a deadlock happens, the application may stall. A typical symptom is a “white page” in a web application: the browser continues to spin and the request eventually times out.
Often, users retry their request by clicking refresh or re-submitting a form, which compounds the problem further.
2. System undergoes exponential degradation
When threads are deadlocked, the requests they are serving never complete. In the meantime, fresh requests keep arriving in the system.
When deadlocks manifest in app servers, these fresh requests get backed up in the ‘execution queue’. The thread pool eventually hits maximum utilization, so new requests can no longer be served. This causes further, compounding degradation of the system.
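The back-up effect can be illustrated with a plain `java.util.concurrent.ThreadPoolExecutor`. This is only a sketch under stated assumptions: the class `PoolSaturationDemo` and its method are hypothetical, a latch that is never released stands in for deadlocked request threads, and real application servers use their own thread pools, queue sizes and rejection behavior.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolSaturationDemo {

    /**
     * Submits 'requests' tasks to a fixed-size pool whose workers never finish
     * (standing in for deadlocked threads) and returns how many were rejected.
     */
    static int submitWhileWorkersAreStuck(int poolSize, int queueCapacity, int requests) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), // bounded 'execution queue'
                runnable -> {
                    Thread worker = new Thread(runnable);
                    worker.setDaemon(true);
                    return worker;
                });

        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            try {
                pool.execute(() -> {
                    try {
                        neverReleased.await(); // the worker is stuck, like a deadlocked thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // all workers busy and the queue is full
            }
        }
        pool.shutdownNow(); // interrupt the stuck workers so the demo can end
        return rejected;
    }

    public static void main(String[] args) {
        int rejected = submitWhileWorkersAreStuck(2, 2, 6);
        System.out.println("Rejected requests: " + rejected);
    }
}
```

With two workers and a queue of two, the first two requests occupy the workers, the next two wait in the queue, and every request after that is turned away for as long as the workers stay stuck.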
3. Cascading impact on the entire app server cluster
In multi-tier applications, Web Servers (such as Apache or IBM HTTP Server) receive requests and forward them to Application Servers (such as WebLogic, WebSphere or JBoss) via a ‘plug-in’.
If the plug-in detects that an Application Server is unhealthy, it will “fail over” to another, healthy application server, which then has to accept a heavier load than usual, resulting in further slowness.
This may cause a cascading slowdown effect on the entire cluster.
Why are deadlocks difficult to troubleshoot in a clustered, multi-tier environment?
Application support teams are usually caught off-guard when faced with deadlocks in production environments.
Options to diagnose deadlocks
There are various options available to troubleshoot deadlock situations.
1. The naïve way: Kill the process and cross your fingers
You could kill the application server process and hope that when the application server starts again, the problem will go away.
However, restarting the app server is a temporary fix that does not resolve the root cause. The deadlock can be triggered again once the app server is back up.
2. The laborious way: Take thread dumps in your cluster of JVMs
You could take thread dumps. On UNIX, you trigger a thread dump by sending the SIGQUIT signal to the JVM process (the “kill -3” command); on Windows, you press “Ctrl-Break” in the console in which the JVM is running.
Typically, you would need to capture a series of thread dumps (for example, six thread dumps spaced 20 seconds apart) to infer any thread patterns; a single static snapshot may not suffice.
If you are running the application server as a Windows service (which is usually the case), it is a little more complicated because there is no console to send “Ctrl-Break” to. If you are running the HotSpot JVM, you could use the jps utility to find the process id and then use the jstack utility to take thread dumps. You can also use the jconsole utility to connect to the process in question.
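The same kind of information can also be collected from inside the JVM through the standard `java.lang.management.ThreadMXBean` API. The sketch below is an illustration only, not a replacement for jstack: the class `ThreadDumpSampler` and its methods are hypothetical, it must run inside the JVM being examined (or be adapted to a remote JMX connection), and `ThreadInfo.toString()` prints only the first few frames of each stack. It captures a series of dumps at a fixed interval and appends the number of deadlocked threads the JVM reports.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

class ThreadDumpSampler {

    /** Returns one textual thread dump of the current JVM plus a deadlock summary line. */
    static String dumpOnce() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only ask for lock details that this JVM is able to provide.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder dump = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            dump.append(info); // thread name, state, lock owner and a truncated stack
        }
        long[] deadlocked = synchronizers
                ? threads.findDeadlockedThreads()         // monitors and java.util.concurrent locks
                : threads.findMonitorDeadlockedThreads(); // monitors only
        dump.append("Deadlocked thread count: ")
            .append(deadlocked == null ? 0 : deadlocked.length)
            .append(System.lineSeparator());
        return dump.toString();
    }

    /** Captures 'count' dumps, pausing 'intervalMillis' between consecutive dumps. */
    static List<String> sample(int count, long intervalMillis) throws InterruptedException {
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            dumps.add(dumpOnce());
            if (i < count - 1) {
                Thread.sleep(intervalMillis);
            }
        }
        return dumps;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short interval for demonstration; the text suggests e.g. six dumps 20 seconds apart.
        for (String dump : sample(3, 1_000)) {
            System.out.println(dump);
        }
    }
}
```

Threads that appear in the same blocked state, waiting on the same lock, across every dump in the series are the ones worth investigating first.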
You would then have to forward the thread dumps to the development team and wait for them to analyze the dumps and get back to you. Depending on the size of the cluster, there could be many files to trawl through, which can take significant time.
This is not a situation you want to be in at 2 am when the business team is waiting on a quick resolution.
3. The smart way: Leverage an APM
While an APM (Application Performance Management) product cannot prevent deadlocks from happening, it can certainly provide deeper visibility into the root cause, down to the code level, when they do happen.
In the next blog post, we’ll explore how the eG Java Monitor can help provide an end-to-end perspective of the system in addition to pinpointing the root cause for deadlocks.