Troubleshooting Java Application Deadlocks – Diagnosing ‘Application Hang’ situations


Users expect applications to respond instantly. Deadlocks in Java applications cause ‘application hang’ situations that lead to unresponsive systems and a poor user experience.

This blog post explains what deadlocks are, their consequences, and the options available to diagnose them.

In a subsequent blog post, we’ll explore how the eG Java Monitor helps in pinpointing deadlock root causes down to the code level.

 

A typical production scenario

It is 2 am and you are woken up by a phone call from the helpdesk team. The helpdesk is receiving a flood of calls from application users. The application is reported to be slow and sluggish. Users complain that the browser keeps spinning and eventually all they see is a ‘white page’.

Still somewhat heavy-eyed, you go through the ‘standard operating procedure’. You notice that no TCP traffic is flowing to or from the app server cluster. The application logs aren’t moving either.

You are wondering what could be wrong when the VP (Operations) pings you over Instant Messenger asking you to join a war room conference call. You will be asked to provide answers and pinpoint the root cause – fast.

What are Java application deadlocks?

A deadlock occurs when two or more threads form a cyclic dependency on each other as shown below.

For example, suppose ‘thread 2’ is waiting on Resource A, which is owned by ‘thread 1’, while ‘thread 1’ is waiting on Resource B, which is owned by ‘thread 2’.

In such a condition, these two threads are ‘hanging’ indefinitely without making any further progress.

This results in an “application hang” where the process is still present but the system does not respond to user requests.
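A minimal, self-contained Java sketch of this classic lock-ordering deadlock is shown below. The class, method and resource names are purely illustrative, not taken from any real application.

public class DeadlockDemo {
    private static final Object resourceA = new Object();
    private static final Object resourceB = new Object();

    public static void main(String[] args) {
        Thread thread1 = new Thread(() -> {
            synchronized (resourceA) {          // thread 1 owns Resource A
                pause(100);                     // give thread 2 time to grab Resource B
                synchronized (resourceB) {      // ...then waits forever for Resource B
                    System.out.println("thread 1 acquired both locks");
                }
            }
        });
        Thread thread2 = new Thread(() -> {
            synchronized (resourceB) {          // thread 2 owns Resource B
                pause(100);
                synchronized (resourceA) {      // ...then waits forever for Resource A
                    System.out.println("thread 2 acquired both locks");
                }
            }
        });
        thread1.start();
        thread2.start();
        // Both worker threads now block each other indefinitely: the process
        // stays alive, but neither thread makes any further progress.
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}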

“The JVM is not nearly as helpful in resolving deadlocks as database servers are.

When a set of Java threads deadlock, that’s the end of the game. Depending on what those threads do, the application may stall completely”

Brian Goetz et al., authors of “Java Concurrency in Practice”

Consequences of deadlocks

1. Poor user experience

When a deadlock happens, the application may stall. Typical symptoms are “white pages” in web applications while the browser continues to spin, eventually resulting in a timeout.

Often, users retry their request by clicking refresh or re-submitting a form, which compounds the problem further.

2. System undergoes exponential degradation

When threads go into a deadlock situation, they take longer to respond. In the intervening period, a fresh set of requests may arrive in the system.


When deadlocks manifest in app servers, fresh requests get backed up in the ‘execution queue’. The thread pool hits maximum utilization, denying service to new requests and causing further exponential degradation of the system, as the sketch below illustrates.
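To see how the queue backs up, consider the hypothetical sketch below: a deliberately tiny worker pool (4 threads, a 10-slot execution queue) in which every ‘request handler’ is stuck behind a deadlock and never returns. Once the workers and the queue are full, each new request is rejected outright. The class name and pool sizes are illustrative only.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueBackupDemo {
    public static void main(String[] args) {
        // Hypothetical 'app server' pool: 4 worker threads, 10-slot execution queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(10));

        // Simulates a request handler stuck behind a deadlocked lock: it never returns.
        Runnable blockedRequest = () -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        for (int i = 1; i <= 20; i++) {
            try {
                pool.execute(blockedRequest);
                System.out.println("request " + i + " accepted, queue depth = "
                        + pool.getQueue().size());
            } catch (RejectedExecutionException e) {
                // Workers and queue are saturated; fresh requests are denied.
                System.out.println("request " + i + " rejected - pool saturated");
            }
        }
        pool.shutdownNow();   // interrupt the stuck handlers so the demo can exit
    }
}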

 3. Cascading impact on the entire app server cluster

In multi-tier applications, Web Servers (such as Apache or IBM HTTP Server) receive requests and forward them to Application Servers (such as WebLogic, WebSphere or JBoss) via a ‘plug-in’.

If the plug-in detects that an Application Server is unhealthy, it will “fail over” to another healthy application server, which then has to accept a heavier load than usual, resulting in further slowness.

This may cause a cascading slowdown effect on the entire cluster.

Why are deadlocks difficult to troubleshoot in a clustered, multi-tier environment?

Application support teams are usually caught off-guard when faced with deadlocks in production environments.


1. Deadlocks typically do not exhibit obvious symptoms such as a spike in CPU or memory usage. This makes it hard to track and diagnose deadlocks based on basic operating system metrics.

2. Deadlocks may not show up until the application is moved to a clustered environment. Single application server environments may not manifest latent defects in the code.

3. Deadlocks usually manifest at the worst possible time – under heavy production load. They may not show up under low or moderate load, and for the same reason they are difficult to replicate in a testing environment.

Options to diagnose deadlocks

There are various options available to troubleshoot deadlock situations.

1. The naïve way: Kill the process and cross your fingers

You could kill the application server process and hope that when the application server starts again, the problem will go away.

However, restarting the app server is a temporary fix that does not resolve the root cause. The deadlocks will simply be triggered again when the app server comes back up.

 

2. The laborious way: Take thread dumps in your cluster of JVMs

You could take thread dumps. To trigger a thread dump, you send the JVM a SIGQUIT signal (on UNIX, that is a “kill -3” command; on Windows, a “Ctrl-Break” in the console).

Typically, you would need to capture a series of thread dumps (for example, 6 thread dumps spaced 20 seconds apart) to infer any thread patterns – a single static snapshot may not suffice.

If you are running the application server as a Windows service (which is usually the case), it is a little more complicated. On the HotSpot JVM, you could use the jps utility to find the process ID and then the jstack utility to take thread dumps. You can also use the jconsole utility to connect to the process in question.
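Alternatively, JDK releases from 1.5 onward expose deadlock detection programmatically through the java.lang.management API, the same JMX interface that jconsole uses. A minimal in-process sketch is shown below; the class name is illustrative, and note that findDeadlockedThreads requires Java 6 or later (on Java 5, findMonitorDeadlockedThreads covers intrinsic locks only).

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        // Returns the IDs of threads deadlocked on object monitors or
        // ownable synchronizers, or null when no deadlock is found.
        long[] deadlockedIds = threadMXBean.findDeadlockedThreads();

        if (deadlockedIds == null) {
            System.out.println("No deadlocked threads detected");
            return;
        }
        for (ThreadInfo info : threadMXBean.getThreadInfo(deadlockedIds, Integer.MAX_VALUE)) {
            System.out.println(info);   // thread name, state, lock owner and stack frames
        }
    }
}

The same bean can also be reached from outside the process over a remote JMX connection, which is how tools such as jconsole flag deadlocks without a console-triggered thread dump.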

You would then have to forward the thread dumps to the development team and wait for them to analyze the dumps and get back to you. Depending on the size of the cluster, there could be many files to trawl through, which can take significant time.

This is not a situation you want to be in at 2 am when the business team is waiting on a quick resolution.

Caution: manual processes to troubleshoot deadlocks can be time-consuming

The manual approach of taking thread dumps assumes that you know which JVM(s) are suffering from deadlocks.

Chances are that the application is hosted in a high-availability, clustered Application Server farm with tens (if not hundreds) of servers.


If only a subset of the JVMs is affected by the deadlock problem, you may not know precisely which JVM is experiencing thread contention or deadlocks. You would have to resort to taking thread dumps across all of your JVMs.

This becomes a laborious, time-consuming trial-and-error approach. While it may be viable for a development or staging environment, it is not viable for a business-critical production environment where ‘Mean Time To Repair’ (MTTR) is key.

 

3. The smart way: Leverage an APM

While an APM (Application Performance Management) product cannot prevent deadlocks from happening, it can certainly provide deeper visibility into the root cause, down to the code level, when they do happen.

In the next blog post, we’ll explore how the eG Java Monitor can help provide an end-to-end perspective of the system in addition to pinpointing the root cause for deadlocks.

Stay tuned!

 

About the Author

Arun Aravamudhan has more than 15 years of experience in the Java enterprise applications space. In his current role with eG Innovations, Arun leads the Java Application Performance Management (APM) product line.

Java Monitoring Made Easy! How We Eat Our Own Dog Food :-)


Several years ago, when we started to use Java technology in our products, the technology was in its infancy. We had a lot of teething problems, but multi-platform support was important to us, so we pressed on with Java.

For many years, Java has lacked cost-effective, easy-to-use tools and methodologies for monitoring applications. Troubleshooting was often a manual, trial-and-error exercise.

As our monitoring application grew bigger, troubleshooting got far more complex. Byte-code instrumentation has been one of the common techniques used by Java monitoring and troubleshooting tools, but it has been very expensive and resource intensive, and hence often used in development environments rather than in production.

The last couple of Java releases (JDK 1.5 and higher) have incorporated excellent monitoring and diagnostic interfaces that can be used to troubleshoot Java applications. The need to understand how our own Java-based monitoring application performs necessitated that we take a closer look at these monitoring specifications and interfaces for the Java Virtual Machine.

The result – a new Java monitoring module that is going to be available as an integral part of our next major product release.

Monitoring Java Applications to the Code-level using the eG Java Monitor

This software has been used extensively in our labs for the last couple of years, and I have experienced first-hand how effective this technology is. The level of visibility and the precision of the diagnostics are incredible. This module has saved us endless hours of troubleshooting time, and we hope that when it gets to our customers, they will benefit in the same way.

We’re quite excited about this capability. The instrumentation provided in the JVM has been great, and hats off to our developers for building a very clean, easy-to-use interface that should be simple to use not just for any support person but will also appeal to any Java programmer, because it provides navigation similar to many of the tools built into the JVM.

You can read more about this technology here. You can also take a sneak peek at it by viewing a recorded demonstration here. Please do contact us if you are interested in getting access to an early release of this software.