Running Well: IT Problems in the News


Last week’s news put a bright spotlight on how much impact IT professionals have on our world. Unfortunately, it’s the bad news that gets the headlines, from a configuration error at the NYSE down to a single router “glitch” at United Airlines, but there is a positive side to the unfortunate delays and lost productivity of the past few days. These outages, whose negative impact was thankfully limited, have given many outside the IT world a brief, visceral understanding of all the work that goes into the interdependent, global technology ecosystem we live in. Inside companies around the globe, executives may have taken an extra moment last week to appreciate the solid IT pros they have on the job, keeping their business running, and running well.

Here at eG Innovations, we live and breathe IT performance, because this is where the rubber meets the road, so to speak, for our clients. The driving goal behind our unified performance solution, with specific emphasis on Citrix monitoring, Java application monitoring, SAP monitoring and others, is to provide you with immediately actionable information about slow-downs and outages throughout your network, beyond just data and metrics. While most of our customers focus on preventative and proactive performance management, it is also worth mentioning that our Correlative Intelligence engine is a time-tested technology for dramatically reducing Mean Time To Repair (MTTR) when outages do occur.

To learn more, below is a link to what our customers say about how our eG Enterprise technology helps them keep their IT engines running, and running well: http://www.eginnovations.com/web/usersurvey.htm

Troubleshooting Java Application Deadlocks – Diagnosing ‘Application Hang’ situations


Users expect applications to respond instantly. Deadlocks in Java applications cause ‘application hang’ situations, resulting in unresponsive systems and a poor user experience.

This blog post explains what deadlocks are, what their consequences can be, and the options available to diagnose them.

In a subsequent blog post, we’ll explore how the eG Java Monitor helps in pinpointing deadlock root causes down to the code level.

 

A typical production scenario

It is 2 am and you are woken up by a phone call from the helpdesk team. The helpdesk is receiving a flood of calls from application users. The application is reported to be slow and sluggish. Users are complaining that the browser keeps spinning and eventually all they see is a ‘white page’.

Still somewhat heavy-eyed, you go through the ‘standard operating procedure’. You notice that no TCP traffic is flowing to or from the app server cluster. The application logs aren’t moving either.

You are wondering what could be wrong when the VP (Operations) pings you over Instant Messenger asking you to join a war room conference call. You will be asked to provide answers and pinpoint the root cause – fast.

What are Java application deadlocks?

A deadlock occurs when two or more threads form a cyclic dependency on each other as shown below.

In this illustration, ‘thread 2’ is in a wait state waiting on Resource A owned by ‘thread 1’, while ‘thread 1’ is in a wait state waiting on Resource B owned by ‘thread 2’.

In such a condition, these two threads are ‘hanging’ indefinitely without making any further progress.

This results in an “application hang” where the process is still present but the system does not respond to user requests.
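
To make the cycle concrete, here is a minimal, self-contained sketch of the classic order-of-acquisition deadlock (the class and lock names are purely illustrative, not taken from any real application): thread 1 locks Resource A and then waits for Resource B, while thread 2 does the opposite.

```java
public class DeadlockDemo {
    private static final Object resourceA = new Object();
    private static final Object resourceB = new Object();

    public static void main(String[] args) {
        // Thread 1 locks A, then tries to lock B.
        Thread t1 = new Thread(() -> {
            synchronized (resourceA) {
                sleep(100); // give the other thread time to grab resourceB
                synchronized (resourceB) {
                    System.out.println("thread 1 acquired both locks");
                }
            }
        }, "thread-1");

        // Thread 2 locks B, then tries to lock A -- the opposite order.
        Thread t2 = new Thread(() -> {
            synchronized (resourceB) {
                sleep(100);
                synchronized (resourceA) {
                    System.out.println("thread 2 acquired both locks");
                }
            }
        }, "thread-2");

        t1.start();
        t2.start();
        // Both threads now wait on each other forever: a cyclic dependency.
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Acquiring the locks in a consistent order in both threads would remove the cycle and, with it, the deadlock.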

“The JVM is not nearly as helpful in resolving deadlocks as database servers are. When a set of Java threads deadlock, that’s the end of the game. Depending on what those threads do, the application may stall completely.”

Brian Goetz et al., authors of “Java Concurrency in Practice”

Consequences of deadlocks

1. Poor user experience

When a deadlock happens, the application may stall. Typical symptoms include “white pages” in web applications, with the browser continuing to spin until the request eventually times out.

Often, users retry their request by clicking refresh or re-submitting the form, which compounds the problem further.

2. System undergoes exponential degradation

When threads go into a deadlock situation, they take longer to respond. In the intervening period, a fresh set of requests may arrive in the system.


When deadlocks manifest in app servers, fresh requests get backed up in the ‘execution queue’. The thread pool hits maximum utilization, preventing new requests from being served. This causes further exponential degradation of the system.
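
As a rough illustration of this backup effect (the pool and queue sizes below are made up for the example, not taken from any particular app server), the sketch below shows a bounded thread pool whose workers are all blocked: fresh requests pile up in the queue, and once the queue is full, new requests are rejected outright.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDemo {
    public static void main(String[] args) {
        // A small pool with a bounded queue, standing in for an app server's execute queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));

        CountDownLatch neverReleased = new CountDownLatch(1);

        // Two "deadlocked" workers block forever, occupying every worker thread.
        for (int i = 0; i < 2; i++) {
            pool.execute(() -> await(neverReleased));
        }

        // Fresh requests back up in the bounded queue and never get served...
        pool.execute(() -> System.out.println("queued request 1"));
        pool.execute(() -> System.out.println("queued request 2"));

        // ...and once the queue is full, new requests are refused outright.
        try {
            pool.execute(() -> System.out.println("this never runs"));
        } catch (RejectedExecutionException e) {
            System.out.println("request rejected: pool and queue are saturated");
        }
        pool.shutdownNow();
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```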

 3. Cascading impact on the entire app server cluster

In multi-tier applications, Web Servers (such as Apache or IBM HTTP Server) receive requests and forward them to Application Servers (such as WebLogic, WebSphere or JBoss) via a ‘plug-in’.

If the plug-in detects that the Application Server is unhealthy, it will “fail over” to another healthy application server, which then has to accept a heavier load than usual, resulting in further slowness.

This may cause a cascading slowdown effect on the entire cluster.

Why are deadlocks difficult to troubleshoot in a clustered, multi-tier environment?

Application support teams are usually caught off-guard when faced with deadlocks in production environments.


1. Deadlocks typically do not exhibit obvious symptoms such as a spike in CPU or memory. This makes it hard to track and diagnose them using basic operating system metrics.

2. Deadlocks may not show up until the application is moved to a clustered environment. Single application server environments may not manifest latent defects in the code.

3. Deadlocks usually manifest at the worst possible time: under heavy production load. They may not show up in low or moderate load conditions, and for the same reason they are difficult to replicate in a testing environment.

Options to diagnose deadlocks

There are various options available to troubleshoot deadlock situations.

1. The naïve way: Kill the process and cross your fingers

You could kill the application server process and hope that when the application server starts again, the problem will go away.

However, restarting the app server is a temporary fix that does not resolve the root cause. The deadlocks will be triggered again once the app server comes back.

 

2. The laborious way: Take thread dumps in your cluster of JVMs

You could take thread dumps. To trigger a thread dump, you have to send a SIGQUIT signal (on UNIX, that would be a “kill -3” command; on Windows, a “Ctrl-Break” in the console).

Typically, you would need to capture a series of thread dumps (for example, 6 thread dumps spaced 20 seconds apart) to infer any thread patterns; a single static thread dump snapshot may not suffice.

If you are running the application server as a Windows service (which is usually the case), it is a little more complicated. If you are running the HotSpot JVM, you could use the jps utility to find the process id and then use the jstack utility to take thread dumps. You can also use the jconsole utility to connect to the process in question.
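
As an aside, the same thread information that jstack reports is also exposed programmatically through the standard java.lang.management API (a generic JDK facility, not anything product-specific), so a small utility run inside the JVM in question can check for deadlocks directly. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockChecker {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        // Returns the IDs of threads that are deadlocked on monitors or
        // ownable synchronizers, or null if there is no deadlock.
        long[] deadlockedIds = threadMXBean.findDeadlockedThreads();

        if (deadlockedIds == null) {
            System.out.println("No deadlocked threads detected in this JVM.");
            return;
        }

        ThreadInfo[] infos = threadMXBean.getThreadInfo(deadlockedIds, true, true);
        for (ThreadInfo info : infos) {
            System.out.printf("Thread '%s' is blocked on %s, held by '%s'%n",
                    info.getThreadName(),
                    info.getLockName(),
                    info.getLockOwnerName());
        }
    }
}
```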

You would have to forward the thread dumps to the development team and wait for them to analyze and get back. Depending on the size of the cluster, there would be multiple files to trawl through and this might entail significant time.

This is not a situation you want to be in at 2 am when the business team is waiting on a quick resolution.

Manual processes to troubleshoot deadlocks can be time-consuming

The manual approach of taking thread dumps assumes that you know which JVM(s) are suffering from deadlocks.

Chances are that the application is hosted in a high-availability, clustered Application Server farm with tens (if not hundreds) of servers.


If only a subset of the JVMs is experiencing the deadlock problem, you may not know precisely which JVM is suffering from thread contention or deadlocks. You would have to resort to taking thread dumps across all of your JVMs.

This becomes a trial-and-error approach which is laborious and time consuming. While this approach may be viable for a development or staging environment, it is not viable for a business-critical production environment where ‘Mean Time To Repair’ (MTTR) is key.

 

3. The smart way: Leverage an APM

While an APM (Application Performance Management) product cannot prevent deadlocks from happening, it can certainly provide deeper visibility into the root cause, down to the code level, when they do happen.

In the next blog post, we’ll explore how the eG Java Monitor can help provide an end-to-end perspective of the system in addition to pinpointing the root cause for deadlocks.

Stay tuned!

 

About the Author

Arun Aravamudhan has more than 15 years of experience in the Java enterprise applications space. In his current role with eG Innovations, Arun leads the Java Application Performance Management (APM) product line.

Web Application Performance Monitoring – Two Challenges you need to tackle to solve the “It’s Slow” problem


War room: multiple technology groups troubleshoot an application performance problem, wondering whether it is due to the network, app server, virtualization platform, database or custom application.

As an application owner or architect, have you been in situations where the users complain that the Web Application is slow, but there is no clear root cause?

How do we go from finger-pointing and hit-and-miss troubleshooting to pin-pointing the root cause?

This blog post frames the problem statement and outlines two key challenges that you need to tackle and resolve. Stay tuned as we dive into potential solutions in future blog posts.

 

A typical “It’s Slow” Scenario

The other day, I was talking to an IT Director of a leading multi-national bank, who explained his monitoring challenges which you may be able to relate to.

Case Study

The application has a typical multi-tier architecture built using multiple technologies: multiple load balancers, web servers, application servers, backend web services, message queues and databases.

Users would always complain that the service is slow.

The organization had a ton of infrastructure tools that captured metrics for CPU, Memory and Disk. In addition, each team had point-products specific to their domain (middleware, database, network etc.).

As stated by the IT Director:

“Every time users complained of slowness, we would assemble a war room – all hands on deck. What’s frustrating for me is that although we have multiple cross-functional teams jumping in – nothing specific or actionable emerges.”

You’ve been in these situations before. Why is pinpointing root cause in a multi-tier Web Application such a pain?

There are several reasons, but it all boils down to this: the lack of a holistic, coherent, end-to-end performance monitoring and management perspective. Below is a flavor of the key challenges, which we’ll explore in depth in future blog posts.

Challenge #1: Symptoms Everywhere, Root Cause Nowhere

Most application performance management and monitoring is silo-based. When performance problems occur, each siloed team looks at point products in their own domain.

In a multi-tier application, there are complex inter-dependencies. There is a ripple effect of problems that cascade across tiers. Teams get caught up in troubleshooting the symptoms without understanding where the root-cause lies. Often, this leads to long troubleshooting cycles and increases the Mean Time To Repair (MTTR).

Multi-tier applications have complex inter-dependencies.
Can you quickly pinpoint root cause from a variety of symptoms?

Challenge #2: “Ain’t My Problem” syndrome

Sometimes, the exact opposite happens. Each siloed team lives on its own island with its own toolset:

  • DBAs have database analysis tools
  • Network admins have packet sniffers, probes and protocol analyzers
  • Web server admins have web server log mining tools
  • Middleware admins have administration consoles such as the WebLogic, WebSphere or JBoss console (assuming this is a Java web application)
  • Application support teams have to trawl through tons of spaghetti custom logs

Each of the above teams has a narrow view of the system, since there is little or no coordination across these siloed toolsets. The result is the lack of an integrated view of the system.

Multi-tier triage is hard because of the independent and siloed toolsets

Case Study (continued)

The bank’s IT team was challenged by both types of problems stated above.

The service owner was left wondering what the true root cause of the problem could be. Is it the Network? App Server? Custom code? Database?

No solution in sight

Case Study (continued)

Hours later, the team would still be in the dark on the true root cause. Invariably, the application support team is blamed for the problem and bears the brunt of the finger-pointing that typically occurs for such inexplicable problems.

How do we go from Silo Management to True Service Management?

You need a single pane of glass that not only pinpoints root cause but also significantly reduces the MTTR.

The results include: Better service quality, satisfied users, productive IT staff, and lower operations costs, all of which can ensure great ROI from service management.

Watch for the next post in this series!

About the Author

Arun Aravamudhan has more than 15 years of experience in the Java enterprise applications space. In his current role with eG Innovations, Arun leads the Java Application Performance Management (APM) product line.

New White Paper: Managing Java Application Performance


Java-based applications are powering many business-critical IT services. Performance monitoring and diagnosis of the Java Virtual Machine can provide key insights into performance issues that can have a significant impact on the business services it supports.

For example, a single runaway thread in the JVM could take up significant CPU resources, slowing down performance for the entire service. Alternatively, a deadlock between two key threads could bring the business service to a grinding halt.

Read the white paper “Managing Java Application Performance” and find out how to deliver:

  • Reliable performance assurance and user satisfaction
  • Complete performance visibility across your service environment
  • Automatic, rapid root cause performance diagnosis and analytics for even the most complex performance problems
  • Pre-emptive problem detection and alerting
  • Rapid ROI and cost savings through right-sizing and optimization

Designing High Performance Java / J2EE Applications is not Easy!


Business applications developed in Java have become incredibly complex. Java developers must have expertise with numerous technologies (JSPs, Servlets, EJBs, Struts, Hibernate, JDBC, JMX, JMS, JSF, Web services, SOAP, thread pools, object pools, etc.), not to mention core Java principles like synchronization, multi-threading and caching. A malfunction in any of these technologies can result in slow-downs, application freezes, and errors in key business applications.

Anatomy of a Java developer

In an article I was reading last week, I came across a very interesting table that highlighted the different types of failures commonly seen in J2EE applications. Below is an adaptation of this table, listing common J2EE problems and their causes. It gives a very good idea of why designing high-performance Java/J2EE applications requires a lot of expertise (and, of course, why you need the right tools handy to be able to troubleshoot such applications rapidly, with minimal effort).

Disease: Bad Coding: Infinite Loop
Description: Threads become stuck in while(true) statements and the like. This comes in CPU-bound and wait-bound/spin-wait variants.
Symptoms: Foreseeable lockup.
Causes or cures: You’ll need to perform an invasive loop-ectomy.

Disease: Bad Coding: CPU-bound Component
Description: This is the common cold of the J2EE world. One bit of bad code, or a bad interaction between bits of code, hogs the CPU and slows throughput to a crawl.
Symptoms: Consistent slowness; slower and slower under load.
Causes or cures: The typical big win is a cache of data or of performed calculations.

Disease: The Unending Retry
Description: This involves continual (or, in extreme cases, continuous) retries of a failed request.
Symptoms: Foreseeable backup; sudden chaos.
Causes or cures: It might just be that a back-end system is completely down. Availability monitoring can help there, or simply differentiating attempts from successes.

Disease: Threading: Chokepoint
Description: Threads back up on an over-ambitious synchronization point, creating a traffic jam.
Symptoms: Slower and slower under load; sporadic hangs or aberrant errors; foreseeable lockup; sudden chaos.
Causes or cures: Perhaps the synchronization is unnecessary (with a simple redesign), or perhaps more exotic locking strategies (e.g., reader/writer locks) may help.

Disease: Threading: Deadlock / Livelock
Description: Most commonly, it’s your basic order-of-acquisition problem.
Symptoms: Sudden chaos.
Causes or cures: Treatment options include detecting whether locking is really necessary, using a master lock, deterministic order-of-acquisition, and the banker’s algorithm.

Disease: Over-Usage of External Systems
Description: The J2EE application abuses a back-end system with requests that are too large or too numerous.
Symptoms: Consistent slowness; slower and slower under load.
Causes or cures: Eliminate redundant work requests, batch similar work requests, break up large requests into several smaller ones, tune work requests or the back-end system (e.g., indexes for common query keys), etc.

Disease: External Bottleneck
Description: A back end or other external system (e.g., authentication) slows down, slowing the J2EE app server and its applications as well.
Symptoms: Consistent slowness; slower and slower under load.
Causes or cures: Consult a specialist (responsible third party or system administrator) for treatment of the external bottleneck.

Disease: Layer-itis
Description: A poorly implemented bridge layer (JDBC driver, CORBA link to a legacy system) slows all traffic through it to a crawl with constant marshalling and unmarshalling of data and requests. The disease is easily confused with External Bottleneck in its early stages.
Symptoms: Consistent slowness; slower and slower under load.
Causes or cures: Check version compatibility of the bridge layer and the external system. Evaluate different bridge vendors if available. Re-architecture may be necessary to bypass the layer altogether.

Disease: Internal Resource Bottleneck: Over-Usage or Under-Allocation
Description: Internal resources (threads, pooled objects) become scarce. Is over-utilization occurring in a healthy manner under load, or is it because of a leak?
Symptoms: Slower and slower under load; sporadic hangs or aberrant errors.
Causes or cures: Under-allocation: increase the maximum pool size based on the highest expected load. Over-usage: see Over-Usage of External Systems.

Disease: Linear Memory Leak
Description: A per-unit (per-transaction, per-user, etc.) leak causes memory to grow linearly with time or load. This degrades system performance over time or under load, and recovery is only possible with a restart.
Symptoms: Slower and slower over time; slower and slower under load.
Causes or cures: This is most typically linked with a resource leak, though many exotic strains exist (for example, linked-list storage of per-unit data, or a recycling/growing buffer that doesn’t recycle).

Disease: Exponential Memory Leak
Description: A leak with a doubling growth strategy causes an exponential curve in the system’s memory consumption over time.
Symptoms: Slower and slower over time; slower and slower under load.
Causes or cures: This is typically caused by adding elements to a collection (Vector, HashMap) that are never removed.

Disease: Resource Leak
Description: JDBC statements, CICS transaction gateway connections, and the like are leaked, causing pain for both the Java bridge layer and the back-end system.
Symptoms: Slower and slower over time; foreseeable lockup; sudden chaos.
Causes or cures: Typically, this is caused by a missing finally block, or more simply a failure to close objects that represent external resources.
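
To make the last few rows of the table concrete, here is a small, deliberately broken sketch (the class, cache and query names are invented for illustration) that combines an unbounded collection leak with a JDBC resource leak:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class LeakyDao {
    // Memory leak: entries are added per request but never removed or bounded,
    // so heap usage grows with load until the JVM slows down or runs out of memory.
    private static final Map<String, ResultHolder> CACHE = new HashMap<>();

    public ResultHolder lookup(Connection connection, String key) throws SQLException {
        // Resource leak: the Statement and ResultSet are only closed on the happy path.
        // If the query throws, they are never closed. A try-with-resources block
        // (or a finally block) would close them on every path.
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery("SELECT value FROM lookup WHERE key = '" + key + "'");
        ResultHolder result = rs.next() ? new ResultHolder(rs.getString(1)) : null;
        rs.close();
        statement.close();

        CACHE.put(key + System.nanoTime(), result); // unique keys: the cache never hits, only grows
        return result;
    }

    public static final class ResultHolder {
        final String value;
        ResultHolder(String value) { this.value = value; }
    }
}
```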

As you can see from the above table, monitoring a J2EE application end-to-end requires:

  • Tracking key metrics specific to the application server in use (e.g., WebLogic, WebSphere, JBoss, Tomcat, etc.)
  • Monitoring of the external dependencies of the Java application tier – e.g., databases, Active Directory, messaging servers, networks, etc.
  • Finally, all of the metrics have to be correlated, based on time and on the inter-dependencies between applications in the infrastructure, so that when a problem occurs, administrators are equipped to quickly determine what is causing it: the network? the database? the application? the web tier?

Below are several relevant links about how eG Enterprise helps with end-to-end monitoring, diagnosis, and reporting for J2EE applications.

Also of interest is the online webinar titled “Managing N-Tiers without Tears”, available for viewing on our website.

Java Monitoring Made Easy! How We Eat Our Own Dog Food :-)


Several years ago, when we started to use Java technology in our products, the technology was in its infancy. We had a lot of teething problems, but multi-platform support was important to us and we stuck with Java.

For many years, Java lacked cost-effective, easy-to-use tools and methodologies for monitoring applications. Troubleshooting was often a manual, trial-and-error exercise.

As our monitoring application got bigger, troubleshooting became far more complex. Byte-code instrumentation has been one of the common approaches used by monitoring and troubleshooting tools for Java applications, but it has been expensive and resource-intensive, and hence is often confined to development environments rather than production.

The last couple of Java releases (JDK 1.5 and higher) have incorporated excellent monitoring and diagnostic interfaces that can be used to troubleshoot Java applications. The need to understand how our own Java-based monitoring application was performing necessitated that we take a closer look at these monitoring specifications and interfaces for the Java Virtual Machine.
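
To give a flavor of what those JDK interfaces expose (this is the standard java.lang.management API shipped with the JVM, not our product code), here is a minimal sketch that reads a few basic JVM health metrics:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.RuntimeMXBean;
import java.lang.management.ThreadMXBean;

public class JvmHealthSnapshot {
    public static void main(String[] args) {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        MemoryUsage heap = memory.getHeapMemoryUsage();

        System.out.printf("JVM uptime:         %d ms%n", runtime.getUptime());
        System.out.printf("Heap used/max:      %d / %d bytes%n", heap.getUsed(), heap.getMax());
        System.out.printf("Live threads:       %d%n", threads.getThreadCount());
        System.out.printf("Peak threads:       %d%n", threads.getPeakThreadCount());
        System.out.printf("Deadlocked threads: %s%n",
                threads.findDeadlockedThreads() == null ? "none" : "yes - investigate!");
    }
}
```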

The result: a new Java monitoring module that will be available as an integral part of our next major product release.

Monitoring Java Applications to the Code-level using the eG Java Monitor

This software has been extensively used in our labs for the last couple of years, and I have experienced first-hand how effective this technology is. The level of visibility and the precision of the diagnostics are incredible. This module has saved us endless hours of troubleshooting time, and we hope that when it gets to our customers, they will benefit in the same way.

We’re quite excited about this capability. The instrumentation provided in the JVM has been great, and hats off to our developers for building a very clean, easy-to-use interface that should be simple to use not just for any support person, but will also appeal to any Java programmer, because it provides navigation similar to many of the tools built into the JVM.

You can read more about this technology here. You can also take a sneak peek at this technology by viewing a recorded demonstration here. Please do contact us if you are interested in getting access to an early release of this software.