4 Keys to a Successful Citrix Migration


If you are considering a Citrix migration, it is very likely that one or more of the following needs is driving your decision making.

  • Your Citrix licenses are about to expire and you are concerned about losing access to Citrix support
  • It is time to move from a less secure messaging, email or collaboration application to a more secure and flexible Citrix solution
  • You are not sure how you are going to measure key performance before, during and after you migrate to Citrix XenMobile MDM
  • You want and need access to more in-depth reporting for all aspects of your environment before a migration plan is created
  • Application “slow-time” or intermittent issues are affecting the productivity of end users and need to be tracked down and eliminated before a migration can take place

Everyone involved, from the CEO to IT Managers and Admins, has their own concerns and perspectives about initiating a Citrix migration, and these often overlap.

Executives and IT Managers care about automation, modernization, reducing OPEX, increasing productivity and maintaining a seamless user experience.

While they are concerned with the strategic business needs and outcomes, the responsibility for delivering a Citrix migration on time, on budget and seamlessly to end users rests on the shoulders of the admins in the trenches.

Simply stated, migrations are difficult. Managing multiple apps, platforms and domains, each with its own tools, level of visibility, interface and data, makes it hard to pinpoint and troubleshoot the root-cause of problems when they occur.

IT needs a single solution that provides universal insight, correlates the performance of all interdependencies and uses a simple methodology to help accelerate time to resolution.

They need advanced KPI and performance metrics so they can profile performance and model the new environment easily in order to bring the migration project in on time and on budget.

So how important is performance management to a migration?

Bloor Research and Harvard Business School found the following to be true after studying almost 1,500 different enterprise migration projects.

  • 28% of migrations result in cost overruns
  • 38% of migrations run over budget, fail or are abandoned
  • 17% of migrations result in a 200% cost overrun and a 70% schedule overrun

Their research revealed three common reasons why migration projects are abandoned, fail or suffer from time and budget overruns.

  • Limited or no insight across enterprise interdependencies
  • Limited or no KPI data profiling solution
  • Limited or no data driven decision making

When organizations were trying to plan their migration or manage business and IT changes due to acquisition, merger, consolidation or virtualization, either a KPI data profiling solution wasn’t in the original budget or a KPI profiling solution was cut from the budget leaving the project at risk.

But all is not lost. They also discovered the following encouraging statistics during their research.

  • 72% of migrations that are both on time and on budget included a KPI data profiling solution as part of the original budget plan before work began.
  • 62% of ongoing migration projects can be brought in on time and on budget by implementing a KPI data profiling solution

Their research suggests the following.

  • Only two-thirds of data migration projects are on time and on budget.
  • CIOs can be “firewall executives” and reduce the risk of potential damage to the company’s image and reputation from a failed or late project by instituting a KPI data profiling solution.
  • Establish funding for a KPI and data profiling solution as part of a migration plan before the project begins!
  • Using a tried and tested methodology for measuring performance can help ensure a successful migration

At eG Innovations, we have determined that there are four keys to a successful Citrix migration.

  1. Universal Insight Across the Enterprise
  2. Testing and Troubleshooting
  3. Building accurate Performance Profiles
  4. Maintaining a Positive End User Experience

eG Enterprise and the universal insight it provides can help you with multiple aspects of your next Citrix migration project by making it easier to do the following.

  • Identify all interdependencies regardless of where they reside
  • Test and troubleshoot issues that could result in downtime, slow-time, budget and schedule overruns
  • Maintain a working coexisting model and active user profiles until migration is complete
  • Prototype, profile and right-size the new environment
  • Verify that performance expectations are met before archiving old data and retiring obsolete systems

Here is a breakdown of the four keys to a successful Citrix migration.

Successful Citrix migrations START with having universal insight across the enterprise. That includes all interdependencies, from apps to platforms and domains, whether in the data center, virtual space or the cloud. It covers everything from end user experience metrics to network latency, application and database responsiveness, server health, the virtual machines and OS, CPU, memory, disk resources and more.

Testing and troubleshooting an environment proactively is key to ensuring that the migration process is seamless and doesn’t negatively affect end user productivity. Correlated performance metrics will help reveal hidden or intermittent issues that exist within the current environment so they can be avoided within the NEW environment. They will also help admins quickly determine where the actual root-cause is so they don’t waste time diagnosing symptoms.

With eG Enterprise, once an admin is alerted to an issue, they can drill down to the root-cause of the problem in just a few clicks and determine a rapid solution, such as locating where VM resources are constrained and moving the workload or adding more resources to remove the bottleneck.

Building performance profiles for both the current and the new Citrix environment is imperative to success. Using a solution that correlates performance metrics across all interdependencies within a single interface makes it much easier to measure and establish baseline metrics for the behavior and performance of current apps, databases, OS and supporting hardware. eG Enterprise empowers data driven decision making when modeling, profiling and right-sizing the new environment. KPI data and performance profiling may also help with other decisions, such as deciding:

  • NEW standard images and profiles
  • What licenses and drivers are necessary
  • Applications required for all systems in NEW standard images
  • How to prioritize the different phases of the migration
  • What data and apps will be migrated and what end users will be required to transfer
  • What workloads or vm resources need to be load balanced
  • What can be archived, refreshed, repurposed or retired

Proactively managing and monitoring performance is what helps IT maintain a positive end user experience. eG Enterprise provides the universal insight needed to see and measure all of the interdependencies that can affect end user experience and productivity.

That includes, but is not limited to, logon times, CPU, memory resources, I/O reads and writes, latency, Storage Zones, NetScaler devices, XenApp, ShareFile, VMware, SQL, Oracle and more. After migrating to the new environment, and before repurposing or retiring any of the old systems, it’s best to let the new environment run for several weeks to confirm availability and stability and to ensure that any intermittent issues are identified and resolved.

For more information about making eG Enterprise the center of your next Citrix migration, for a free trial, or to schedule a live demo, send a request to info@eginnovations.com or go to www.eginnovations.com

How to Resolve a Complex IT Problem in Just a Few Clicks with eG Enterprise



One of the most common and yet difficult things for any admin to accomplish is to troubleshoot end user “slow-time” issues. Application, database, network and server unresponsiveness, or “slow-time”, negatively affects enterprise performance and end user productivity ten times more often than downtime and can originate from just about anywhere within the enterprise.

Misconfiguration due to human error, missing drivers, intermittent memory faults, network IP cache errors, unbalanced workloads and constrained virtual resources can all be the root-cause of slow-time, or they could just be a resulting symptom. The key to resolving such issues is getting to the root-cause quickly, before they spread to other systems and bring productivity to a standstill.

In the following walkthrough, I detail how such a scenario can be resolved quickly and easily, before end users notice, with the help of eG Enterprise. This example focuses primarily on Citrix and VMware, but eG Enterprise can help IT departments maintain maximum productivity for millions of combinations of enterprise components.

eG Enterprise is a 100% web based solution, making it possible for anyone in IT, from the CIO and IT Managers to admins and helpdesk specialists, to proactively monitor their environment anytime, anywhere, on any device.

eG Alarms Window for Blog

When a “slow-time” error occurs, eG Enterprise automatically generates an alarm to the appropriate admin so they can take action immediately. The solution correlates and color codes the minor, major and critical alerts and displays them using a layer model with the most critical alert at the top.

eG Alarms Window Details for Blog

According to the alert, virtual CPU usage in the VMware ESX system console is high. The system console is a bootstrap operating system of ESX and should only be using about 2% of the CPU allocated to it. By scrolling over the description we can see that usage has suddenly increased to 100%. Left alone, this would surely affect Citrix performance and generate a large number of support calls to IT from end users.

eG Detailed Diagnosis Window for Blog

Fortunately, eG Enterprise’s patented detailed diagnosis technology makes identifying the root-cause a breeze. By scrolling to the right and using the magnifying glass icon, the root-cause is revealed within the Detailed Diagnosis window. The window displays information for the top 10 processes using virtual CPU resources; to the right of the window those processes are listed as SAMBA backups.

eG Fix Feedback Window

The root-cause is simply this: the VMware admin is performing a normal backup, but it is taking place before the end of the workday, potentially affecting Citrix users when they attempt to log on and access applications. The best solution is to contact the VMware admin, explain the situation and either agree to reschedule the backups or adjust virtual resources.

In this case, the Citrix and ESX admins agree to reschedule the backups, and then the Citrix admin uses the Fix Feedback feature within eG Enterprise to document the event and the agreed solution and to save the record. Resolving the issue took just a few clicks and a quick phone call between admins.

eG Enterprise provides universal insight across platforms and domains, whether they exist in the Cloud, the data center or in virtual space. It is for this reason that the Citrix admin had the visibility they needed to identify the root-cause as a virtual resource constraint within the VMs that support Citrix.

The following is a more detailed look at the Citrix admin’s view of the eG Enterprise Universal Insight dashboard as well as the methodology and technology behind the solution.

eG Universal Insight Dashboard for Blog

The color codes of eG Enterprise are familiar: green is “Normal”, yellow is a “Minor” alert, orange is a “Major” alert and red is a “Critical” alert requiring immediate attention.

The dashboard provides universal insight for the 12 different components that comprise the two Citrix services they are monitoring. The Component Type panel lists the details for each of the 12 components.

Clicking on the listed services in the middle of the Infrastructure Health panel on the left will reveal which of the two services is generating alerts.

The Measure at – A – Glance panel at the bottom left lists the measurements and tests conducted for each of the 12 components that comprise the two different Citrix services being monitored. Details include CPU utilization, free memory, active Citrix sessions, and more.

The bell icon on the top right of the window is a link to the alarm window details viewed previously.

eG Infomart Services Window for Blog

After clicking on the Services panel, eG Enterprise presents two Citrix service icons for the different services. It appears that Infomart is the service experiencing major issues. Clicking on the Infomart icon opens the list of Web Transactions for the Infomart service.

eG Informart Transactions Window for Blog

The list indicates that Application Access and User Logons are experiencing errors. The Citrix admin may click on the Topology tab or the transactions themselves for a service topology graph for the Infomart service.

eG Citrix Topology Window for Blog

Navigating the Infomart service topology from left to right, end users are connecting to Infomart through a network node, then a web server that is experiencing minor errors; the requests then reach a Citrix Zone Data Collector, which sends the request to one of the Citrix XenApp servers, which is experiencing major errors. The XenApp server then accesses the appropriate file, print or database server on the backend.

Based on the color codes eG Enterprise is indicating that the primary focus should be on the Citrix XenApp servers. Clicking on them will present the Citrix admin with either a physical or a virtual topology for Citrix XenApp depending on the supporting host.

eG Virtual Citrix XenApp Topology Window for Blog

The virtual topology indicates that the VMware ESX virtual machines hosting the Citrix Zone Data Collector, the web server and Citrix XenApp are experiencing critical errors; that is where the focus needs to be. Clicking on that area of the virtual service topology reveals the elements that support those virtual machines.

eG Layer Model Window for Blog

This is the eG Enterprise Layer model for the Citrix XenApp service topology. On the right are all of the elements that support the virtual machines as well as the various tests that correspond to that layer. As an example, within the OS layer for the virtual hypervisor are measurements and tests for the System Console, CPU, Disk Space and more. Depending on which layer is selected, the information within the right panel changes accordingly.

eG Detailed Diagnosis Window for Blog

eG Enterprise has already identified that virtual CPU resources within the system console are constrained. Using the magnifying glass icon on the far right opens the same Detailed Diagnosis window previously accessed from the Alarms window, and the same SAMBA errors are viewable. This confirms the previous diagnosis.

Regardless of the path an admin chooses to use, identifying the root-cause of a complex IT problem takes just a few clicks with eG Enterprise.

eG Infrastructure Health Reporting Window for Blog

The last thing I will cover is some of the reporting benefits that eG Enterprise provides. By returning to the Universal Insight dashboard and selecting the Reporter tab, anyone in IT can pull performance reports for the infrastructure.

Reports are available based on Function, Component, Service, or Segment; two of the more important are Operational KPI and Capacity Planning. Easy access to comprehensive reporting makes it possible to maintain business continuity, predict peak needs and ensure future readiness for emerging technologies while keeping costs down and increasing productivity.

The eG Enterprise methodology is simple, the technology is powerful and the universal insight is comprehensive.

For a free trial, to schedule a live demo or obtain more information about eG Enterprise send a request to info@eginnovations.com or go to our website at www.eginnovations.com

Reduce Downtime and Slow-time in the Borderless Enterprise


The era of the borderless enterprise is here!

Ensuring on-demand availability of apps, data and seamless collaboration via the cloud, across the web, virtual networks, servers and storage requires complete and total visibility across platforms, domains, time zones and geographies.

When apps and databases are unavailable or slow and unresponsive, the downstream impact can be truly devastating. Customer loyalty suffers, industry reputations become damaged, strategic partnerships dissolve, legal liabilities increase, stock values decrease and financial markets collapse.

Do you know how many man hours are affected by downtime and slow-time each year at just the Fortune 500?

Do you know the estimated compensation cost of downtime and slow-time compared to annual GDP?

Are you a CIO or IT Executive and want to know more?

Within our most recent Enterprise Solution Brief “Reducing Downtime and Slow-time in the Borderless Enterprise” you will discover the following:

  • 3 key benefits of eG Enterprise Universal Insight within the Borderless Enterprise
  • The REAL Cost of Downtime and Slow-time
  • Why “Universal Insight is Required”
  • How eG Enterprise Universal Insight is “Ready for the Future of the Borderless Enterprise”
  • How eG Enterprise Universal Insight is a “Force Multiplier for Enterprise IT”

Read the brief, then let us show you the numerous ways that eG Enterprise Universal Insight can help you reduce downtime and slow-time, enhance IT service performance, increase operational efficiency and ensure IT effectiveness.

Email info@eginnovations.com

Call 866.526.6700

Enhance Healthcare IT and Improve Patient Care


FACT: Preventable medical errors are the 3rd leading cause of death in the United States, right behind heart disease and cancer. It doesn’t seem possible, but according to “Death by Medical Mistakes Hits Record” in Healthcare IT News it’s true, and the related financial costs may be as high as $1 trillion.

Certainly many of the deaths are due to situations Healthcare IT has no influence over, but some are due to preventable problems such as inaccurate and missing records, or systems that are completely down or experiencing application or database slow-time when doctors, clinicians and patients are in critical need.

According to another Healthcare IT News article, “The True Nature of Recovery: 5 Ways to Mitigate Downtime, Data Loss”, there are a number of things that Healthcare CIOs and IT managers can do to mitigate the risks to both patients and the bottom line of healthcare organizations. Though the article is focused on disaster recovery, it is very quick to point out that knowing your systems, avoiding points of failure and ensuring that you have the data necessary to right-size, avoid cost overruns and prepare for emerging technologies are highly important.

What Healthcare IT organizations need is a solution that helps reduce IT complexity, increases performance, tracks app adoption, offers role based reporting and grants CIOs, Managers and admins “universal insight” across the solution stack as well as all other interdependent physical and virtual layers; and they need the ability to do it all from just a single interface instead of multiple silo-centric tools.

Within our most recent Enterprise Solution Brief specifically for Healthcare IT you will discover

  • 4 key benefits that improve Healthcare IT efficiency and result in enhanced patient care
  • What factors are driving “Healthcare IT Transformation”
  • How to “Enable Successful Healthcare IT Transformation”
  • How eG Enterprise Universal Insight is “Ready for the Future of Healthcare IT”
  • How eG Enterprise Universal Insight is a “Force Multiplier for Healthcare IT”

Read the brief then let us show you the many ways that eG Enterprise Universal Insight can enhance your Healthcare IT department, improve patient care and save lives.

Email info@eginnovations.com

Call 866.526.6700

In 2015 CIOs Will Need Universal Insight Across the Enterprise


Silo-centric monitoring and management tools are fine for supporting a confined infrastructure with very limited interdependencies, but for CIOs seeking to prevent drains on CAPEX and ensure they are contributing to OPINC, they aren’t enough.

Resolving IT pain and avoiding unscheduled downtime is the traditional focus of APM and NPM solutions and it will continue to be important in 2015 but emerging technologies like Cloud, mobility and XaaS are reinventing the role of the modern CIO.

Those who virtualized and consolidated their data center servers over the last ten years are now extending virtualization across the enterprise to include networking, storage, cloud and mobility deployments.

With the new era of the borderless enterprise upon them, CIOs are keenly aware of the need to research and adopt solutions that both resolve traditional IT pain and give them universal insight across the enterprise so they can ensure emerging technologies deliver greater value to the business.

Below are some of the predictions for 2015 that will shape the need for CIOs to have universal insight across the enterprise…

Enable workspace flexibility. IT executives are being challenged to adopt and integrate mobile solutions at a blinding pace, but 70% of end user devices cannot pass basic compliance and security tests, so introducing foreign devices on the corporate network poses serious risks. CIOs will need to ensure availability while managing access, device and user compliance, and security. Having universal insight across user and device profiles, approved and blacklisted apps, databases and domains will be critical to success.

XaaS becomes the new IT stack for Hybrid Cloud. The era of everything as a service has arrived. The development of virtual cloud and mobility apps is driving IT innovation and the consolidation of intellectual properties at a blinding pace. To maintain market share and demonstrate thought leadership, traditional product centric companies will accelerate efforts to bring new XaaS offerings to market. The rollout and adoption of new service offerings like vDaaS, DRaaS, IaaS, MWaaS, PaaS, and WPaaS will grow and mature in 2015 and throughout the remainder of the decade.

The borderless enterprise explosion will usher in a new era of compliance and security. The gaps and interdependencies between cloud, mobility, virtual and shared infrastructures, social media platforms and SaaS will inspire a renewed focus on compliance and security the way email, malware and network security have before. APM providers will be faced with some interesting choices; either acquire or develop additional internal compliance and security expertise, partner with an existing security provider or remain focused on their existing silo niche. CIOs will be left with deciding to go all-in with a security provider, purchase silo-centric solutions that provide limited compliance and security visibility, or evaluate and choose an APM NPM solution that meets most of their needs now as APM NPM compliance and security maturity continues to grow.

Containerization remains a test and development play, for now. Game changer, disruptive and death knell are all phrases tossed about when containerization solutions are discussed as an alternative to traditional virtualization, but the reality is much less dramatic. Container solutions are well suited for accelerating Linux app portability and reducing associated overhead. Google and Facebook have deployed containers very successfully, but their demands for rapid deployment and scale are different from those of most corporate customers. Until container solutions are cross compatible and offer mature management options and enhanced security capabilities, expect containers to remain a solution for Linux test and development environments while traditional VMs meet the majority of data center production demands.

Data is the new natural resource and almost as important as air and water. Ensuring on-demand availability of data and seamless collaboration via the cloud, across virtual networks, servers and storage will require that CIOs have complete and total visibility across platforms, domains, time zones and geographies in 2015. Ensuring maximum uptime, preventing downtime and performing root-cause analysis are now IT table stakes. Having universal insight across the enterprise, providing a seamless user experience, preventing slow-time and increasing productivity via the borderless enterprise are the new business goals CIOs must be aligned with and focused on.

Ensure IT effectiveness and business alignment. CIOs must align IT initiatives with desired business outcomes for productivity, growth and profit. APM historical performance reports provide the empirical data they need to help them balance workloads, right-size the enterprise and eliminate cost overruns so capacity planning meets the business needs of today while preparing for the emerging technologies of tomorrow.

Deliver a positive user experience through enhanced service performance. End users judge their experience relative to their ability to be productive and complete an end goal. Whether end users are employees and partners seeking to work seamlessly between the office and a mobile device as they move across domains or customers accessing a web cart, they all expect apps and databases to be available, accessible and responsive. CIOs will rely heavily on APM solutions to provide KPI for user logons, average response time, page loads, app adoption, abandonment rates and other metrics to ensure that end users are happy and productive.

Expanded use of KPI metrics. What started with call center, helpdesk and customer service metrics is expanding rapidly. APM solutions that can be adapted to collect KPIs for industry and role specific applications are influencing the decision making of CEOs, CFOs and other executives. APM solutions will be used to measure and determine the viability of pilot programs, industrial expansion and even the purchase of competing intellectual properties.

Improve operational efficiency. Reliance on command line interfaces and technology trees is functional but outdated. CIOs will arm and empower IT managers, admins and specialists with APM solutions that are customizable, intuitive, integrate easily with existing NOC tools and provide a unified view of the enterprise. The end goal will be to accelerate time to resolution, eliminate guesswork, reduce dependency on multiple silo-centric tools with limited visibility and mitigate the impact that natural attrition has on tribal knowledge.

For CIOs to achieve these goals and remain future ready they will need universal insight across the enterprise and timely, correlated information that enables data driven decision making. 2015 will be a very interesting year as CIO thought leaders seek to improve the end user experience and enhance productivity within the enterprise.

About eG Innovations

eG Innovations is dedicated to helping businesses across the globe transform IT service delivery into a competitive advantage and a center for productivity, growth and profit. Many of the world’s largest businesses use eG Enterprise Universal Insight technology to enhance IT service performance, increase operational efficiency and ensure IT effectiveness. Visit www.eginnovations.com for more information.

Unified Monitoring, Diagnosis and Reporting of IT Infrastructure Performance with eG Enterprise v6 [LIVE DEMO]


Join the live demo “Unified Monitoring, Diagnosis and Reporting of IT Infrastructure Performance with eG Enterprise v6” on October 9, 2014 at 11am ET | 10am CT | 8am PT | 4pm UK | 5pm CET.

Register now: https://www4.gotomeeting.com/register/971626975

eG Enterprise v6
See live the brand-new release of eG Enterprise v6 – the first intelligent performance monitoring solution designed to simplify the management of today’s complex and distributed IT environments. Find out how eG Enterprise helps you make IT Operations more productive, reduce IT support cost & complexity, and keep your end users happy & productive. During the live demonstration, we will show how you can:

  • Have a single unified solution that addresses your application monitoring, database monitoring, server monitoring, network monitoring, virtualization monitoring, service monitoring and even mobile device monitoring needs;
  • Use intelligent analytics to analyze and correlate performance across the tiers to provide unparalleled speed & ease of proactive alerting, diagnosis & analysis;
  • View best-in-class customizable dashboards that integrate performance metrics to provide real-time role-based and domain-based views on user experience, system and service health, resource consumption, capacity and more;
  • Report on historical performance and trends and analyze usage patterns to right-size and optimize your IT infrastructure for maximum ROI;
  • Address gaps in your current monitoring for Citrix XenApp/XenDesktop, virtual desktop infrastructures (VDI), multi-tier Java applications and heavily virtualized IT environments – in the cloud or on-premise.

Title:  Unified Monitoring, Diagnosis and Reporting of IT Infrastructure Performance with eG Enterprise v6

Registration:  https://www4.gotomeeting.com/register/971626975

Date:  October 9, 2014 at 11am ET | 10am CT | 8am PT | 4pm UK | 5pm CET

Presenters: Bala Vaidhinathan (CTO, eG Innovations), Holger Schulze (VP Marketing, eG Innovations)

  • Are you having to spend hours troubleshooting problems by looking at multiple different tools?
  • Yearn to have a single pane of glass view into your entire IT infrastructure?
  • Wish you could drill down and with just one click determine where the root-cause of a problem lies and call the right expert to get it fixed quickly?

Get your answer on October 9.

Troubleshooting Java Application Deadlocks – Diagnosing ‘Application Hang’ situations


Users expect applications to respond instantly. Deadlocks in Java applications cause ‘application hang’ situations that leave systems unresponsive and result in a poor user experience.

This blog post explains what deadlocks are, what their consequences are and what options exist to diagnose them.

In a subsequent blog post, we’ll explore how the eG Java Monitor helps in pinpointing deadlock root causes down to the code level.

 

A typical production scenario

It is 2 am and you get woken up by a phone call from the helpdesk team. The helpdesk is receiving a flood of calls from application users. The application is reported to be slow and sluggish. Users are complaining that the browser keeps spinning and eventually all they see is a ‘white page’.

Still somewhat heavy-eyed, you go through the ‘standard operating procedure’. You notice that no TCP traffic is flowing to or from the app server cluster. The application logs aren’t moving either.

You are wondering what could be wrong when the VP (Operations) pings you over Instant Messenger asking you to join a war room conference call. You will be asked to provide answers and pinpoint the root cause – fast.

What are Java application deadlocks?

A deadlock occurs when two or more threads form a cyclic dependency on each other, as shown below.

Graphic of deadlocked threads having a circular dependency

In this illustration, ‘thread 2’ is in a wait state waiting on Resource A owned by ‘thread 1’, while ‘thread 1’ is in a wait state waiting on Resource B owned by ‘thread 2’.

In such a condition, these two threads are ‘hanging’ indefinitely without making any further progress.

This results in an “application hang” where the process is still present but the system does not respond to user requests.
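The cyclic wait described above can be reproduced in a few lines. What follows is a minimal sketch, not code from any particular application: the class name `DeadlockDemo`, the thread names and the two lock objects are illustrative assumptions. It starts two threads that acquire the same two locks in opposite order, then asks the JVM's standard `ThreadMXBean` whether any threads are deadlocked.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDemo {

    /**
     * Starts two daemon threads that lock two resources in opposite order,
     * then returns the number of deadlocked threads the JVM reports
     * (0 if none are found within roughly five seconds).
     */
    public static int createAndDetectDeadlock() {
        final Object resourceA = new Object(); // hypothetical shared resources
        final Object resourceB = new Object();
        // Ensures each thread holds its first lock before either asks for the second.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread t1 = new Thread(
                () -> lockInOrder(resourceA, resourceB, bothHoldFirstLock), "thread-1");
        Thread t2 = new Thread(
                () -> lockInOrder(resourceB, resourceA, bothHoldFirstLock), "thread-2");
        // Daemon threads let the JVM exit even though they never finish.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        try {
            for (int attempt = 0; attempt < 100; attempt++) {
                long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
                if (ids != null) {
                    return ids.length;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) { // blocks forever: the other thread owns 'second'
                System.out.println("not reached while the deadlock persists");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads detected: " + createAndDetectDeadlock());
    }
}
```

Neither thread consumes CPU while it waits, and the process stays alive, which matches the 'application hang' symptom. The usual remedy in code is to make every thread acquire shared locks in the same order.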

“The JVM is not nearly as helpful in resolving deadlocks as database servers are.

When a set of Java threads deadlock, that’s the end of the game. Depending on what those threads do, the application may stall completely”

Brian Goetz et al., authors of “Java Concurrency in Practice”

Consequences of deadlocks

1. Poor user experience

When a deadlock happens, the application may stall. Typical symptoms could be “white pages” in web applications while the browser continues to spin, eventually resulting in a timeout.

Often, users retry their request by clicking refresh or re-submitting a form, which compounds the problem further.

2. System undergoes exponential degradation

When threads go into a deadlock situation, the requests they were serving stop completing. In the intervening period, a fresh set of requests may arrive into the system.


When deadlocks manifest in app servers, fresh requests get backed up in the ‘execution queue’. The thread pool reaches maximum utilization, so new requests can no longer be served. This causes further exponential degradation of the system.
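The backlog effect can be illustrated with the JDK's own `ThreadPoolExecutor`. This is a simplified sketch under stated assumptions: real application servers use their own thread pool and execution queue implementations, the class name `PoolExhaustionDemo` and the pool sizes are hypothetical, and a latch that is never released stands in for the deadlock.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDemo {

    /**
     * Submits 'requests' tasks to a fixed-size pool whose workers are all stuck,
     * and returns how many of those requests the pool refused.
     */
    public static int countRejected(int workers, int queueSize, int requests) {
        // Stand-in for a deadlock: workers wait on this latch and make no progress.
        CountDownLatch stuck = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize)); // bounded 'execution queue'
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            stuck.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // All workers are busy and the queue is full: the request is denied.
                    rejected++;
                }
            }
        } finally {
            stuck.countDown();  // release the stuck workers so the demo can end
            pool.shutdownNow(); // discard anything still queued
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("Rejected requests: " + countRejected(2, 3, 10));
    }
}
```

With 2 workers and a queue of 3, the first 2 requests occupy the workers, the next 3 wait in the queue, and the remaining 5 of the 10 are rejected. In a real server those refused or queued requests are what users experience as timeouts.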

3. Cascading impact on the entire app server cluster

In multi-tier applications, Web Servers (such as Apache or IBM HTTP Server) receive requests and forward them to Application Servers (such as WebLogic, WebSphere or JBoss) via a ‘plug-in’.

If the plug-in detects that the Application Server is unhealthy, it will “fail-over” to another healthy application server, which then has to carry a heavier load than usual, resulting in further slowness.

This may cause a cascading slowdown effect on the entire cluster.

Why are deadlocks difficult to troubleshoot in a clustered, multi-tier environment?

Application support teams are usually caught off-guard when faced with deadlocks in production environments.

1. Deadlocks usually do not exhibit the typical symptoms of a performance problem, such as a spike in CPU or memory usage. This makes it hard to track and diagnose deadlocks based on basic operating system metrics.

2. Deadlocks may not show up until the application is moved to a clustered environment. Single application server environments may not manifest latent defects in the code.

3. Deadlocks usually manifest at the worst possible time – under heavy production load. They may not show up under low or moderate load. For the same reason, they are difficult to replicate in a testing environment.

Options to diagnose deadlocks

There are various options available to troubleshoot deadlock situations.

1. The naïve way: Kill the process and cross your fingers

You could kill the application server process and hope that when the application server starts again, the problem will go away.

However, restarting the app server is a temporary fix that will not resolve the root-cause. Deadlocks would get triggered again when the app server comes back.

 

2. The laborious way: Take thread dumps in your cluster of JVMs

You could take thread dumps. On UNIX, you trigger a thread dump by sending the JVM process a SIGQUIT signal (the “kill -3” command); on Windows, you press “Ctrl-Break” in the console.

Typically, you would need to capture a series of thread dumps (for example, 6 thread dumps spaced 20 seconds apart) to infer any thread patterns – a single static thread dump snapshot may not suffice.

If you are running the application server as a Windows service (which is usually the case), it is a little more complicated. If you are running the HotSpot JVM, you could use the jps utility to find the process id and then use the jstack utility to take thread dumps. You can also use the jconsole utility to connect to the process in question.
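Where signals or jstack are awkward to use, a series of dumps can also be captured from inside the JVM through the standard `ThreadMXBean` API. The sketch below is an assumption-laden illustration rather than a recommended tool: the class name `ThreadDumpSeries` is hypothetical, and it presumes you have some way to run code inside the affected JVM (for example, a diagnostic hook your application already exposes).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class ThreadDumpSeries {

    /**
     * Captures 'count' thread dumps of the current JVM, pausing
     * 'intervalMillis' between consecutive dumps.
     */
    public static List<String> capture(int count, long intervalMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            StringBuilder dump = new StringBuilder();
            // Request lock details only where the JVM supports them.
            ThreadInfo[] threads = mx.dumpAllThreads(
                    mx.isObjectMonitorUsageSupported(),
                    mx.isSynchronizerUsageSupported());
            for (ThreadInfo info : threads) {
                // Note: ThreadInfo.toString() prints only the first few stack frames.
                dump.append(info);
            }
            dumps.add(dump.toString());
            if (i < count - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return dumps;
    }

    public static void main(String[] args) {
        // Mirrors the example in the text: 6 dumps spaced 20 seconds apart.
        List<String> dumps = capture(6, 20_000L);
        for (int i = 0; i < dumps.size(); i++) {
            System.out.println("=== Thread dump " + (i + 1) + " ===");
            System.out.println(dumps.get(i));
        }
    }
}
```

Threads that appear as BLOCKED on the same lock in every dump of the series are the ones worth investigating first.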

You would have to forward the thread dumps to the development team and wait for them to analyze the dumps and get back to you. Depending on the size of the cluster, there would be multiple files to trawl through and this might entail significant time.

This is not a situation you want to be in at 2 am when the business team is waiting on a quick resolution.

Manual processes to troubleshoot deadlocks can be time consuming

The manual approach of taking thread dumps assumes that you know which JVM(s) are suffering from deadlocks.

Chances are that the application is hosted in a high-availability, clustered Application Server farm with tens (if not hundreds) of servers.


If only a subset of JVMs are undergoing the deadlock problem, you may not be in a position to precisely know which JVM is undergoing thread contention or deadlocks. You would have to resort to taking thread dumps across all of your JVMs.

This becomes a trial-and-error approach which is laborious and time consuming. While this approach may be viable for a development or staging environment, it is not viable for a business-critical production environment where ‘Mean Time To Repair’ (MTTR) is key.

 

3. The smart way: Leverage an APM

While an APM (Application Performance Management) product cannot prevent deadlocks from happening, it can certainly provide deeper visibility into the root cause, down to the code level, when they do happen.

In the next blog post, we’ll explore how the eG Java Monitor can help provide an end-to-end perspective of the system in addition to pinpointing the root cause for deadlocks.

Stay tuned!

 

About the Author

Arun Aravamudhan has more than 15 years of experience in the Java enterprise applications space. In his current role with eG Innovations, Arun leads the Java Application Performance Management (APM) product line.