Garbage In, Garbage Out!

How is this relevant to a monitoring and management system? Very relevant, as it turns out.

One of the primary ways that enterprises assess whether a monitoring solution is useful is by the number of times it alerted them to a problem before they had a catastrophic failure. Another important measure is whether, when they did have a problem, the monitoring solution provided additional information that helped them drill down to the root cause, thereby saving administrators the hours they would otherwise have spent troubleshooting. A truly effective monitoring solution is one that scores high on both these points.
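To make these two criteria concrete, here is a minimal sketch of how they might be tallied from an incident log. The `Incident` record, its field names, and the sample history are all hypothetical, made up purely for illustration; a real assessment would draw on your own incident records.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record of one production problem.
    alerted_before_failure: bool  # did the monitor warn before the failure?
    root_cause_pinpointed: bool   # did its data lead to the root cause?


def score(incidents):
    """Return (early-warning rate, diagnosis rate) as fractions of all incidents."""
    total = len(incidents)
    if total == 0:
        return 0.0, 0.0
    early = sum(i.alerted_before_failure for i in incidents)
    diagnosed = sum(i.root_cause_pinpointed for i in incidents)
    return early / total, diagnosed / total


# Made-up history: four incidents with mixed outcomes.
history = [
    Incident(True, True),
    Incident(True, False),
    Incident(False, True),
    Incident(False, False),
]
early_rate, diagnosis_rate = score(history)
```

A solution that scores well on only one of the two rates is either warning you without helping you fix anything, or helping you diagnose failures it never saw coming.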

When deciding on a monitoring system, administrators very often focus on the user interface – the jazzy look and feel, pretty graphs and sizzling dials – but not on the specific metrics that the system can collect. Ease of use and a pretty interface are important, but they rarely help justify the ROI from a monitoring solution. It is the intelligence embedded in the monitoring agent technology that matters.

Often, we hear enterprises say, "We have monitoring agents already on our servers. Can you use the metrics from these agents?" This makes sense from the enterprise’s perspective: they don’t want to throw away their investment in monitoring technology. The common perception is that every agent is the same and that they all collect the same metrics.

However, the reality is very different. To be truly effective, an agent should have a lot of domain expertise embedded into it. The metrics to be collected should be based on real-world experience with the technologies being monitored. Often, this is not the case. Monitoring software developers tend to start with what metrics are easily obtainable – e.g., what is the SNMP MIB, what perfmon counters are available. In fact, it is fairly easy to collect hundreds or thousands of metrics about a server (one popular management solution from the late 1990s/early 2000s used to be priced per metric collected!). The hard part is finding the few critical metrics that are indicative of a majority of problems with the server.
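One way to picture the search for "the few critical metrics" is as a coverage problem: given a record of which metrics flagged which past incidents, pick a small set that together would have caught most of them. The sketch below uses a simple greedy selection. This is an illustrative technique, not a method the post prescribes, and the metric names and incident ids are hypothetical.

```python
def pick_critical_metrics(incident_signals, max_metrics):
    """Greedily choose up to max_metrics metrics that cover the most incidents.

    incident_signals maps a metric name to the set of incident ids
    that the metric flagged. Returns (chosen metrics, uncovered incident ids).
    """
    uncovered = set().union(*incident_signals.values()) if incident_signals else set()
    chosen = []
    while uncovered and len(chosen) < max_metrics:
        # Pick the metric that flags the most still-uncovered incidents.
        best = max(incident_signals, key=lambda m: len(incident_signals[m] & uncovered))
        gained = incident_signals[best] & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical data: which metric flagged which of four past incidents.
signals = {
    "cpu_util": {1, 2},
    "disk_queue_length": {2, 3, 4},
    "handle_count": {4},
    "page_faults": {1},
}
chosen, missed = pick_critical_metrics(signals, max_metrics=2)
```

The selection is only as good as the incident history behind it, which is where domain expertise comes in: someone has to know which metrics were worth recording against each incident in the first place.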

You can be sure that the application/platform developers who developed the application MIBs, perfmon counters, etc. did not think about how these would be used. Monitoring and management is always an afterthought in the software development process. Hence, relying on metrics exposed by the application developer doesn’t work too well, at least for the first three or four revisions of the application.

So when choosing a monitoring solution, look for the degree of intelligence and expertise embedded into the agent for monitoring the applications or infrastructure devices you are looking to monitor. You can be sure that if you don’t pick the right agent technology, it will be a case of garbage in (to the monitoring system) and garbage out (of the monitoring system).

We have found two sources that provide the best insights into which metrics are important for each application: (i) the consulting team of the application vendor: these folks face the music in production environments and hence have often come up with scripts and other means of extracting the key metrics that matter as they try to keep customer environments working effectively; (ii) a proof of concept (POC) involving the application of interest. Successful vendors look at a POC as a two-way street. While a POC gives the prospect a feel for the monitoring solution, it gives the vendor access to a real-world environment and its problems. The success of a vendor depends on how well the feedback from a POC is reflected in future product enhancements.

The vendors that succeed are the ones that focus on acquiring domain expertise and incorporating it into their products.

One thought on “Garbage In, Garbage Out!”

  1. Dougie!!! March 31, 2010 / 10:34 am

    Amazing how much GIGO is really out there! I totally agree, anyone can collect metrics – not everyone can turn that into intelligence.

    When approaching correlation, I tend to look at the world as a set of problem sets. How do I navigate these problem sets and present timely, valuable intelligence to folks in a repeatable manner? How can I use that information to positively affect decisions, services, and culture?

    Understanding the problem set is different from understanding how to get metrics or count them. And yes, it does require domain knowledge and expertise. In delving into new domains and correlation, it's like you have to teach the software about the problem domain. And you never really learn until you TEACH!
