June 7th's webinar, Eliminating Event Storms, brought back Forrester consultant Jean-Pierre Garbani to discuss the ramifications of having too many monitoring tools. Rather than focusing on the numbers, JP addressed the consequences of being unable either to get monitoring tools to work together or to glean actionable knowledge from the firehose of data gushing through these various monitoring pipelines.
And JP used a harrowing example to prove his point. According to the National Commission on the BP Deepwater Horizon Oil Spill, all the information needed to prevent the explosion of the Deepwater Horizon rig had been reported on its event console:
- It was relying on a person to actually recognize that there was a problem...tens of thousands of events were reported, and it would [have taken] a genius to actually understand how to piece these different events together so that [they would have known] there was going to be a problem in a few hours or minutes.
In other words, the 11 deaths on the rig, the destruction of wildlife habitats, the ruination of hundreds of fishing- and tourist-based livelihoods, and the health consequences for humans and animals alike could have been prevented with better monitoring tools. Better monitoring tools. Just let that sink in for a minute. If this was true three years ago, it is even truer today as the data we collect grows exponentially in amount and in complexity.
According to JP, the Commission even recommended in its report that BP et al. buy monitoring tools that do a better job of managing events. But as JP then pointed out:
- Here in IT we buy roughly $25 billion in IT management tools every year. And guess what? We're still in the same situation.
Why? Because monitoring tools alone won’t bring about a solution. Even if your monitoring tools were able to work together in a coherent fashion (although the Forrester study shows that this is rarely the case), you’re still going to have potentially catastrophic problems if the various departments of your IT organization don’t share a common language to respond to these so-called events.
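To make the idea of a "common language" for events a bit more concrete, here is a minimal sketch, in Python, of what translating two tools' alerts into one shared schema might look like. Everything in it, the field names, the tool names netmon and srvmon, and the severity scale, is a made-up assumption for illustration only; it is not anything JP, Forrester, or Zenoss prescribes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical shared schema: every tool's alert gets translated into this
# shape before anyone across the silos is asked to act on it.
@dataclass
class NormalizedEvent:
    source: str            # which monitoring tool raised the event
    component: str         # host, switch, application, etc.
    severity: int          # 0 = info ... 5 = critical, agreed on by every team
    summary: str           # human-readable description
    occurred_at: datetime  # always UTC, so timelines can actually be compared

def from_network_tool(raw: dict) -> NormalizedEvent:
    """Translate a made-up network tool's alert format into the shared schema."""
    return NormalizedEvent(
        source="netmon",
        component=raw["device"],
        severity={"warn": 2, "crit": 5}.get(raw["level"], 1),
        summary=raw["msg"],
        occurred_at=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
    )

def from_server_tool(raw: dict) -> NormalizedEvent:
    """Translate a made-up server tool's alert format into the same schema."""
    return NormalizedEvent(
        source="srvmon",
        component=raw["hostname"],
        severity=int(raw["priority"]),
        summary=raw["description"],
        occurred_at=datetime.fromisoformat(raw["timestamp"]).astimezone(timezone.utc),
    )
```

The point isn't the code itself; it's that the translation step forces the network, server, and application teams to agree up front on what a severity-5 event means and whose clock the timeline follows.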
Forestalling Future Crashes / Speaking the Same Language
After this webinar, I spoke briefly with Deepak Kanwar of Zenoss about the themes JP brought up, and Deepak reminded me about Avianca Flight 52. This Colombian airliner crashed outside of New York City back in 1990. The official reason? The plane ran out of fuel. But plenty of planes running out of fuel have made safe landings. The ultimate cause for this plane’s demise was a language barrier.
In trying to secure a priority landing, the Colombia-based flight crew told air traffic control that they were running out of fuel but never used the word “emergency.” The English-speaking controllers, accustomed to hearing that planes were “running out of fuel,” didn’t realize this plane was in real trouble until it was too late to prevent the crash. Without the key word “emergency,” the context simply wasn’t there.
JP spoke of a similar “language barrier” between the various IT departments and the monitoring tools used within them:
- The result is that we have this impossible integration not only between products but also between events. Products integrations [are] mostly impossible because they are using different timelines, different technologies to collect information. Normalization is almost impossible, so integration is difficult. But integration between different events from a human standpoint, as we have shown with the [Deepwater] drilling platform, that's almost impossible after a certain level of volume.
This blog has harped on the need to break down the walls of confusion among IT silos for quite a while, but that need becomes critical as complexity within data centers grows too great for even a “genius” human to organize in a way that can be understood and acted on. Hopefully this communication disconnect won’t lead to death and environmental devastation as it did for BP, but you do risk inconveniencing your customers and ultimately crippling your business, which is painful in its own right.
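To illustrate what “piecing events together” could mean at the tooling level, here is a small, illustrative sketch that collapses a storm of raw events into a short digest a person could actually review. It assumes event objects shaped like the NormalizedEvent sketch above; it is not how Zenoss Service Impact or any other product actually implements event correlation.

```python
from collections import Counter

def summarize_storm(events, top_n=10):
    """Collapse a flood of raw events into a short digest a human can act on.

    `events` is any iterable of objects with `component`, `summary`, and
    `severity` attributes (for example, the NormalizedEvent sketch above).
    """
    counts = Counter()  # how many times each (component, summary) pair repeats
    worst = {}          # the highest severity seen for each pair
    for e in events:
        key = (e.component, e.summary)
        counts[key] += 1
        worst[key] = max(worst.get(key, 0), e.severity)

    # Rank groups by severity first, then by noisiness, and keep only the
    # handful of lines a person could reasonably read during an incident.
    ranked = sorted(counts, key=lambda k: (worst[k], counts[k]), reverse=True)
    return [
        f"{component}: {summary} (x{counts[(component, summary)]}, max severity {worst[(component, summary)]})"
        for component, summary in ranked[:top_n]
    ]
```

Ten summarized lines won’t catch everything, but they are a lot closer to something a human can reason about than tens of thousands of raw events.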
Participation vs. Exoneration
I realize I’m really pushing doom-and-gloom scenarios to prove my point. But every time your most critical business application unexpectedly goes down, you stand to lose beaucoup bucks, both from customers who can’t access that app and from having to divert IT professionals from their normal tasks to resolve the problem. As JP noted:
- We have built the IT organization on the assumption that it's working, so any problem, any incident is part of extra, unscheduled, unplanned work...Think what it could do to your budget if you could get those hours back.
But even the best tools are only as good as the people who use them. For his part, JP worries that IT professionals are too reactive when tackling these seemingly intractable problems:
- It's as if each time we have an issue and we realize we didn't have the right tool, we turn around and we go buy the “right tool,” so we end up with a lot of tools. Because the problems are in different silos of expertise within IT, we tend to have a bunch of things in network management, server management, application performance etc., and none of these things are actually talking to [one another] or bringing correlated or filtered information that we can piece together.
JP also brought up a point that too often gets overlooked, one I haven’t seen discussed elsewhere (although if you have, please post in the comments!):
- We constantly try to get together when there is a problem and try to find a solution amongst ourselves, but end up spending most of the time trying to exonerate ourselves.
This combination of reactivity and the need to CYA can hamper a good monitoring solution, even if you’re using the best tools out there. According to JP, the urge to cover yourself has a lot to do with a lack of trust between the different silos:
- If I am in server infrastructure and management, I may not trust my colleague in network management or infrastructure, and I may find some sort of little product that will tell me a bit about how the network is affecting the server and how do I consume the bandwidth.
Scenarios like the one JP describes lead to multiple tools that perform many of the same tasks in a language that people in a given silo understand but that doesn’t translate across silos, let alone provide consistent information. But as I’ve already discussed, most businesses can’t afford this approach. I doubt anyone who worked on the IT infrastructure for the Deepwater Horizon rig is going around saying, “It’s not my fault. I work in application development!” Ultimately, you are part of your business, and for the sake of that business, as well as others, it’s time to change “Not my fault” to “How can we work together?”
If you want to drill deeper into this topic, I highly suggest watching the full webinar below, which is now on Zenoss’ YouTube channel.
http://youtu.be/I2oXkyvqrkU
In addition to JP’s analysis, Zenoss Senior Systems Engineer Michael DeSimone runs a cool demo of how Zenoss Service Impact identifies event storms and alerts you to them.