In an email interview with me, @roidrage (AKA Mathias Meyer) said that most traditional monitoring solutions fail to provide the data visibility needed to correlate problems efficiently:
I have a hard time pointing fingers here, but you can pick almost any of the traditional monitoring tools, look at screenshots and see what I mean. Most of them were not buil[t] with aesthetics and efficient data representation in mind. They were just built to present some data, in one way or the other.
Meyer lists three reasons why these tools have become obsolete for present-day operations:
First, their user interfaces were usually not built with a user in mind. They were supposed to be simple tools with efficient user interfaces. But what efficient really means in this context is hard to put a finger on.
Second, they started having problems keeping up with the ever-growing and moving nature of today’s infrastructure, not having been built with lots of moving parts in mind and also have in one way or the other problems keeping up with bigger sites and a large number of machines.
Third, most tools don’t really know what to do with the data they collect, or they don’t collect enough data to make it really useful. They can’t collect more data because they just weren’t built with that in mind, so a lot of them are having a hard time scaling up with the amount of data, which goes hand in hand with the inflexibility of handling ever-growing infrastructure.
Meyer adds that this need for visibility isn’t limited to operations:
Visibility is important to everyone on a team and in the company. So you see new development influences entering the monitoring tool chain just like operations adopt new techniques from development to improve monitoring. The same is true for the business side of things who want slightly different visibility into the app. But in all, everything needs to be easy to correlate for everyone if necessary.
In other words, visibility is key to making #monitoringsucks a thing of the past. But how do we obtain that level of visibility, given the “ever-growing and moving nature” of today’s infrastructure?
In his previously referenced post, “Monitoring Sucks. Do Something About It,” @obfuscurity (AKA Jason Dixon) suggests that we consider viewing our monitoring system as a “collection of related services” rather than “the sum of its parts”:
Now re-imagine monitoring as a collective of dedicated pieces, with their own API and standardized data format. There are collection agents to run checks and gather metrics; a storage engine to aggregate data and respond to queries; a messaging bus to queue and deliver information between components; a state engine to index thresholds, recognize alertable conditions and track alert states; a notification service to page on alerts, handle escalations and scheduling; and of course, a front-end to view dashboards and historical trends.
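To make the architecture Dixon describes a little more concrete, here is a minimal Python sketch of those dedicated pieces wired together in a single process. It is only an illustration under stated assumptions, not Dixon's design or any real tool's API: every name in it (Metric, MessageBus, StorageEngine, StateEngine, NotificationService, collect, run_demo) is hypothetical, the "standardized data format" is assumed to be a small record of host, metric name, value and timestamp, and the front-end is left out entirely.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
import time


@dataclass(frozen=True)
class Metric:
    """Hypothetical standardized data format shared by every component."""
    host: str
    name: str
    value: float
    timestamp: float


class MessageBus:
    """Queues metrics and delivers them to every subscribed component."""

    def __init__(self):
        self._queue = deque()
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, metric):
        self._queue.append(metric)

    def deliver(self):
        # Drain the queue in arrival order, fanning out to all subscribers.
        while self._queue:
            metric = self._queue.popleft()
            for handler in self._subscribers:
                handler(metric)


class StorageEngine:
    """Keeps metric history and answers simple queries."""

    def __init__(self):
        self._series = defaultdict(list)

    def store(self, metric):
        self._series[(metric.host, metric.name)].append(metric)

    def query(self, host, name):
        return [m.value for m in self._series.get((host, name), [])]


class NotificationService:
    """Records outgoing pages; a real service would also handle escalation."""

    def __init__(self):
        self.sent = []

    def notify(self, message):
        self.sent.append(message)


class StateEngine:
    """Holds thresholds, recognizes alertable conditions, tracks alert state."""

    def __init__(self, thresholds, notifier):
        self._thresholds = thresholds  # assumed shape: metric name -> max value
        self._notifier = notifier
        self._alerting = set()

    def evaluate(self, metric):
        limit = self._thresholds.get(metric.name)
        if limit is None:
            return
        key = (metric.host, metric.name)
        label = f"{metric.host} {metric.name}={metric.value}"
        if metric.value > limit and key not in self._alerting:
            # Notify once on the transition into the alert state.
            self._alerting.add(key)
            self._notifier.notify(f"ALERT {label}")
        elif metric.value <= limit and key in self._alerting:
            self._alerting.discard(key)
            self._notifier.notify(f"RECOVERED {label}")


def collect(bus, host, name, value):
    """Stand-in for a collection agent: wraps a reading and publishes it."""
    bus.publish(Metric(host, name, value, time.time()))


def run_demo():
    bus = MessageBus()
    storage = StorageEngine()
    notifier = NotificationService()
    state = StateEngine({"load": 4.0}, notifier)
    bus.subscribe(storage.store)
    bus.subscribe(state.evaluate)
    for value in (1.5, 6.0, 7.0, 2.0):
        collect(bus, "web1", "load", value)
    bus.deliver()
    return storage.query("web1", "load"), notifier.sent
```

Because the storage and state engines only ever see Metric records arriving over the bus, either one could in principle be swapped out without touching the collection agent, which is the kind of decoupling the quote argues for.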
Given the way we’re moving toward these dynamic and heterogeneous infrastructures, it makes sense to think of monitoring in a similar fashion. And that seems to be where the #monitoringsucks movement is heading!