Sometimes the simplest questions can provoke the most thought. Recently, I asked an impressive group of industry experts (aka Zenoss developers) a seemingly simple question – what is an event?
It might not surprise you that I got quite a few different answers. Really, how hard can it be (asks this marketing guy)? These geeks have spent years buried in code that translates and interprets events. Seriously, they have built a system that is capable of managing over 100 million events a day! Yes, that’s correct, 100 million events (I say, with my pinky extended into the corner of my mouth, reminiscent of Dr. Evil).
After some discussion, the overall consensus became pretty clear – an event is a change in status. Ok, so that would satisfy Wiki or Webster, but what about our users? When we say that we help them manage events, what do we mean? Well, this took a bit more noodling.
“A change in status”
Yes, in order to help our users manage their infrastructure, we have to collect a lot of changes in status. In fact, we have to collect just about every change in status for just about everything in their environment (Como se dice, Unified Monitoring). And that’s where the 100 million plus events come in. Ok, collecting millions of events is cool, but is that really helping anyone? Are they using Zenoss because they can collect about a billion events a week? No, they use us because, plain and simple, we can sort out the events that matter to them.
We parse through literally everything that’s going on in their infrastructure, and identify the few events that are important to them. This is where a lot of the work that our developers do is focused. We do a great job eliminating event storms and doing nifty stuff like root cause analysis, so our users don’t have to worry about millions of events. In fact (and few actually believe this, until they see it), we can identify and prioritize the one event that is the likely cause of a major outage. That’s pretty cool. So, Zenoss helps users by identifying “significant changes in status.” Needless to say, we've spent a lot of time defining what significant is.
“A significant change in status”
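For the geeks in the audience, here is a rough sketch of the general idea of taming an event storm: drop the low-severity noise and collapse identical repeats into a single event with a count. This is only an illustration, not how Zenoss actually does it. The `RawEvent` type, the `collapse_storm` function, and the 0-to-5 severity scale are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawEvent:
    """A hypothetical raw change in status reported by one element."""
    element: str
    status: str
    severity: int  # assumed scale: 0 = clear ... 5 = critical


def collapse_storm(events, min_severity=3):
    """Drop events below min_severity and collapse identical repeats.

    Returns a list of (event, repeat_count) pairs in first-seen order.
    """
    counts = {}  # dicts keep insertion order in Python 3.7+
    for ev in events:
        if ev.severity < min_severity:
            continue  # not significant under our assumed cutoff
        counts[ev] = counts.get(ev, 0) + 1
    return list(counts.items())


# A storm of 1,000 identical "down" events plus one harmless status change.
storm = [RawEvent("web01", "down", 5)] * 1000 + [RawEvent("web02", "cpu ok", 0)]
for event, repeats in collapse_storm(storm):
    print(event.element, event.status, "seen", repeats, "times")
```

In this toy example, 1,001 changes in status boil down to one line a human actually has to read. Real root cause analysis is a much harder problem than counting duplicates.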
Still, that doesn’t seem to tell the whole story. Because we talk a lot about Service Assurance and Service-Orientation, service still has to be part of the definition. Actually, speaking with our developers, this service stuff is what's keeping many of them up at night. Identifying a significant event for an element isn’t rocket science. Anyone can set a notification threshold to critical if a server’s CPU utilization reaches 90%. But what is 90% utilization for a service? Admittedly, that might not be so hard if all of the elements in said service (e.g. blades, storage, network gear, etc.) were 100% dedicated. However, it gets foggier when some are shared, and it gets really hairy when those elements move around a lot (thanks, vMotion). Since our developers spend so much time on it – and since our customers seem to value it so much – the addition of service is needed for this definition.
“A significant change in status of a service”
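To make the element-versus-service gap concrete, here is a small sketch. The element check is the easy 90% threshold mentioned above. The service number is just one possible way to roll things up, weighting each element by the share of it that is assumed to be dedicated to the service. Both function names and the weighting scheme are hypothetical and are not taken from Zenoss.

```python
def element_is_critical(cpu_percent, threshold=90.0):
    """The easy case: one element, one fixed threshold."""
    return cpu_percent >= threshold


def service_utilization(elements):
    """Share-weighted average utilization for a service.

    `elements` is a list of (cpu_percent, share) pairs, where `share`
    is the assumed fraction (0.0 to 1.0) of that element dedicated to
    the service. This weighting is an illustrative assumption only.
    """
    total_share = sum(share for _, share in elements)
    if total_share == 0:
        raise ValueError("service has no dedicated capacity")
    weighted = sum(cpu * share for cpu, share in elements)
    return weighted / total_share


# One fully dedicated blade running hot, plus two half-shared blades.
service = [(95.0, 1.0), (40.0, 0.5), (60.0, 0.5)]
print(element_is_critical(95.0))     # the hot blade trips its own threshold
print(service_utilization(service))  # the service as a whole looks calmer
```

One element is screaming while the service, by this particular measure, is not. Whether that counts as significant depends entirely on how you define the service, and the shares themselves change every time a workload moves.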
But wait, while IT services are king, nuts & bolts elemental monitoring still hasn’t gone away. At the end of the day, when a service has a problem, users will inevitably have to drill down to its component elements. I guess we have to throw that aspect in too. This seemingly basic question was a lot more complicated than I had imagined.
Ok, so here’s the result…. Drumroll, please… an event is….
"A significant change in status of an element or service"
Now, this doesn’t include Zenoss events that incorporate BBQ and speedboats; that's a whole different type of event.
What are your thoughts on this? How do you define an event in your environment?