I was thinking about policies vs. rules the other day, while “laying down the law” for my two boys (ages 5 & 6). “OK, new rule,” I barked, as I started to tell them what time of day they can pull eggs from our chicken coop. Immediately, my 6-year-old started to challenge – “but suppose we have to leave early?” Naturally, being an inquisitive child, he had about a dozen hypothetical situations that challenged my hard-and-fast rule. I started to counter each one of them by adding amendments to the rule. When I tired of the exercise, however, I fell back on a general policy – “try to check for eggs at the same time every day… and only check for eggs once a day.” This way, I established an expected behavior that wasn’t broken every time an unforeseen condition arose.
IT Operations is essentially stuck in the same dilemma with many of today’s systems management frameworks. The original developers of these systems designed them at a time when there wasn’t as much change (e.g. server migrations often took weeks instead of minutes). It made complete sense for them to implement a system of customizable rules (e.g. event filtering, enrichment, etc.). So, if someone wanted to define what constitutes degraded vs. down for a specific type of server, they’d write a rule for it. And if they wanted to suppress a type of alert from a specific system, they could write a rule for that too, and so on. Yes, making rules can be fun. But maintaining them can be a complete nightmare!
Flash-forward a decade or two, and we’re left with organizations trying to maintain systems with hundreds or thousands of these rules. If they stop updating the rules to keep pace with change, their management consoles begin to light up with event storms and meaningless alerts. The biggest problem with these rules is their static nature. Most of these systems require that elements within the rules (e.g. a server’s IP address) be spelled out explicitly. Migrate that server and your rule is broken. Peruse a few event management forums or mailing lists, and you’ll see that maintaining these static rules is a source of many headaches, or of job security, depending on your perspective.
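To make the problem concrete, here is a minimal Python sketch of a static suppression rule. It is not taken from any particular product: the `Event` class, the `should_suppress` function, and the IP addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source_ip: str
    message: str


# Hypothetical static rule: the server to suppress is identified
# by an IP address spelled out explicitly in the rule itself.
SUPPRESSED_IPS = {"10.0.0.15"}


def should_suppress(event: Event) -> bool:
    """Return True if the event comes from a hard-coded address."""
    return event.source_ip in SUPPRESSED_IPS


# The same server, before and after a migration changes its address.
before_migration = Event("10.0.0.15", "disk latency high")
after_migration = Event("10.0.8.42", "disk latency high")

print(should_suppress(before_migration))
print(should_suppress(after_migration))
```

The rule matches the first event but not the second, even though nothing about the server’s role has changed. Someone has to notice and edit the rule by hand, and that is for one rule out of potentially thousands.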
While I won’t contend Zenoss’ developers are any smarter than the original developers of these legacy event management tools, they did have the luxury of developing a system that was built for the cloud era. So, in light of the fact that servers can and will be migrated in minutes, they assumed that everything will change – and it will change often. Rather than building Zenoss on a system of rules, they built it with more flexible policies. Policies are based on the Zenoss Service Dynamics Model, which maintains parent-child relationships between nodes (somewhat like the connections in a neural network). While there are ways of locking down more static behavior for a specific element, it’s assumed that one wouldn’t. It’s also assumed that everything will change within the system, and that policies should hold in spite of that change.
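As a rough illustration of the idea (and only that: this is not Zenoss code or its API), a policy can be attached to a parent in a tree and inherited by whatever sits beneath it. The `Node` class, its fields, and the policy keys below are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical element in a parent-child model (service, server, ...)."""
    name: str
    parent: Optional["Node"] = None
    policy: Optional[dict] = None          # set only where a policy is defined
    attrs: dict = field(default_factory=dict)

    def effective_policy(self) -> dict:
        """Walk up the tree and return the nearest policy that is defined."""
        node = self
        while node is not None:
            if node.policy is not None:
                return node.policy
            node = node.parent
        return {}


# The policy lives on the service, not on any individual server.
service = Node("web-service", policy={"suppress_latency_alerts": True})
server = Node("web-01", parent=service, attrs={"ip": "10.0.0.15"})

# Migrating the server changes its address but not its place in the model.
server.attrs["ip"] = "10.0.8.42"
print(server.effective_policy())

# A specific element can still be locked down with its own override.
pinned = Node("web-02", parent=service, policy={"suppress_latency_alerts": False})
print(pinned.effective_policy())
```

Because the lookup depends on relationships rather than addresses, the migration requires no change to the policy. The override on `web-02` shows the "locking down" escape hatch, which in this sketch is simply a policy defined closer to the element.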
What does this mean? Well, to put this into context, we had one customer transition from supporting a framework with 1,900 rules to seven Zenoss policies, making their system a lot easier to maintain. Also, since policies can be aligned to services, organizations can define them by service level (e.g. the Gold service level is considered “degraded” at lower rates of utilization than the Silver service level), but I digress.
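For the curious, the service-level idea can be sketched in a few lines of Python. The threshold values and the names `DEGRADED_THRESHOLD` and `status` are assumptions invented for this example, not figures or identifiers from Zenoss.

```python
# Hypothetical utilization thresholds (fraction of capacity) at which
# each service level is considered degraded. Gold trips earlier than Silver.
DEGRADED_THRESHOLD = {"gold": 0.70, "silver": 0.85}


def status(service_level: str, utilization: float) -> str:
    """Classify a service as 'degraded' or 'ok' for its service level."""
    if utilization >= DEGRADED_THRESHOLD[service_level]:
        return "degraded"
    return "ok"


# The same utilization reads differently depending on the service level.
print(status("gold", 0.75))
print(status("silver", 0.75))
```

At 75% utilization the Gold service is reported as degraded while the Silver service is still fine, and neither definition mentions a specific server.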
At the end of the day, the larger issue at hand is that we need to learn to be more flexible than rigid. We need to weigh the cost of maintaining complexity within any system as we add it. As for my boys, they need to check our chicken coop for eggs once a day, instead of continually pestering those poor hens to see which one of them can collect more eggs.