When I shop for anything, whether it’s a book, air-conditioning filters, or dog toys, I go to Amazon first. Why? Because it’s easy, it’s reliable, and it’s fast. I pay $79 per year for Amazon Prime, which guarantees two-day shipping, and frequently my stuff comes the next day. I know how to navigate its website, and I can’t remember a time in the last decade when I had to wait for a page to load.
Of course, Amazon’s website is its business. Every second of the day, the site handles an average of $2,200 in sales (based upon annual revenue of $70 billion). With stakes that high, there is no room for disruption or even degradation. The company has systems in place to maintain as close to 100% availability as possible. Without a doubt, it has infrastructure, processes, and tools that work together to prevent issues.
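The per-second figure is simple arithmetic, and it is worth running with your own numbers. Here is a minimal sketch that assumes sales are spread evenly across the year, which real traffic never is. The function names and the ten-minute outage are illustrative, not taken from any Amazon data.

```python
# Naive estimate: assumes revenue arrives at a constant rate all year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignores leap years)


def revenue_per_second(annual_revenue_usd: float) -> float:
    """Average revenue handled per second for a given annual revenue."""
    return annual_revenue_usd / SECONDS_PER_YEAR


def revenue_at_risk(annual_revenue_usd: float, outage_seconds: float) -> float:
    """Average revenue that would normally flow during an outage of this length."""
    return revenue_per_second(annual_revenue_usd) * outage_seconds


per_second = revenue_per_second(70e9)           # the $70 billion cited above
ten_minutes = revenue_at_risk(70e9, 10 * 60)    # hypothetical 10-minute outage

print(f"Average per second: ${per_second:,.0f}")
print(f"Average per 10-minute outage: ${ten_minutes:,.0f}")
```

Swap in your own annual revenue to get a rough sense of what a minute of downtime puts at risk. Treat the result as an average, not a forecast: an outage during a peak period costs far more than one at 3 a.m.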
It helps that Amazon and Google have seemingly unlimited funds to build in ridiculous levels of redundancy and capacity to ensure their customers always have a smooth experience. A handful of other companies may have the same luxury, and if yours is one of them, you can stop reading now.
But if you’re responsible for IT ops for the other 99% of companies that must adhere to limited budgets, let’s chat about how you currently manage your environment.
How do you handle problems in your data center?
If you’re like many IT organizations, not well. Forrester just released a Zenoss-commissioned survey of 157 IT professionals that showed (among other things):
- Availability and performance issues often are a daily occurrence;
- Service-centric problems often aren’t discovered until end users complain to help desks;
- Although monitoring tools are plentiful, they don’t pinpoint problems; and
- Identifying the root causes of problems takes too much time and too many resources.
Much of the problem has to do with increasingly complex hybrid IT environments. Of these respondents:
- 62% work in data centers that have a minimum of 200 physical and virtual servers.
- 65% have a minimum of 200 network components like routers and switches.
- 70% run a minimum of 20 applications and services.
- 68% have a minimum of 20 storage components.
- 72% have their resources distributed across five or more physical locations, with 43% distributed across 11 or more locations.
To say IT environments have gotten unwieldy is an understatement. King Minos’s labyrinth would be an easier slog than the average data center today. No wonder so many organizations find themselves in an unending cycle of reactive IT monitoring. They are constantly fighting fires to keep the lights on and pushing proactive initiatives to the back burner. Unfortunately and unsurprisingly, that approach is just NOT working! The Forrester study reveals that 34% of the organizations surveyed experience availability and performance issues every day, and that each issue requires, on average, four to six professionals spending over an hour to determine its cause.
Not a Millisecond to Lose
Our highly competitive business climate has gotten less and less forgiving of missteps. Not only do you need a better toolset to put out these fires, you want to put practices in place to prevent those fires from starting or spreading. You need techniques that help you anticipate specific issues and then quickly figure out the root cause of a problem before that problem burns your customers and causes them to lose trust and move elsewhere.
Today’s users have a different mindset. Forget disruption; they won’t even tolerate degradation. A decade ago you could probably get away with small amounts of downtime or page-load lag, but not anymore. Back in March 2012, The New York Times printed an article about website page-load times:
-
Remember when you were willing to wait a few seconds for a computer to respond to a click on a Web site or a tap on a keyboard? These days, even 400 milliseconds — literally the blink of an eye — is too long, as Google engineers have discovered. That barely perceptible delay causes people to search less.
-
People will visit a Web site less often if it is slower than a close competitor by more than 250 milliseconds (a millisecond is a thousandth of a second) [emphasis mine].
Website speed is a differentiator, and you can’t afford to have a lag in service if you want to have a shot at, let alone keep, a customer.
As Chuck Priddy, Zenoss’ senior product manager, put it:
-
You risk losing sales because customer expectations about how long they’re willing to wait before going to a competitor’s website is now a matter of seconds, if that. You get one shot, and if you blow it because, say, your website stalled out, they’re not going to come back. And if you show up the next time they search for something, they’re thinking, “That site is slow and buggy. I’m not going to waste my time.” You’ve lost them forever.
In the “good old” days of brick and mortar, you hired friendly sales staff and created an overall ambience that made your buyers feel good. They bought that TV because they always bought their electronics from Ann, the ever-smiling salesperson who always got them the “best deal”. And if Ann wasn’t there on Monday, they came back on Tuesday. Now, with 15 different websites selling the same TV, all at the “best deal”, you need to recreate “Ann”. For many retailers, Ann is now the website: it is what customers experience when they navigate your site. It had better be up all the time, and it had better be “smiling”. Taking a reactive approach to operations monitoring just won’t cut it anymore.
Preventive vs. Reactive Operations
Although no one, not even Amazon, can prevent all outages or performance issues, you can do a lot to minimize them. If your customers cannot access the information they are seeking when they are seeking it, they are gone. It really does not matter to them that your IT infrastructure is very complex and that troubleshooting can take time, or that you have to pull in six SMEs to fix an issue. Your reaction time has already cost you that customer, today and maybe forever.
It makes more sense to take preventive measures whenever you can.
Said Chuck:
-
You can pay the mechanic now or later. You can change your oil on time, which is a pain and will cost you $60-$70 to do, but contrast that with my engine throwing a rod and costing $2,000 for a repair.
You’re always going to have components fail that you don’t know about, but that’s why you incorporate redundancies into your architecture. But as Chuck said:
-
It’s much better to say, “You know what? On average, my disk drives, according to the manufacturer and based on our own experience, tend to start falling apart after 10 months.” So it makes sense to replace them after eight months.
Of course, you identified that 10-month failure point by mining your historical data, which is to say, by conducting analytics. Identifying trends and patterns is a critical part of moving your IT ops from a reactive to a preventive mode.
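To make the disk-drive example concrete, here is a minimal sketch of that kind of analysis. Everything in it is hypothetical: the failure ages, the drive names, and the two-month safety margin are made up for illustration, and a real analysis would draw on your own incident and asset records.

```python
from statistics import mean

# Hypothetical ages (in months) at which past drives failed; not real data.
failure_ages_months = [9, 10, 11, 10, 12, 8, 10, 11, 9, 10]


def replacement_age(failure_ages, safety_margin_months=2):
    """Suggest a replacement age: the average failure age minus a safety margin."""
    return max(mean(failure_ages) - safety_margin_months, 0)


def drives_due(drive_ages, threshold_months):
    """Return the names of drives at or past the replacement threshold."""
    return sorted(
        name for name, age in drive_ages.items() if age >= threshold_months
    )


threshold = replacement_age(failure_ages_months)

# Hypothetical current fleet: drive name -> age in months.
fleet = {"disk-a": 3, "disk-b": 8, "disk-c": 9.5, "disk-d": 7}
due = drives_due(fleet, threshold)

print(f"Replace drives at {threshold} months; due now: {due}")
```

An average is a crude yardstick: in the made-up data above, one drive had already failed by the eight-month mark. If a failure is expensive, basing the threshold on the earliest failures you have seen, rather than the typical one, is the more conservative choice.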
So how can you spot events of significance in a timely manner so you can actually do something about them? What can your IT operations data tell you that can improve such critical tasks as capacity planning and infrastructure optimization?
That will be the topic of my next post, so stay tuned!