Many Zenoss customers are using Amazon Web Services (AWS) to support a good chunk of their production operations. Zenoss is, too.
In this post, we’ll share with you some of the lessons we’ve learned in how to deliver highly reliable application services hosted on Amazon-supplied resources.
A Wealth of Experience Operating Amazon Instances
Zenoss extensively uses Amazon resources to support customers and develop code.
The servers hosting our Zenoss-as-a-Service (ZaaS) customers all run on AWS instances. Customers get round-the-clock monitoring managed by Zenoss staff, and we use a dedicated Zenoss instance to monitor our ZaaS hosts.
Our engineering team typically has several hundred AWS instances running to enable developers and staff to develop against private servers. During scale testing, we can spin up many, many more instances.
Most Frequent Service Problems
With all this experience, we’ve learned that service problems fall into four big categories:
- Application failures
- General failure to respond
- Guest OS issues
- Amazon infrastructure
Here’s how we set up Zenoss Service Dynamics to ensure that we are immediately aware of issues.
Define an Impact Service for the Application
We’ll use a very simple example application — a single Web server that pulls some images from an S3 repository.
Of course, we already have the Web server guest OS and the Amazon infrastructure discovered and monitored. But targeting the potential service issues takes just a bit more work.
First of all, for a Web server, our key questions are:
- Will DNS resolve?
- Is the Web server serving pages?
- Is it serving pages quickly?
To check all of these, set up a synthetic HTTP transaction device using the DNS name for the server. Our example uses www.zenoss.com. We rely on the HTTP test heavily, since nearly all of our servers communicate over the Web. Make sure you use the DNS name and not the Amazon public IP address: we really need to see if name resolution is failing!
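The three questions above map onto a small synthetic check. What follows is a minimal Python sketch of the idea, not the Zenoss HTTP monitor itself: the names `check_site` and `CheckResult`, the plain `http://` URL, and the 2-second response threshold are all assumptions made for illustration. It resolves the DNS name first, then fetches the page and times the response.

```python
import socket
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    dns_ok: bool          # Will DNS resolve?
    http_ok: bool         # Is the Web server serving pages?
    fast_enough: bool     # Is it serving pages quickly?
    elapsed_s: Optional[float] = None
    detail: str = ""


def _resolve(hostname: str) -> str:
    # Raises socket.gaierror (an OSError) when the name does not resolve.
    return socket.gethostbyname(hostname)


def _fetch(url: str, timeout: float) -> int:
    # Raises URLError/HTTPError (both OSError subclasses) on failure.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_site(
    hostname: str,
    max_seconds: float = 2.0,      # hypothetical "quick enough" threshold
    timeout: float = 10.0,
    resolve: Callable[[str], str] = _resolve,
    fetch: Callable[[str, float], int] = _fetch,
    clock: Callable[[], float] = time.monotonic,
) -> CheckResult:
    # Step 1: name resolution, checked separately so a DNS failure is
    # reported as a DNS failure rather than a generic HTTP error.
    try:
        resolve(hostname)
    except OSError as exc:
        return CheckResult(False, False, False, detail=f"DNS failed: {exc}")

    # Step 2: fetch the page by DNS name and time the request.
    start = clock()
    try:
        status = fetch(f"http://{hostname}/", timeout)
    except OSError as exc:
        return CheckResult(True, False, False, detail=f"HTTP failed: {exc}")
    elapsed = clock() - start

    # Step 3: judge availability and speed.
    http_ok = 200 <= status < 400
    return CheckResult(True, http_ok, http_ok and elapsed <= max_seconds,
                       elapsed, f"status {status}")
```

Calling `check_site("www.zenoss.com")` runs the check against the live site. The `resolve`, `fetch`, and `clock` parameters exist only so the logic can be exercised without network access.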
Next, build an impact service. I’ve called mine “AWS” for this example. There are three important components of this service: the guest OS, the S3 bucket, and the HTTP synthetic transaction device. Here’s what that looks like after it’s been created:
Zenoss Dynamically (and Continuously) Discovers Dependencies
Maybe you’ve heard a Zenoss employee talking about “the model,” and certainly you’ve watched a device being modeled by the software. What’s the model, anyway, and why is it important in assuring service?
The diagram below represents the Zenoss model’s view of this application. At the top is my AWS application, and it is connected to three objects: the S3 bucket (labeled testing.zenoss.io), the HTTP synthetic transaction (labeled www.zenoss.com), and the Windows 2008 Web server (labeled win2008mem.sola…).
The rest of the boxes represent objects that those three components depend on.
- The S3 bucket is available as long as our AWS account is active.
- The HTTP transaction has no external dependencies.
- The Windows server relies on both Windows resources — the RedHat PV driver, the NIC, and the C: and D: drives — and AWS resources, starting with the sol2008 instance.
- The Amazon instance relies on the AWS volumes, a Solutions Subnet, and is running as part of a virtual private cloud in the Amazon East Region.
All those relationships are built automatically in the Zenoss model. You don’t need to do any configuration activity to maintain them. Now, that’s cool!
In each box in the diagram, next to the icon representing the impact element type, there’s a status indicator. Most of the boxes here have a green up arrow. These elements have no reported issues: they’re available and performing. If there’s a problem with this application, we can rule these elements out and look elsewhere.
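One way to picture how a status indicator at the top of the diagram can reflect trouble further down is a "worst state wins" roll-up over the dependency graph. The Python sketch below is an illustration only, not how Zenoss computes state: the element names, the three-level severity scale, and the `rolled_up_state` function are hypothetical, and the graph only loosely follows the relationships described above.

```python
from typing import Dict, List

# Hypothetical dependency map: each element lists what it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "AWS": ["testing.zenoss.io", "www.zenoss.com", "windows-web-server"],
    "testing.zenoss.io": ["aws-account"],
    "www.zenoss.com": [],
    "windows-web-server": ["nic", "c-drive", "d-drive", "ec2-instance"],
    "ec2-instance": ["ebs-volumes", "subnet", "vpc"],
    "vpc": ["region"],
}

# Assumed severity scale; higher numbers are worse.
SEVERITY = {"up": 0, "degraded": 1, "down": 2}


def rolled_up_state(node: str,
                    own_state: Dict[str, str],
                    depends_on: Dict[str, List[str]] = DEPENDS_ON) -> str:
    """Return the worst state among `node` and everything it depends on.

    Elements missing from `own_state` are treated as "up".
    """
    worst = "up"
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:          # guard against cycles and repeats
            continue
        seen.add(current)
        state = own_state.get(current, "up")
        if SEVERITY[state] > SEVERITY[worst]:
            worst = state
        stack.extend(depends_on.get(current, []))
    return worst
```

With this sketch, a fault reported on `ebs-volumes` would surface on `windows-web-server` and on the `AWS` service, while `testing.zenoss.io` would stay green because it does not depend on that volume.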
This application model helps us by checking for all four of the common issues. Here’s how:
[table width ="100%" style ="table-bordered" responsive ="true"]
[table_body]
[table_row]
[row_column]Application Failures[/row_column]
[row_column]Our HTTP transaction tests to make sure DNS is working, pages are being served, and the response time is good. If there’s an issue, the www.zenoss.com box changes from a green arrow to something else.[/row_column]
[/table_row]
[table_row]
[row_column]General Failure to Respond[/row_column]
[row_column]We occasionally see that the instance is in a running state, but the server doesn’t serve pages or even respond to pings. The native availability check built into the Windows server template alerts us to this, changing the win2008mem.sola… state away from green.[/row_column]
[/table_row]
[table_row]
[row_column]Guest OS Issues[/row_column]
[row_column]When an instance is too small for its resource needs, or if there is a critical Windows event, the performance thresholds and event log processing in the Windows server template will change the state of the win2008mem guest OS service element.[/row_column]
[/table_row]
[table_row]
[row_column]Amazon Infrastructure[/row_column]
[row_column]Finally, if Amazon sends a fault event for any of its infrastructure components or (unlikely) has a problem affecting the full region, we see that issue reflected in the element’s state.[/row_column]
[/table_row]
[/table_body]
[/table]
The Root Cause of an Application Problem
With the impact service defined for the application, Zenoss is now checking for all four issues we’ve learned to watch out for.
For me, the best part of this, the part that is unique to Zenoss, is that it notifies me of a problem with an application, not some random device failure. When a device fails, I don’t know if it’s critical or just a test device. It’s much more useful to know which application has an issue.
If it’s a critical application, I can focus resources on fixing it quickly. If it’s an individual test server, well, I will probably let the engineer deal with that one. Without knowing the application dependencies, it’s very hard to tell Linux servers apart or know whether one of the hundreds of Amazon instances can be safely turned off. The Zenoss model really helps here.
Here’s how Zenoss Service Dynamics reports a pretty simple problem affecting this application. Our Web server is no longer responding. We know this right away with Zenoss — no need to wait for a customer to call us, and so we can respond immediately.
Service Dynamics has also reported a root cause: the Web server isn’t sending any data back. Because no other problems are reported, we already know that DNS works, that the Amazon resources and the Windows guest OS are running, and that the IIS service hasn’t stopped on that server. We can focus immediately on resolving the single problem at the root.
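The reasoning in that paragraph, that a problem is a root cause when nothing beneath it is also reporting a problem, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Service Dynamics algorithm: the graph, the element names, and the `root_causes` function are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical dependency map for the example application.
DEPENDS_ON: Dict[str, List[str]] = {
    "AWS": ["testing.zenoss.io", "www.zenoss.com", "windows-web-server"],
    "testing.zenoss.io": ["aws-account"],
    "www.zenoss.com": [],
    "windows-web-server": ["iis-service", "ec2-instance"],
    "ec2-instance": ["ebs-volumes", "vpc"],
}


def reachable(node: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """All elements `node` depends on, directly or indirectly."""
    found: Set[str] = set()
    stack = list(depends_on.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(depends_on.get(current, []))
    return found


def root_causes(service: str,
                failing: Set[str],
                depends_on: Dict[str, List[str]] = DEPENDS_ON) -> List[str]:
    """Failing elements of `service` whose own dependencies are all healthy."""
    in_service = reachable(service, depends_on) | {service}
    candidates = in_service & failing
    return sorted(
        n for n in candidates if not (reachable(n, depends_on) & failing)
    )
```

In the scenario described here, only the HTTP transaction reports a problem, so `root_causes("AWS", {"www.zenoss.com"})` points at that one element. If both the Windows server and its instance were failing, the sketch would name the instance, since the server problem would be a downstream symptom.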
What Zenoss Service Dynamics has done is shortcut what might have been a fairly lengthy discovery process involving several people to point us right at the issue. It may not sound like much, but our customers are telling us that their problem identification time is reduced dramatically, and they can effectively watch more applications with fewer people.
That means happier customers and an increased focus on business results. That’s the way to make a CIO happy!