As IT moves away from monitoring individual devices and towards monitoring Services, we need to rethink how we visualize the relationships between the components of our Services. Everyone has seen the architecture diagram for a standard 3 tier web application. It will have a layer for the Web Servers on top of a layer for the Application Servers and a layer for the Database Servers. There will be lines showing the various networks and a big cloud at the very top showing the Internet or Intranet. This type of diagram is great for displaying the connectivity between the physical layers or the order in which a user might hit them. What it does not show is how the Service depends on its various components and how those components actually depend on each other for their own state.
Let's take a look at our standard 3 tier web application - Web, App & Database - and re-visualize it from the perspective of the Service. For our exercise let's say that the Service we are redrawing is our externally facing Order Processing System - OPS. We will keep it simple. We have a handful of web servers, all running Apache: a couple of physical boxes we have not decommissioned yet and some virtual ones. There are some NFS volumes from a couple of different filers for storage. The App layer has all been virtualized, is running Tomcat and has a similar setup for storage. At the Database layer we have a couple of physical servers with a single Database clustered across them and some SAN storage. Oversimplified? Yes, but it will work to illustrate how we need to think about monitoring our services.
The first question we have to ask ourselves is "What do we really care about?" The answer for any business should be "My Customers!" What do your customers care about? They care about being able to reach your Order Processing System (Availability) and that it is responsive when they use it (Performance). With that thought in mind, at the top of our diagram is our OPS Service. The Service is the ultimate Parent.
From here let's work through our simple 3 layer model. Let's start at the top of our old diagram with the Web Servers. If we think about it, what do these Web Servers represent? They represent a Service themselves, providing the Web Interface into your OPS Service. If something affects our Web Service, does it affect our OPS Service? You bet it does! Our Web Service is now a child of the OPS Service. What makes up our Web Service? All of our Web Servers, physical or virtual, are children of the Web Service. What happens to them will affect the Web Service. Any other physical components - like storage, ESX hosts or Blades/Chassis - should all be considered children of the servers they support. If a component supports multiple devices - two VMs are running in the same chassis, for example - that chassis would be a child of both of those VMs.
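One way to picture these relationships is as a simple parent-to-children map. The sketch below is only an illustration: the host, chassis and volume names are made up, and a plain Python dict is an assumed representation, not the format of any particular monitoring tool. It shows the shared chassis ending up with two parents while still being counted only once under the Web Service.

```python
# Hypothetical component names; each key is a parent, each value its children.
CHILDREN = {
    "OPS Service": ["Web Service"],
    "Web Service": ["web-phys-01", "web-vm-01", "web-vm-02"],
    "web-phys-01": ["filer-1:/vol/web"],
    "web-vm-01": ["esx-chassis-a", "filer-1:/vol/web"],
    "web-vm-02": ["esx-chassis-a", "filer-2:/vol/web"],
}


def dependencies(node, children=CHILDREN):
    """Return every component below `node`, each listed once."""
    seen = []
    queue = list(children.get(node, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # a shared component (the chassis) is only listed once
        seen.append(current)
        queue.extend(children.get(current, []))
    return seen


def parents_of(component, children=CHILDREN):
    """Return the direct parents of `component`, sorted by name."""
    return sorted(p for p, kids in children.items() if component in kids)


if __name__ == "__main__":
    print(dependencies("Web Service"))
    print(parents_of("esx-chassis-a"))  # both VMs that run in the chassis
```

Because the chassis is a child of both VMs, a problem with it reaches the Web Service through two paths, which is exactly what the diagram should make visible.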
Got it? Good, that was the easy part. Now we are going to step out of our traditional comfort zone.
Let's look at our App layer. If you guessed that I was going to tell you that this layer too should be considered a separate service, with all of the servers children of the service and the underlying components children of the servers, you guessed right. If you thought that I was going to tell you that it was a child of the Web Service, you guessed wrong. It isn't. The Application Service is a child of our OPS Service and will sit right next to the Web Service in our diagram. Remember, we are looking at this from the perspective of our Service. If the Application Service is broken, will it affect our Web Service? Sort of. The Web layer won't have the content to serve, but the Web Service itself is not down and it will still function as you would expect. The Application Service will, however, affect our Parent OPS Service. Remember, from the service perspective what we care about is what our customers care about. When they use our service they don't care if they see a 404 or a 500 error; all they care about is the fact that they can't do what they want. When the Application Service is not in a good state, the OPS Service is not in a good state either.
I hope I didn't lose you there; this really is a new way of visualizing your infrastructure.
As we look at our last layer, you all should be able to tell me what we are going to do with it. Your instinct is right: there will be a Database Service and it will sit right next to the Web & Application Services as a child of our OPS Service. It will contain the Physical Servers as its children and all the additional subcomponents as children of the servers, just like the other two Services.
You have to look at your Service from the top down but manage it from the bottom up. You may have heard this analogy before but I really like it so I will use it again. A parent is only as happy as their least happy child. If you have children, you know this is true. We are no longer looking at Parent -> Child relationships in IT. We are looking at Child -> Parent relationships. If you were questioning what we did in our exercise, take a look at your Services from this perspective and it should make more sense.
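The "least happy child" rule can be sketched as a worst-state rollup from the bottom of the tree to the top. This is a minimal illustration under stated assumptions: the three-level State scale, the server names and the rolled_up_state helper are all hypothetical, and the rule is taken literally (a real tool would likely let redundant children degrade a service rather than take it down).

```python
from enum import IntEnum


class State(IntEnum):
    """Assumed three-level health scale; higher means less happy."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


# Hypothetical server names under the three sibling services.
CHILDREN = {
    "OPS Service": ["Web Service", "Application Service", "Database Service"],
    "Web Service": ["web-vm-01", "web-vm-02"],
    "Application Service": ["app-vm-01", "app-vm-02"],
    "Database Service": ["db-phys-01", "db-phys-02"],
}


def rolled_up_state(node, own_state, children=CHILDREN):
    """Worst state among the node itself and everything beneath it."""
    worst = own_state.get(node, State.OK)  # unreported nodes count as OK
    for child in children.get(node, []):
        worst = max(worst, rolled_up_state(child, own_state, children))
    return worst


if __name__ == "__main__":
    observed = {"app-vm-01": State.DOWN}  # one Tomcat VM has crashed
    for name in ("Web Service", "Application Service", "OPS Service"):
        print(name, rolled_up_state(name, observed).name)
```

With only app-vm-01 reported down, the Web Service stays OK while the Application Service and the OPS Service both roll up to DOWN, which matches the sibling layout described above.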
When you start monitoring what really matters, you have to reassess how you look at your infrastructure. Your customers do not care if xyzServer is down or if Apache is throwing errors. They care about Availability and Performance. How many of you have had that call from a customer saying that the app they use is slow, or been hit with a flood of hundreds of events and not been sure where to start? With Tomcat and our DB children of our OPS Service only, and not of the Web Service, you would not need to waste time investigating Apache when the issue is actually with Tomcat.
On the flipside, in large or dynamic environments, you may not know what serverXYZ actually supports. Have you ever been the on-call person and gotten a page on a server you never heard of? I have. Someone on my Team put a box into production one afternoon and it crashed hard at 3AM. I had no idea what it did. I had to dig through 3 separate systems to figure out what applications were supposed to be running, who owned them and what Business Service it supported. If I had been monitoring Services instead of Servers I would have had most, if not all, of my answers right away.
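With Child -> Parent relationships recorded, the 3AM question "what does this box support?" becomes a walk up the tree. The sketch below assumes the same kind of hypothetical parent-to-children map as before; the component names and the services_supported helper are invented for illustration and do not come from any specific product.

```python
# Hypothetical names; each key is a parent, each value its children.
CHILDREN = {
    "OPS Service": ["Web Service", "Application Service", "Database Service"],
    "Web Service": ["web-vm-01"],
    "Application Service": ["app-vm-01", "app-vm-02"],
    "Database Service": ["db-phys-01"],
    "web-vm-01": ["esx-chassis-a"],
    "app-vm-01": ["esx-chassis-a"],
    "app-vm-02": ["esx-chassis-b"],
}


def build_parent_index(children):
    """Invert the parent -> children map into child -> set of parents."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    return parents


def services_supported(component, children=CHILDREN):
    """Return every ancestor of `component`, nearest level first."""
    parents = build_parent_index(children)
    found, frontier = [], [component]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in sorted(parents.get(node, ())):
                if parent not in found:
                    found.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return found


if __name__ == "__main__":
    print(services_supported("esx-chassis-a"))
    print(services_supported("mystery-box"))  # not attached to any service
```

An empty result for an unknown box is itself a useful answer: it suggests the server went into production without ever being attached to a Service.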
Obviously this was an oversimplification of a complex problem. Most Services are much more complex today. They have pieces in multiple data centers, at hosting providers and in the cloud, they use multiple databases, and they make calls out to 3rd party SaaS tools. You know this; you work with them every day. When you start monitoring your Services, take a good look at the relationships between the various components. Look at how they truly interact with each other, how they affect each other and how each component affects your customer when something is wrong.