In recent conversations with customers about the key performance indicators (KPIs) they want to monitor, the topic of “stolen CPU” came up several times. Based on those discussions, we thought we would share a bit more about what stolen CPU is and how you can monitor for it.
What is “Stolen CPU”?
CPU gets “stolen” when a virtual machine (VM) is forced to sit around and wait for “real” CPU resources. This happens because the VM host machine has already allocated CPU resources to other tasks – for example, off to another VM.
In an ideal world, your percentage of CPU “steal time” should be zero. Anything above zero means that there is some performance degradation, typically caused by one of two things:
- You didn’t assign enough CPU resources to the VM in question.
- Your physical server is “oversubscribed,” and the VMs are all competing for scarce CPU resources.
Figuring Out What’s “Stealing” Your CPU
On Linux hosts, you can easily track stolen CPU via the top command, where it is listed as %st. If this metric is above 0% for any length of time, it’s something that you should check out.
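If you would rather read the number from a script than watch top, the sketch below shows one way to do it. It assumes a Linux kernel that exposes the standard aggregate "cpu" line in /proc/stat, where steal is the eighth counter (after user, nice, system, idle, iowait, irq and softirq). The function names are our own, chosen for illustration, and this is a minimal sketch rather than a full monitoring tool.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters on the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of CPU time stolen between two /proc/stat snapshots."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    if len(first) < 8 or len(second) < 8:
        raise ValueError("this kernel does not report a steal counter")
    # Use only the first eight counters (user .. steal). Guest time is
    # already included in user time, so adding it would double count.
    deltas = [b - a for a, b in zip(first[:8], second[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


def sample_steal(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return steal %."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(interval)
    with open(path) as handle:
        after = handle.read()
    return steal_percent(before, after)
```

On a Linux VM, calling sample_steal(5.0) returns the steal percentage averaged over five seconds, which should roughly track the %st value that top reports over the same window.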
When you dig a little deeper, you’ll probably find that in most cases the root cause is that the physical host running your Linux VM is oversubscribed. For example, assume you have a machine with 32 physical CPU cores running 20 VMs, and each one has been allocated two virtual CPUs. This means 40 virtual CPUs are competing for 32 physical CPU cores, creating a prime environment for “stolen CPU”.
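The arithmetic in that example is easy to script if you want to check several hosts. The helper below is a hypothetical illustration, not part of any product: it simply divides the total number of allocated virtual CPUs by the number of physical cores.

```python
def oversubscription_ratio(physical_cores, vcpus_per_vm):
    """Return allocated virtual CPUs per physical core for one host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores


# The example from the text: 20 VMs with 2 vCPUs each on 32 physical cores.
ratio = oversubscription_ratio(32, [2] * 20)
print(f"{ratio:.2f} vCPUs per physical core")  # prints "1.25 vCPUs per physical core"
```

A ratio above 1.0 means the VMs can collectively ask for more CPU than the host has. Whether that actually produces steal time depends on how busy the VMs are at the same moment.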
Unfortunately, the %st metric alone can’t tell you whether oversubscription is really your culprit. How you go about finding the cause depends on the platform you’re running.
With XenServer or VMware, you have to track down the VM, find out which host that VM is assigned to, and then look at the CPU utilization of the host. You’ll have to investigate a bit further if the VM is part of a cluster using DRS, which should be automatically moving VMs that need more resources.
If you’re looking at a high steal time in an operating system running in AWS, you’ll need to correlate the OS steal time with the information available in AWS CloudWatch. For example, the performance degradation you see could be a result of a CPU quota enforcement, or it could be because other tenants on the same hardware are requesting more CPU resources than they should.
Getting Your “Stolen” CPU Back
If you are like a lot of the folks we talk to, your first instinct when you set out to resolve a steal time issue is to assign more CPU resources to the VM and see what happens. If that doesn’t work, the server is likely oversubscribed, so you try moving the VM to a different host.
For VMware environments, however, this isn’t necessarily the best approach.
If you assign two CPU cores to a VMware VM, the hypervisor tries to schedule both virtual CPUs together, so the VM may have to wait until two physical cores are available. If you’ve assigned four cores, it may have to wait until four cores are available. That means, very counterintuitively, that if you REDUCE the number of cores assigned to a VMware VM, it might actually dramatically reduce your steal time. This is because if you only have one core assigned, only one core needs to be available for the instructions to run on that VM.
Adding “CPU Steal Time” to Your Monitoring Process
If you’re looking for a more automated way to add “CPU steal time” to your overall Linux server monitoring regimen, Zenoss can help. Just go into the monitoring template for Linux servers and activate the %st counter. Then, if Zenoss detects high steal time, you can drill down directly in the Zenoss console to see which host a VM is running on and immediately discover the CPU utilization. With this information, you’ll be able to determine the best way to get your “stolen CPU” back.
For more information about how you can use Zenoss to monitor for stolen CPU as a part of your monitoring process, see the step-by-step instructions in the following article on the Zenoss Wiki: “How Do I Monitor for Stolen CPU on Linux Servers?”
Spread the Word!
If you've found this article helpful, feel free to share it with others via LinkedIn, Twitter, Google+ or Facebook, or follow our blog to get the latest news and information from Zenoss.
If you are new to Zenoss and would like to learn more, check out one of the following resources:
- See how some of our customers are using Zenoss successfully in their environment: Read the Zenoss Service Dynamics: 4 Profiles in Unified Monitoring Success white paper
- Learn more about the Zenoss Service Dynamics architecture: Zenoss Service Dynamics Architecture Overview