Yesterday I wrote a blog post trying to figure out what CPU steal time is and why it occurs. The problem with that post was that I didn’t go deep enough.
I was looking at the issue from the point of view of a generic virtual machine, but the case I had to deal with wasn’t exactly like that: I saw the CPU steal time on an Amazon EC2 instance. Assuming that it was just my neighbors acting up, or Amazon having a temporary hardware issue, was the wrong conclusion.
That’s because I didn’t know enough about Amazon EC2. Well, I’ve learned a bunch since then, so here’s what I found.
It all started with this Server Fault answer, which says:
Depending on the EC2 instance type and the underlying hardware, you may not be paying for access to all of the underlying CPU cycles. Amazon is not going to give you access to 100% of a modern, fast CPU if you have asked for an m1.small which is promised to be equivalent to an old, slow CPU.
On EC2, steal doesn’t depend on the activity of other virtual machine neighbors. It is simply a matter of EC2 making sure you are not getting more CPU cycles than you are paying for.
Wait, what? I needed to know more about this. So I dug deeper and found this PDF document describing “The Top 5 AWS EC2 Performance Problems”. Just for the record, they are:
- Unpredictable EBS Disk I/O
- EC2 Instance ECU Mismatch and Stolen CPU
- Running out of EC2 Instance Memory
- ELB Loading Balancing Traffic Latency
- AWS Maintenance and Service Interruptions
Hmm. It’s good that I haven’t hit all five (yet?). But the issue I was having is clearly on the list. So, scrolling down to page 11 of that document, I found a very helpful explanation of what an ECU is:
While AWS has a large number of physical servers under management it does not rent them per se, rather only access to these servers in the form of virtual machines is available. The types of virtual machines are limited to a small list so as to make choosing an instance relatively easy. Somehow a virtual machine type, e.g. m1.large, can run on very different underlying hardware platforms and yet yield roughly the same performance in terms of compute.
To standardize compute, AWS has created a logical computation unit known as an Elastic Compute Unit (ECU). ECUs equate to a certain amount of computing cycles in a way that is independent of the actual hardware – 1 ECU is defined as the compute power of a 1.0-1.2Ghz of a 2007 server CPU.
… and further down, on what “stolen” CPU is:
Stolen CPU is a metric that’s often looked at but can be hard to understand. It implies some malevolent intent from your virtual neighbors. In reality it is a relative measure of the cycles a CPU should have been able to run but could not due to the hypervisor diverting cycles away from the instance. From the point of view of your application, stolen CPU cycles are cycles that your application could have used.
Some of these diverted cycles stem from the hypervisor enforcing a quota based on the ECU you have purchased. In other cases, such as the one shown below, the amount of diverted or stolen CPU cycles varies over time, presumably due to other instances on the same physical hardware also requesting CPU cycles from the underlying hardware.
… and as to why it occurs:
Since the price per instance type is the same in a given region, AWS, through the use of hypervisors, ensures that all virtual machines get a fair share of access to the underlying hardware to actually run.
Recommended problem avoidance and resolution:
- Buy more powerful EC2 instances
- Baseline your application compute needs
- Briefly profile your app on an EC2 instance before finalizing a deploy decision
- Re-deploy your application in another instance
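If you want to put a number on the steal while baselining or profiling, Linux exposes it in the aggregate `cpu` line of /proc/stat, where the eighth counter is steal time. Here is a minimal sketch, assuming a Linux guest; the function names `parse_cpu_line` and `steal_percent` are made up for illustration, and tools like top or vmstat report the same figure in their `st` column.

```python
import time


def parse_cpu_line(line):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time stolen between two snapshots of the 'cpu' line."""
    deltas = [a - b for a, b in zip(parse_cpu_line(after), parse_cpu_line(before))]
    # Counter order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted inside user and nice, so only
    # the first eight counters are summed to avoid double counting.
    total = sum(deltas[:8])
    if total <= 0 or len(deltas) < 8:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling interval; 5 seconds is an arbitrary choice
    second = read_cpu_line()
    print("steal: %.1f%%" % steal_percent(first, second))
```

Sampling over an interval matters because the counters are cumulative since boot; a single read only tells you the lifetime average.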
OK, understood. But I wanted more detailed information straight from the source, so I went to read the Amazon EC2 details page. I’ve read it before, but it’s overwhelming and, with all the Amazon lingo, often difficult to make sense of. Thankfully, this time it was quite useful. I looked for my t2 instance type and saw this:
General Purpose Instances
T2 instances are Burstable Performance Instances that provide a baseline level of CPU performance with the ability to burst above the baseline. Instances in this family are ideal for applications that don’t use the full CPU often or consistently, but occasionally need to burst (e.g. web servers, developer environments, and small databases).
- t2.micro: 1 GiB of memory, 1 vCPU, 6 CPU Credits/hour, EBS-only, 32 bit or 64-bit platform
- t2.small: 2 GiB of memory, 1 vCPU, 12 CPU Credits/hour, EBS-only, 32 bit or 64-bit platform
- t2.medium: 4 GiB of memory, 2 vCPUs, 24 CPU Credits/hour, EBS-only, 32 bit or 64-bit platform
- t2.large: 8 GiB of memory, 2 vCPUs, 36 CPU Credits/hour, EBS-only, 64 bit platform
Hold on! What’s that burstable thing? Have I seen it before? No, I haven’t. And it’s pure gold! It’s so good that I’ll quote the whole thing here:
Burstable Performance Instances
Amazon EC2 allows you to choose between Fixed Performance Instances (e.g. M3, C3, and R3) and Burstable Performance Instances (e.g. T2). Burstable Performance Instances provide a baseline level of CPU performance with the ability to burst above the baseline. T2 instances are for workloads that don’t use the full CPU often or consistently, but occasionally need to burst.
T2 instances’ baseline performance and ability to burst are governed by CPU Credits. Each T2 instance receives CPU Credits continuously, the rate of which depends on the instance size. T2 instances accrue CPU Credits when they are idle, and use CPU credits when they are active. A CPU Credit provides the performance of a full CPU core for one minute.
For example, a t2.small instance receives credits continuously at a rate of 12 CPU Credits per hour. This capability provides baseline performance equivalent to 20% of a CPU core. If at any moment the instance does not need the credits it receives, it stores them in its CPU Credit balance for up to 24 hours. If and when your t2.small needs to burst to more than 20% of a core, it draws from its CPU Credit balance to handle this surge seamlessly. Over time, if you find your workload needs more CPU Credits than you have, or your instance does not maintain a positive CPU Credit balance, we recommend either a larger T2 size, such as the t2.medium, or a Fixed Performance Instance type.
Many applications such as web servers, developer environments and small databases don’t need consistently high levels of CPU, but benefit significantly from having full access to very fast CPUs when they need them. T2 instances are engineered specifically for these use cases. If you need consistently high CPU performance for applications such as video encoding, high volume websites or HPC applications, we recommend you use Fixed Performance Instances. T2 instances are designed to perform as if they have dedicated high speed Intel cores available when your application really needs CPU performance, while protecting you from the variable performance or other common side effects you might typically see from over-subscription in other environments.
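The 20% figure for the t2.small falls straight out of the definition: one CPU Credit is one full core for one minute, so 12 credits per hour is 12 of every 60 core-minutes. A quick sketch of that arithmetic for the four sizes listed earlier follows. Only the t2.small number is stated in the quote; the others are my own extrapolation from the same rule, and the helper name `baseline_core_fraction` is made up.

```python
# Credits per hour and vCPU counts as listed on the EC2 details page quoted above.
T2_SIZES = {
    "t2.micro": {"credits_per_hour": 6, "vcpus": 1},
    "t2.small": {"credits_per_hour": 12, "vcpus": 1},
    "t2.medium": {"credits_per_hour": 24, "vcpus": 2},
    "t2.large": {"credits_per_hour": 36, "vcpus": 2},
}


def baseline_core_fraction(credits_per_hour):
    """Fraction of one full core sustainable on earned credits alone.

    One credit buys one core-minute, and an hour has 60 minutes.
    """
    return credits_per_hour / 60.0


if __name__ == "__main__":
    for name, spec in T2_SIZES.items():
        fraction = baseline_core_fraction(spec["credits_per_hour"])
        # Assumption: the baseline is shared evenly across the instance's vCPUs.
        per_vcpu = fraction / spec["vcpus"]
        print("%s: %.0f%% of a core (%.0f%% per vCPU)"
              % (name, fraction * 100, per_vcpu * 100))
```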
OK. Now, all of a sudden, it all makes sense. My t2 instance was overusing the CPU, ran out of CPU credits, and the rest of the CPU time was stolen. It wasn’t the fault of Amazon’s hardware or my neighbors acting up. As often happens, I rushed to blame other people for my own issues.
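To get a feel for how quickly that can happen, here is a back-of-the-envelope sketch using only the rules quoted above: credits are earned at a fixed hourly rate and spent at one credit per core-minute. The function name `hours_until_throttled` is hypothetical, and the starting balance of 288 is an assumption that reads “up to 24 hours” as 24 hours' worth of credits at the t2.small rate (12 × 24).

```python
def hours_until_throttled(balance, credits_per_hour, core_usage):
    """Hours of sustained load before the credit balance reaches zero.

    balance          -- CPU Credits currently banked
    credits_per_hour -- earn rate for the instance size (12 for a t2.small)
    core_usage       -- sustained usage as a fraction of one core (1.0 = 100%)

    Returns None when usage is at or below the baseline, because the
    balance never runs out in that case.
    """
    spend_per_hour = core_usage * 60.0  # one credit per core-minute
    net_drain = spend_per_hour - credits_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain


if __name__ == "__main__":
    # Assumed full balance for a t2.small: 12 credits/hour * 24 hours = 288.
    for usage in (0.2, 0.5, 1.0):
        print(usage, hours_until_throttled(288, 12, usage))
```

Under those assumptions a t2.small pinned at 100% of a core drains a full balance in 6 hours, and at 50% in 16 hours. After that the instance is held to its baseline, which from inside the guest shows up as steal.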
Armed with that knowledge, I applied some configuration optimizations to the application and the server running it, and now things seem to be back to normal.