Understanding CPU Steal Time and Optimizing GemFire Performance in Virtualized Environments

Article ID: 376609


Products

GemFire, Pivotal GemFire, VMware Tanzu GemFire

Issue/Introduction

A large portion of GemFire performance issues arise in virtualized or container-based deployments where hardware resources are overcommitted. GemFire can detect this overcommitment using a Linux system metric commonly called "CPU steal time". CPU steal time measures the time a virtual CPU (vCPU) is ready to execute tasks but is delayed because the underlying physical CPU is busy handling other work. Put another way, it is the time a vCPU spends waiting for the hypervisor to grant it processing time on a physical CPU.

In some environments, monitoring steal time is not enabled by default, making it difficult to diagnose and troubleshoot performance issues. Below, we discuss this concept in more detail and how to address the observability and performance challenges in these scenarios.

Environment

VMware vSphere

VMware ESXi

Cause

In VMware vSphere, CPU steal time is known as "ready time", which similarly measures the period during which a virtual machine (VM) is ready to run but cannot because the underlying physical CPU is occupied with other tasks. vSphere's ready time, usually reported in milliseconds, is particularly useful for understanding CPU contention: it provides a clear measure of how long the VM running GemFire waits for CPU access to complete small units of work. Since GemFire aims for sub-millisecond response times, any significant amount of ready time indicates the system is spending more time waiting for CPU resources than actively processing tasks.
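For reference, VMware documents that a ready time summation (in milliseconds) can be converted to a percentage using the performance chart's sampling interval, which is 20 seconds for real-time statistics. A minimal sketch of that conversion, with placeholder numbers:

def ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    # vSphere reports ready time as milliseconds accumulated over the
    # chart's sampling interval (20 s for real-time charts).
    return (ready_ms / (interval_s * 1000.0)) * 100.0

# Example: 1,000 ms of ready time in a 20 s sample is 5% ready.
print(ready_percent(1000))  # 5.0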

GemFire records steal time from "/proc/stat" as a percentage. To put this in perspective, think of a busy grocery store with several checkout lanes, only a few of which are open, so you have to wait your turn. The steal time percentage represents the share of time your checkout lane is occupied by someone else: the higher the steal time, the longer you wait for resources.

This is, of course, an oversimplification of the actual process. In reality, scheduling is far more dynamic and is designed to be fair, so that every task receives some "cashier time", or CPU time. But imagine that the cashier rings up one item from a customer's cart, saves that progress, then moves on to the next customer's item. This process, known as "context switching", allows the CPU to switch quickly between tasks, ensuring all processes receive some attention and preventing any single process from monopolizing the CPU. However, if the queue is too long, it can feel like no progress is being made, and something that should complete in sub-milliseconds can instead take seconds.
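For reference, steal time can be read directly from "/proc/stat" on Linux. The following is a minimal sketch, not GemFire's internal implementation, that samples the aggregate CPU counters twice and computes the steal percentage over the interval:

import time

def read_cpu_times():
    # Aggregate counters from the first line of /proc/stat:
    # user nice system idle iowait irq softirq steal guest guest_nice
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return [int(v) for v in fields[1:]]

def steal_percent(interval_s: float = 1.0) -> float:
    # Sample twice and compute steal time as a share of all CPU time.
    before = read_cpu_times()
    time.sleep(interval_s)
    after = read_cpu_times()
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    steal = deltas[7]  # "steal" is the eighth counter after the "cpu" label
    return 100.0 * steal / total if total else 0.0

print(f"steal: {steal_percent():.2f}%")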

Resolution

The ideal steal time value depends on the application's SLA (service level agreement), which you establish through monitoring. Generally speaking, the lower the steal time, the better. A steal time of zero would be ideal and would help prevent applications from failing to meet their SLAs; however, achieving true zero steal time isn't always realistic.

Here are some guidelines for acceptable steal time in most use cases, illustrated by the sketch after this list:

  • Below 2%: Ideal - Minimal impact on application latency and generally acceptable performance.
  • Between 2% and 5%: Moderate impact - Occasional slowdowns may occur, particularly during peak usage periods. Teams should consider reducing overcommitment or redistributing workloads.
  • Between 5% and 10%: High impact - Noticeable application latency and possible failures as SLAs and timeouts trigger retries, which can worsen the issue if the abandoned workload is still running while new retries add load.
  • Above 10%: Critical impact - Severe performance issues likely requiring immediate intervention.
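To illustrate these bands, a simple monitoring check could map a sampled steal percentage onto the categories above. The function below is our own illustration of the guidelines, not a GemFire API:

def classify_steal(steal_pct: float) -> str:
    # Map a steal-time percentage onto the guideline bands above.
    if steal_pct < 2.0:
        return "ideal"
    if steal_pct <= 5.0:
        return "moderate impact"
    if steal_pct <= 10.0:
        return "high impact"
    return "critical impact"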

Most cloud providers have steal time reporting enabled by default, which simplifies CPU contention monitoring. However, in private cloud environments, such as those running on VMware vSphere, this feature is not enabled automatically. To monitor steal time in vSphere, teams will need to configure the VM with the following advanced parameter:

stealclock.enable = "TRUE"

Enabling this setting ensures accurate tracking of CPU resources and helps identify and resolve performance issues caused by overcommitted hardware.
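For teams automating this change at scale, the same parameter can be applied programmatically. The sketch below uses the open-source pyVmomi library; the vCenter host, credentials, and VM name are placeholders, and VMX-level options like this generally take effect only after the VM is power-cycled:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "gemfire-server-1")  # placeholder name

    # Add stealclock.enable = "TRUE" as an advanced (extraConfig) option.
    spec = vim.vm.ConfigSpec(extraConfig=[
        vim.option.OptionValue(key="stealclock.enable", value="TRUE")])
    vm.ReconfigVM_Task(spec)
finally:
    Disconnect(si)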

In addition to this, the most effective way to reduce or eliminate steal time is to ensure a 1-to-1 mapping of vCPUs to physical CPU cores: for every vCPU assigned to a VM, there should be a dedicated physical CPU core available. Aligning vCPUs to physical cores on a 1-to-1 basis prevents multiple VMs from competing for the same physical CPU resource, which is the primary cause of CPU steal time.
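A quick sanity check is to compare the total vCPUs assigned across all VMs on a host with the host's physical core count. A trivial sketch with illustrative inputs:

def overcommit_ratio(total_vcpus: int, physical_cores: int) -> float:
    # A ratio above 1.0 means vCPUs are overcommitted against physical cores.
    return total_vcpus / physical_cores

# Example: three 8-vCPU VMs on a 16-core host -> 1.5x overcommitment.
print(overcommit_ratio(3 * 8, 16))  # 1.5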

In performance-critical situations, it is also recommended to reserve a core specifically for the hypervisor, such as VMware ESXi. The hypervisor manages the allocation of physical resources to VMs; reserving a dedicated core for it ensures it has the capacity to handle scheduling and other management tasks efficiently, without being affected by the VMs' workloads. This reserved hypervisor core helps maintain consistent performance across all VMs running on the host.

Following these guidelines (enabling steal time monitoring and reserving sufficient capacity for your database) ensures that GemFire has the resources it needs to perform optimally. This proactive approach can significantly reduce time spent troubleshooting and root-causing issues should they arise, and helps teams avoid the frustration of working out why a database isn't meeting performance expectations. Prioritizing observability and proper resource allocation also allows your applications to consistently meet their SLAs and deliver reliable, high-performance results.