CPU Ready Time issues in ESXi environments running high I/O workloads

Article ID: 386180

Products

VMware vCenter Server

Issue/Introduction

Virtual machines running high I/O workloads, such as Microsoft SQL Server, can show high CPU ready times and degraded performance, particularly on hosts with CPU overprovisioning. This article explains how to identify when CPU overprovisioning becomes problematic and how to address it.

Environment

- VMware ESXi 6.x and later
- Virtual machines running high I/O workloads such as Microsoft SQL Server.
- Environments with multiple high I/O workload VMs.
- Systems showing high CPU ready times (>5%).
- Elevated CPU co-stop events.
- Increased CPU latency.
- Degraded SQL query performance.
- Inconsistent application response times.

Cause

ESXi hosts can support CPU overprovisioning in many scenarios, but high I/O workloads such as SQL Server may experience performance degradation when:
- Multiple resource-intensive VMs compete for CPU resources.
- vCPU allocation significantly exceeds physical CPU capacity.
- Concurrent high-demand workloads create CPU scheduling conflicts.
- CPU ready times consistently exceed recommended thresholds.
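
To make "vCPU allocation exceeds physical CPU capacity" concrete, the ratio of allocated vCPUs to physical cores can be computed as in the sketch below. The function name `overcommit_ratio` and both inputs are hypothetical; the article does not state a ratio at which problems begin, so the result is only a starting point for the esxtop analysis in the Resolution section.

```shell
# overcommit_ratio TOTAL_VCPUS PHYSICAL_CORES
# Hypothetical helper: prints allocated vCPUs per physical core (two decimals).
# Both inputs are supplied by the caller; nothing is read from the host.
overcommit_ratio() {
    awk -v v="$1" -v p="$2" 'BEGIN {
        if (p + 0 <= 0) exit 1          # avoid division by zero
        printf "%.2f\n", v / p
    }'
}
```

For example, `overcommit_ratio 96 32` prints `3.00`, meaning three vCPUs are allocated for every physical core.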

Resolution

  1. Gather performance data using esxtop:
    1. Create a directory for data collection:

      mkdir /vmfs/volumes/datastore_name/esxtop
      cd /vmfs/volumes/datastore_name/esxtop

    2. Run batch capture for 15 minutes with 15-second intervals:

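      minutes=15
      interval=15
      path="/vmfs/volumes/datastore_name/esxtop"
      # Number of samples = minutes * 60 / interval (15 minutes at 15 seconds = 60 samples)
      esxtop -ba -d ${interval} -n $(expr ${minutes} \* 60 / ${interval}) > "${path}"/$(hostname)_$(date -u +"%Y-%m-%dT%H%M%S")_esxtop_batch_all.csv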

  2. Analyze key performance metrics (see Notes for how to read the example values):
    1. CPU Ready Time:
      • Baseline threshold: 5%
      • Warning threshold: 10%
      • Example analysis:
        * Average: 7.49% (indicates resource constraint)
        * Peak: 27.32% (severe contention)
        * Percentage of intervals >10%: ~20% (persistent issue)

    2. CPU Co-Stop Events:
      • Baseline threshold: <1%
      • Warning threshold: 5%
      • Example analysis:
        * Average: 0.93%
        * Peak: 52.45% (severe scheduling conflicts)
        * Pattern: Spikes during high workload periods

    3. CPU Latency:
      • Baseline threshold: 5%
      • Warning threshold: 10%
      • Example analysis:
        * Average: 12.06% (consistent resource constraints)
        * Peak: 72.35% (critical contention)
        * Impact: Direct correlation with performance

  3. Implement remediation based on findings:
    1. For severe CPU ready times (>10% sustained):
      • Migrate resource-intensive VMs to different hosts
      • Reduce vCPU count if possible
      • Implement CPU reservations for critical workloads

    2. For high Co-Stop events (>5% sustained):
      • Review vCPU to pCPU ratio
      • Adjust CPU shares for priority workloads
      • Consider wider workload distribution

    3. For elevated CPU latency (>10% sustained):
      • Evaluate host capacity
      • Review workload scheduling
      • Assess resource allocation strategy

  4. Validate improvements:
    1. Repeat esxtop capture after changes
    2. Compare before/after metrics
    3. Monitor application performance
    4. Document effectiveness of changes
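
For the before/after comparison in step 4, a relative change per metric is often easier to read than two raw percentages. The following is a minimal sketch; `pct_change` is a hypothetical helper, and the values passed to it are whatever averages or peaks were recorded from the two captures.

```shell
# pct_change BEFORE AFTER
# Hypothetical helper: prints the relative change from BEFORE to AFTER
# as a signed percentage with one decimal. Negative means the metric dropped.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN {
        if (b + 0 == 0) { print "n/a"; exit 1 }   # no baseline to compare against
        printf "%+.1f\n", (a - b) / b * 100
    }'
}
```

For example, `pct_change 10 5` prints `-50.0`: the metric fell by half after the change.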

Notes:

* If needed, open a case with VMware at Broadcom for help analyzing the 'esxtop' output.

The example metrics provided are representative of a system experiencing significant CPU contention. The actual values may vary, but the same analysis methodology applies.
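
The three figures used in each example analysis (average, peak, and share of intervals above the warning threshold) can be reproduced with a short awk sketch. It assumes the values of one metric have already been extracted to one number per line; how that column is pulled out of the batch CSV is not covered here. The names `summarize_pct` and `classify_pct`, and the labels `ok`, `elevated`, and `warning`, are hypothetical.

```shell
# summarize_pct THRESHOLD
# Reads one percentage per line on stdin and prints the average, the peak,
# and the share of samples strictly above THRESHOLD.
summarize_pct() {
    awk -v t="$1" '
        NF == 0 { next }                          # skip blank lines
        {
            v = $1 + 0
            sum += v; n++
            if (n == 1 || v > peak) peak = v
            if (v > t) over++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "avg=%.2f peak=%.2f over=%.1f%%\n", sum / n, peak, 100 * over / n
        }'
}

# classify_pct VALUE BASELINE WARNING
# Compares one value with the baseline and warning thresholds from step 2.
classify_pct() {
    awk -v v="$1" -v b="$2" -v w="$3" 'BEGIN {
        if (v + 0 > w + 0)      print "warning"
        else if (v + 0 > b + 0) print "elevated"
        else                    print "ok"
    }'
}
```

For example, `printf '2\n4\n12\n30\n8\n' | summarize_pct 10` prints `avg=11.20 peak=30.00 over=40.0%`, and `classify_pct 7.49 5 10` prints `elevated`, matching the example CPU ready average that sits between the baseline and warning thresholds.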

 

Additional Information