CPU Ready Time Issues in ESXi Environments Running SQL Server VMs

Products

VMware vCenter Server

Issue/Introduction

SQL Server virtual machines can show high CPU ready times and degraded performance, particularly on hosts where CPU is overprovisioned. This article explains how to identify when CPU overprovisioning becomes a problem and how to address it.

Environment

- VMware ESXi 6.x and later
- Virtual machines running Microsoft SQL Server
- Environments with multiple high I/O workload VMs
- Systems showing symptoms of CPU contention:
  - High CPU ready times (>5%)
  - Elevated CPU co-stop events
  - Increased CPU latency
  - Degraded SQL query performance
  - Inconsistent application response times

Cause

While ESXi hosts can support CPU overprovisioning in many scenarios, high I/O workloads like SQL Server may experience performance degradation when:
- Multiple resource-intensive VMs compete for CPU resources
- vCPU allocation significantly exceeds physical CPU capacity
- Concurrent high-demand workloads create CPU scheduling conflicts
- CPU ready times consistently exceed recommended thresholds
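
As a rough way to quantify how far vCPU allocation exceeds physical capacity, the overcommit ratio for a host can be computed from its total allocated vCPUs and its physical core count. The helper below is only an illustrative sketch: the function name `vcpu_ratio` is hypothetical, and whether to count physical cores or logical threads as pCPUs is an assumption left to the reader.

```shell
# Hypothetical helper: vCPU-to-pCPU overcommit ratio for one host.
# $1 = total vCPUs allocated to powered-on VMs
# $2 = physical CPU count (assumed here to be physical cores)
vcpu_ratio() {
  awk -v vcpus="$1" -v pcpus="$2" 'BEGIN {
    # Guard against a zero or negative divisor.
    if (pcpus + 0 <= 0) { print "invalid pCPU count"; exit 1 }
    printf "%.2f:1\n", vcpus / pcpus
  }'
}
```

For example, `vcpu_ratio 48 16` prints `3.00:1`.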

Resolution

Gather performance data using esxtop:
1. Create a directory for data collection:
  
  mkdir /vmfs/volumes/datastore_name/esxtop
  cd /vmfs/volumes/datastore_name/esxtop
2. Run a batch capture for 15 minutes at 15-second intervals (4 samples per minute, 60 samples in total):
  
  minutes=15
  path="/vmfs/volumes/datastore_name/esxtop"
  esxtop -ba -d 15 -n $(expr ${minutes} \* 4) > "${path}"/$(hostname)_$(date -u +"%Y-%m-%dT%H%M%S")_esxtop_batch_all.csv
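
The sample count passed to `-n` follows from the capture duration and the interval passed to `-d`: samples = minutes × 60 / interval. The sketch below shows that arithmetic; the function name `esxtop_samples` is hypothetical and is not part of esxtop.

```shell
# Hypothetical helper: number of samples needed to cover a capture window.
# $1 = capture duration in minutes
# $2 = sampling interval in seconds (the value given to esxtop -d)
esxtop_samples() {
  echo $(( $1 * 60 / $2 ))
}
```

`esxtop_samples 15 15` prints `60`. A 2-second interval over the same 15-minute window would need `450` samples.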
Analyze key performance metrics*:
1. CPU Ready Time:
  - Baseline threshold: 5%
  - Warning threshold: 10%
  - Example analysis:
    * Average: 7.49% (indicates resource constraint)
    * Peak: 27.32% (severe contention)
    * Percentage of intervals >10%: ~20% (persistent issue)
2. CPU Co-Stop Events:
  - Baseline threshold: <1%
  - Warning threshold: 5%
  - Example analysis:
    * Average: 0.93%
    * Peak: 52.45% (severe scheduling conflicts)
    * Pattern: Spikes during high workload periods
3. CPU Latency:
  - Baseline threshold: 5%
  - Warning threshold: 10%
  - Example analysis:
    * Average: 12.06% (consistent resource constraints)
    * Peak: 72.35% (critical contention)
    * Impact: Correlates directly with degraded application performance
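
The average, peak, and "percentage of intervals above threshold" figures used in the example analyses can be derived from a single metric column. The following is a minimal sketch, assuming the values for one metric have already been extracted from the CSV as one unquoted number per line; the extraction step is not shown because the column names depend on the host and VM. The function name `summarize_metric` is hypothetical.

```shell
# Hypothetical helper: summarize one metric column.
# Reads one percentage value per line on stdin.
# $1 = threshold; the share of samples strictly above it is reported.
summarize_metric() {
  awk -v limit="$1" '
    /^[0-9.]+$/ {                           # ignore blank or non-numeric lines
      sum += $1; n++
      if ($1 + 0 > peak + 0)  peak = $1     # track the highest sample
      if ($1 + 0 > limit + 0) over++        # count samples above the threshold
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "avg=%.2f peak=%.2f over_limit=%.1f%%\n", sum / n, peak, 100 * over / n
    }'
}
```

For example, `printf '5\n10\n15\n30\n' | summarize_metric 10` prints `avg=15.00 peak=30.00 over_limit=50.0%`.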
Implement remediation based on findings:
1. For severe CPU ready times (>10% sustained):
  - Migrate resource-intensive VMs to different hosts
  - Reduce vCPU count if possible
  - Implement CPU reservations for critical workloads
2. For high Co-Stop events (>5% sustained):
  - Review vCPU to pCPU ratio
  - Adjust CPU shares for priority workloads
  - Consider wider workload distribution
3. For elevated CPU latency (>10% sustained):
  - Evaluate host capacity
  - Review workload scheduling
  - Assess resource allocation strategy
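
To decide which remediation branch applies, each sustained metric value can be compared against its baseline and warning thresholds from this article (5% and 10% for CPU ready and CPU latency, 1% and 5% for co-stop). The sketch below shows one way to do that comparison; the function name `classify_metric` and the output labels are hypothetical.

```shell
# Hypothetical helper: classify a metric value against two thresholds.
# $1 = observed value (percent)
# $2 = baseline threshold
# $3 = warning threshold
classify_metric() {
  awk -v v="$1" -v base="$2" -v warn="$3" 'BEGIN {
    if (v + 0 > warn + 0)      print "warning"
    else if (v + 0 > base + 0) print "above baseline"
    else                       print "ok"
  }'
}
```

With the example averages above, `classify_metric 7.49 5 10` prints `above baseline`, `classify_metric 12.06 5 10` prints `warning`, and `classify_metric 0.93 1 5` prints `ok`.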
Validate improvements:
1. Repeat esxtop capture after changes
2. Compare before/after metrics
3. Monitor application performance
4. Document the effectiveness of the changes
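
When comparing before and after metrics, expressing each change as a relative percentage makes the captures easier to compare and document. The helper below is a minimal sketch of that calculation; the function name `metric_change` is hypothetical, and the inputs are assumed to be the same metric (for example, average CPU ready) from the two captures.

```shell
# Hypothetical helper: relative change of a metric between two captures.
# $1 = value before the change
# $2 = value after the change
metric_change() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    # A zero baseline has no meaningful relative change.
    if (before + 0 == 0) { print "n/a"; exit }
    printf "%+.1f%%\n", 100 * (after - before) / before
  }'
}
```

For example, `metric_change 10 5` prints `-50.0%`, meaning the metric was halved.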

Notes:

* If needed, open a case with VMware at Broadcom to assist with the esxtop output analysis

The example metrics provided are representative of a system experiencing significant CPU contention. The actual values may vary, but the same analysis methodology applies.