ESXi host CPU usage alarms and VM performance degradation in VDI environments due to vCPU overprovisioning

Article ID: 421543


Products

VMware vSphere ESXi

Issue/Introduction

 

You see "Host CPU usage" alarms in vCenter on ESXi hosts running VDI workloads, particularly during peak user activity.

  • Virtual machines on affected hosts experience performance degradation and slow response times
  • Host CPU usage exceeds 75% (warning) or 90% (critical) thresholds sustained for 5 minutes or more
  • The total vCPUs allocated to powered-on VMs significantly exceeds the host's physical CPU capacity (a condition known as vCPU overprovisioning)
  • VDI environments are especially susceptible because golden images with fixed vCPU counts are cloned across many VMs, amplifying any overallocation
  • The cumulative demand creates CPU scheduling contention where VMs must wait for available physical CPU cycles
  • This results in elevated CPU Ready times and degraded application performance

Additional symptoms reported:

  • ESXi hosts show high CPU and memory utilization
  • Host utilization remains high while VDI users are active
  • Reducing CPU and memory allocations on golden images does not lower host utilization
  • All ESXi hosts in the cluster reach 100% CPU utilization
  • Performance of virtual machines on the affected hosts is degraded

Environment

 

  • VMware ESXi 7.0 and later
  • Any VDI environment (VMware Horizon, Citrix Virtual Apps and Desktops, or other VDI solutions)

Cause

ESXi hosts allocate physical CPU cycles to virtual machines on demand. Under normal conditions, not all VMs need CPU resources simultaneously, so hosts can support more vCPUs than physical CPUs; this expected behavior is called overprovisioning or overcommitment.

VDI environments typically tolerate higher overprovisioning ratios than traditional server workloads because:

  • ESXi resource sharing features (transparent page sharing, memory ballooning, CPU scheduler optimizations) efficiently manage many similar VMs
  • Desktop applications tend to be lightweight and bursty rather than sustained high-CPU workloads
  • Individual VDI sessions spend significant time idle or waiting on user input

However, there is still a limit. When the total vCPUs allocated to powered-on VMs greatly exceeds the host's physical CPU capacity, VMs must wait in a queue for available CPU cycles. This waiting time is measured as "CPU Ready" in esxtop. As overprovisioning increases, CPU Ready times rise and VM performance degrades.

VDI environments are particularly susceptible to hitting this limit because:

  • Golden images define a fixed vCPU count that propagates to every cloned VM
  • A small overallocation per VM multiplies across dozens or hundreds of desktops per host
  • VDI usage patterns create correlated demand: many users logging in, launching applications, or running updates simultaneously

Field experience and internal testing suggest that hosts become unstable when vCPU overprovisioning approaches 1000% of physical capacity. As a rule of thumb, environments operating at 400% or higher are at increased risk of performance degradation and potential instability under load.
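
For illustration (hypothetical numbers): a host with 32 physical cores running 100 desktops at 2 vCPUs each carries 200 vCPUs, a ratio of (200 ÷ 32) × 100 = 625%, well above the 400% risk threshold.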

Resolution

To determine whether vCPU overprovisioning is causing the issue:

Step 1: Calculate your host-level vCPU overprovisioning ratio

You can determine your overprovisioning ratio using either method:

Manual calculation in vSphere Client:

  1. Select the ESXi host experiencing CPU alarms.
  2. Navigate to the VMs tab and count total vCPUs assigned to powered-on VMs.
  3. Navigate to Summary and note the number of logical processors (pCPUs).
  4. Calculate: (Total vCPUs ÷ pCPUs) × 100 = Overprovisioning %

Using PowerCLI or RVTools:

  • PowerCLI: Query Get-VM and Get-VMHost to extract vCPU and pCPU counts programmatically (see the sketch after this list).
  • RVTools: Export the vCPU tab and vHost tab to calculate ratios per host.
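
A minimal PowerCLI sketch of this calculation is shown below. It assumes an existing Connect-VIServer session; the host name is a placeholder.

    # Sum the vCPUs of powered-on VMs and compare against logical processors.
    $esx   = Get-VMHost -Name "esxi01.example.com"   # placeholder host name
    $vCpus = (Get-VM -Location $esx |
              Where-Object { $_.PowerState -eq "PoweredOn" } |
              Measure-Object -Property NumCpu -Sum).Sum
    $pCpus = $esx.ExtensionData.Hardware.CpuInfo.NumCpuThreads   # logical processors
    "Overprovisioning: {0:N0}%" -f ($vCpus / $pCpus * 100)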

Adjust for Hyper-Threading if enabled:

If Hyper-Threading (HT) is enabled, apply a 25% capacity adjustment rather than counting each logical processor as a full CPU. Hyper-Threading allows two threads to share a single physical core, but they compete for the same execution resources (ALUs, cache, branch predictors). The second thread does not provide a full core's worth of processing—benchmarks typically show 20-30% improvement, not 100%.

To calculate HT-adjusted effective cores: Effective Cores = Physical Cores × 1.25

For example, a host with 24 physical cores and HT enabled has 48 logical processors, but effective capacity is closer to 30 cores (24 × 1.25).

Then calculate: (Total vCPUs ÷ Effective Cores) × 100 = HT-Adjusted Overprovisioning %
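
Continuing the PowerCLI sketch above ($esx and $vCpus carried over), the HT-adjusted ratio can be derived from the host's core and thread counts:

    # Effective cores = physical cores x 1.25 when Hyper-Threading is active.
    $cpu = $esx.ExtensionData.Hardware.CpuInfo
    $effectiveCores = if ($cpu.NumCpuThreads -gt $cpu.NumCpuCores) {
        $cpu.NumCpuCores * 1.25
    } else {
        $cpu.NumCpuCores
    }
    "HT-adjusted overprovisioning: {0:N0}%" -f ($vCpus / $effectiveCores * 100)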

Interpreting results:

  • Ratios approaching or exceeding 400% (HT-adjusted) indicate increased risk of CPU contention
  • VDI environments may tolerate somewhat higher ratios than server workloads, but performance degradation becomes likely beyond this threshold

Step 2: Collect esxtop batch data during high activity

Capture CPU metrics during a period of high VDI user activity (login storms, peak usage hours).

  1. SSH to the affected ESXi host.
  2. Run the following command to capture 20 minutes of data at 5-second intervals (240 samples × 5 seconds):
    esxtop -b -d 5 -n 240 > /tmp/esxtop_output.csv
    
  3. Retrieve the output file from /tmp/esxtop_output.csv.

Step 3: Analyze CPU Ready and Co-Stop metrics

Review the esxtop batch output for these key indicators:

  • CPU Ready % (per VM): Values averaging above 5% indicate VMs are waiting for physical CPU cycles. This confirms host-level contention.
  • Co-Stop % (per VM): High Co-Stop indicates an individual VM has more vCPUs than it can effectively use. Low Co-Stop combined with high Ready indicates the problem is cumulative host overprovisioning, not individual VM sizing.

Interpretation:

CPU Ready     Co-Stop      Indicates
High (>5%)    Low (<3%)    Host-level vCPU overprovisioning
High (>5%)    High (>3%)   Individual VMs may be oversized
Low (<5%)     Low (<3%)    No significant CPU contention
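
To pre-screen a large batch file before manual review, a hedged PowerShell sketch such as the following can flag VMs whose average CPU Ready exceeds 5%. The column-name pattern reflects esxtop's typical perfmon-style headers (for example \\host\Group Cpu(1234:vmname)\% Ready) and may need adjusting for your output:

    # Flag columns whose average "% Ready" exceeds 5%.
    # The -like pattern is an assumption; adjust it to match your CSV headers.
    $rows = Import-Csv -Path .\esxtop_output.csv
    $readyCols = $rows[0].PSObject.Properties.Name |
                 Where-Object { $_ -like '*Group Cpu(*% Ready*' }
    foreach ($col in $readyCols) {
        $avg = ($rows | ForEach-Object { [double]$_.$col } |
                Measure-Object -Average).Average
        if ($avg -gt 5) { "{0}: average CPU Ready {1:N1}%" -f $col, $avg }
    }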

Step 4: Generate ESXi host log bundle

Immediately after capturing esxtop data, generate a log bundle from the affected host to correlate logs with the captured metrics.

Follow the steps in Collecting diagnostic information for VMware products.


Step 5: For expert analysis, open a support case

If you require assistance interpreting results, open a support case with Broadcom for VM management.

Include the following:

  • esxtop batch output file (CSV)
  • ESXi host log bundle (generated immediately after esxtop capture)
  • A description of the business impact, when periods of high activity occur, and the symptoms observed

To remediate confirmed overprovisioning, apply one or more of the following options:

Option A: Right-size virtual machines

Reduce vCPU counts on VMs that are allocated more than they require.

  1. Use Aria Operations (formerly vRealize Operations) to identify oversized VMs and obtain right-sizing recommendations. See Using Rightsize to Adjust Resource Allocation.
  2. If Aria Operations is not available, use other VM right-sizing analysis methods or consult application vendors for guidance on virtual machine resource requirements; virtualized workloads often require different sizing than physical server deployments.
  3. Power off or restart VMs as required to apply vCPU reductions (see the sketch after this list).
  4. Monitor host CPU utilization and VM performance after changes.
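
If PowerCLI is used to apply the reduction, a minimal sketch (VM name and target count are placeholders) looks like this; Set-VM changes the vCPU count only while the VM is powered off unless CPU hot-add is enabled:

    # Reduce the vCPU count of a powered-off VM (placeholder values).
    $vm = Get-VM -Name "VDI-DESKTOP-042"
    if ($vm.PowerState -eq "PoweredOff") {
        Set-VM -VM $vm -NumCpu 2 -Confirm:$false
    }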

Option B: Distribute workload across hosts

Migrate VMs to other hosts in the cluster to reduce per-host overprovisioning ratios.

  1. Review vCPU overprovisioning ratios on all hosts in the cluster.
  2. Identify hosts with lower ratios that can accept additional VMs.
  3. Use vMotion to migrate VMs from overloaded hosts to less utilized hosts (see the sketch after this list).
  4. If DRS is enabled, review DRS automation level and migration threshold settings to improve automatic load balancing.
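
A minimal PowerCLI sketch of such a migration (names are placeholders); with vMotion the VM remains powered on during the move:

    # Migrate a running VM to a less utilized host (placeholder names).
    $vm = Get-VM -Name "VDI-DESKTOP-042"
    Move-VM -VM $vm -Destination (Get-VMHost -Name "esxi02.example.com")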

Option C: Add physical CPU capacity

If right-sizing and redistribution are insufficient, add physical resources.

  1. Add additional ESXi hosts to the cluster to distribute VDI workloads. See Adding Hosts to a Cluster.
  2. Alternatively, upgrade existing hosts with processors that have higher core counts.
  3. Recalculate overprovisioning ratios after capacity is added to confirm improvement.

Additional Information