Critical CPU capacity alerts despite low utilization in vCenter for VCF Operations

Products

VCF Operations
VMware Cloud Foundation
VMware vCenter Server

Issue/Introduction

In VCF Operations 9.x (or Aria Operations 8.18.x), you may observe the following:

"Time Remaining" or "Capacity Remaining" for CPU (Demand) triggers critical alerts for a Cluster Compute Resource.
vCenter Server reports significantly lower raw CPU utilization than the VCF Operations capacity metrics.
Predictive analytics indicate the cluster is out of capacity, despite available physical resources.

Environment

VCF Operations 9.0.x
vCenter Server 9.0.x
Aria Operations 8.18.x

Cause

This discrepancy typically occurs due to one or more of the following factors:

Transient Demand Spikes: The Demand model accounts for historical utilization spikes that vCenter's instantaneous real-time metrics do not display. The predictive engine factors these historical peak workload spikes into the baseline, projecting a capacity shortfall even if current utilization is low.
vSphere HA Reservation: vSphere High Availability (HA) Admission Control is enabled. VCF Operations accurately reflects this and automatically deducts the HA reservation percentage from the Total Capacity to calculate "Usable Capacity."
Capacity Policy Configuration: The active policy may be configured with a "Conservative" Time Remaining Risk Level or an overly broad historical time window, capturing outdated workload spikes.
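
The first two factors can be illustrated with simple arithmetic. The following is a minimal sketch, not the capacity engine's actual algorithm: the function names, the 100 GHz cluster, the 25% HA reservation, and the demand figures are all hypothetical values chosen for illustration.

```python
def usable_capacity(total_ghz: float, ha_reserved_pct: float) -> float:
    """Total capacity minus the share reserved by HA Admission Control."""
    return total_ghz * (1 - ha_reserved_pct / 100)


def capacity_remaining_pct(usable_ghz: float, demand_ghz: float) -> float:
    """Percentage of usable capacity left after demand, floored at zero."""
    return max(0.0, (usable_ghz - demand_ghz) / usable_ghz * 100)


# Hypothetical cluster: 100 GHz total, 25% reserved for HA failover.
total_ghz = 100.0
usable_ghz = usable_capacity(total_ghz, ha_reserved_pct=25)

current_demand_ghz = 30.0  # what a real-time chart might show right now
peak_demand_ghz = 72.0     # a historical spike retained by a demand-based model

raw_utilization_pct = current_demand_ghz / total_ghz * 100
remaining_at_peak_pct = capacity_remaining_pct(usable_ghz, peak_demand_ghz)

print(f"Usable capacity: {usable_ghz:.1f} GHz")
print(f"Raw utilization now: {raw_utilization_pct:.0f}%")
print(f"Capacity remaining at peak demand: {remaining_at_peak_pct:.0f}%")
```

With these assumed numbers, raw utilization is about 30% of total capacity, yet only about 4% of usable capacity remains once the HA reservation and the historical peak are taken into account.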

Resolution

To align the capacity projections and resolve the discrepancy, perform the following steps:

Part 1: Address transient demand spikes via policy adjustments

If the alerts are caused by historical transient spikes, adjust the risk level and historical time window. Do not enable the Allocation model with arbitrary overcommit ratios to fix Demand-based alerts, as the engine bases alerts on the most constrained metric.

Log in to the VCF Operations UI.
Navigate to Infrastructure Operations > Configurations > Policy Definition.
Edit the active policy applied to the affected cluster.
Navigate to the Capacity section.
Adjust the Time Remaining Risk Level (e.g., shifting from Conservative to Aggressive) depending on your environment's requirements.
Under the Historical Data window settings, temporarily reduce the timeframe (e.g., from 30 days to 1 day) to flush historical spikes from the predictive model.
Save the policy.

Note: See Alternative Capacity Models under Additional Information below to determine the best model for your specific environment.

Reference: Allocation and Demand Model in Workload Optimization
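
To see why shortening the Historical Data window in Part 1 removes an old spike from consideration, consider the sketch below. It is an illustration only: `peak_in_window` is a hypothetical helper, the sample data is invented, and taking a simple maximum over the window stands in for whatever the predictive engine actually computes.

```python
from datetime import date, timedelta


def peak_in_window(samples: dict[date, float], today: date, window_days: int) -> float:
    """Highest daily CPU demand among samples inside the look-back window."""
    cutoff = today - timedelta(days=window_days)
    recent = [value for day, value in samples.items() if day > cutoff]
    return max(recent, default=0.0)


today = date(2025, 1, 31)

# Hypothetical daily peak CPU demand (%): steady 35%, with one spike 20 days ago.
samples = {today - timedelta(days=n): 35.0 for n in range(30)}
samples[today - timedelta(days=20)] = 95.0

print(peak_in_window(samples, today, window_days=30))  # 95.0
print(peak_in_window(samples, today, window_days=7))   # 35.0
```

A wider window keeps the 95% spike in view; a narrower one only sees the steady 35% baseline. The same trade-off applies in reverse: a window that is too short can hide genuinely recurring peaks.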

Part 2: Verify vSphere HA Admission Control settings

If vSphere HA Reservations are artificially constraining capacity, validate the settings directly within the vCenter Server cluster.

Log in to the vSphere Client.
Select the affected cluster and navigate to Configure > vSphere Availability.
Edit the vSphere HA settings and review the Admission Control configuration.
If the reservation percentage is unnecessarily high for the environment's current architecture, adjust it to a more appropriate value. VCF Operations reflects the updated usable capacity in the next collection cycle.

Part 3: Recalculate capacity analytics

The capacity forecasting engine runs periodically (typically every 24 hours), so the "Time Remaining" and "Capacity Remaining" metrics will not immediately reflect the new baseline.

To force an immediate recalculation globally, navigate to Administration > Control Panel > Dynamic Thresholds and click Start.
Alternatively, to reset the baseline for a specific cluster, locate the cluster in the inventory tree (Infrastructure Operations > Configurations > Inventory Management) or through Global Search at the top of the page. Select the Capacity tab, then use the Reset option within the Time Remaining pane.
- Note: Using the Reset option does not delete the historical metric data from the database; your data remains fully intact for historical reporting, charts, and dashboards. It simply resets the calculation starting point, instructing the predictive capacity engine to ignore data prior to the reset point and build a new forecasting model from that day forward.

Additional Information

Alternative Capacity Models: If the environment requires capacity planning based on provisioned limits rather than historical utilization, administrators can switch to the Allocation model.

To enable the Allocation model and configure overcommit ratios, perform the following steps:

Log in to the VCF Operations UI.
Navigate to Infrastructure Operations > Configurations > Policy Definition.
Select and edit the active policy applied to the target clusters.
Navigate to the Capacity section within the policy editor.
Locate the Allocation Model settings and enable the checkboxes for CPU, Memory, and/or Disk Space depending on the resource boundaries you wish to enforce.
Define the operational overcommit ratios (e.g., setting a vCPU to pCPU ratio of 4:1) in the newly exposed fields to establish the provisioned capacity limits.
Save the policy changes.

Note: Enabling the Allocation model provides a secondary capacity boundary based on provisioned metrics. The capacity engine will alert based on whichever model (Demand or Allocation) runs out of capacity first.
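
The "whichever runs out first" behavior described in the note can be sketched as taking the minimum across models. This is an illustrative assumption about the selection logic, with a hypothetical `most_constrained` helper and invented time-remaining values, not the product's implementation.

```python
def most_constrained(time_remaining_days: dict[str, float]) -> tuple[str, float]:
    """Return the capacity model with the least time remaining, and that value."""
    model = min(time_remaining_days, key=time_remaining_days.get)
    return model, time_remaining_days[model]


# Hypothetical projections for one cluster, in days.
projections = {"demand": 12.0, "allocation": 180.0}

model, days = most_constrained(projections)
print(f"Alerting is driven by the {model} model ({days:.0f} days remaining)")
```

Under this assumption, enabling the Allocation model with a generous overcommit ratio leaves the Demand projection untouched, so a Demand-based alert would persist.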