ESXi (all versions)
vSAN (all versions)
The issue is driven by severe CPU contention and hypervisor resource exhaustion.
When a VM is configured with a vCPU count equal to the host's total logical processors (e.g., 112 vCPUs on a 112 pCPU host), a "workload burst" (such as an increase in average IO size from 128k to 260k) triggers the following:
Hyper-threading Performance Loss: A vCPU that shares a physical core with another busy hyper-thread can run 10% to 40% slower than a vCPU that has the physical core to itself.
Hypervisor Starvation: With the VM utilizing 100% of logical cores, the ESXi hypervisor lacks the cycles required to manage critical background processes, including networking, physical storage I/O, and vSAN stack operations.
High CO-STOP: The ESXi scheduler must find enough free logical processors to run the VM's vCPUs. When the vCPU count matches the pCPU count, significant "co-stop" delays occur during high demand. These delays manifest as increased storage latency, high physical CPU utilization on the host, and hung processes that impact VM, networking, and ESXi management operations.
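As a rough aid for judging the co-stop described above, the sketch below converts a co-stop summation value (in milliseconds) into a percentage of the sampling interval. It assumes the value comes from a vCenter real-time performance chart with a 20-second sample interval, and that a VM-level value is summed across all vCPUs. Both points are assumptions to verify in your environment. The function name and parameters are hypothetical and not part of any VMware tool.

```python
def costop_percent(summation_ms: float, interval_s: float = 20.0, num_vcpus: int = 1) -> float:
    """Convert a co-stop summation (ms) into an average percentage per vCPU.

    Assumptions (not taken from the article):
    - interval_s is the chart sample interval (20 s for real-time charts).
    - summation_ms is summed across num_vcpus vCPUs, so it is divided
      by the vCPU count to get a per-vCPU average.
    """
    if interval_s <= 0 or num_vcpus <= 0:
        raise ValueError("interval_s and num_vcpus must be positive")
    # Total schedulable time in the interval, in ms, across all vCPUs.
    total_ms = interval_s * 1000.0 * num_vcpus
    return summation_ms / total_ms * 100.0


if __name__ == "__main__":
    # Hypothetical reading: 11,200 ms of co-stop on a 112-vCPU VM in one 20 s sample.
    print(f"{costop_percent(11200, 20, 112):.2f}% average co-stop per vCPU")
```

A value that stays at a few percent or higher per vCPU during the burst is generally worth investigating alongside the storage latency symptoms.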
To ensure the ESXi host retains enough unreserved compute resources for hypervisor scheduling and vSAN operations, follow these best practices:
Right-Size vCPUs to Physical Cores: Reduce the VM's vCPU count to match the number of physical cores (e.g., 56 vCPUs for 56 physical cores/112 logical). This provides each vCPU a dedicated physical execution pipeline and leaves hyper-threads available for background host tasks.
Adhere to Sizing Guidelines: Refer to the Performance Best Practices for VMware vSphere 8.0 guide for recommendations on virtual machine processor configuration.
Dedicated Hosting: If the high vCPU count is strictly required by the application/vendor, move the VM to a dedicated host. Note that resource contention with system processes may still occur during peak bursts if the VM is not right-sized.
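The right-sizing rule above can be expressed as simple arithmetic. The following is a minimal sketch, assuming hyper-threading exposes two logical processors per physical core. The function names are hypothetical and the values should be replaced with those reported by your host.

```python
def recommended_max_vcpus(logical_processors: int, threads_per_core: int = 2) -> int:
    """Return the physical core count, used here as the vCPU ceiling for one VM.

    Assumes every physical core exposes the same number of hyper-threads.
    """
    if logical_processors <= 0 or threads_per_core <= 0:
        raise ValueError("inputs must be positive")
    if logical_processors % threads_per_core != 0:
        raise ValueError("logical_processors must be a multiple of threads_per_core")
    return logical_processors // threads_per_core


def is_oversized(vcpus: int, logical_processors: int, threads_per_core: int = 2) -> bool:
    """True when the VM has more vCPUs than the host has physical cores."""
    return vcpus > recommended_max_vcpus(logical_processors, threads_per_core)


if __name__ == "__main__":
    # Example from the text: 112 logical processors with hyper-threading enabled.
    print(recommended_max_vcpus(112))   # 56
    print(is_oversized(112, 112))       # True
    print(is_oversized(56, 112))        # False
```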