VMs crashing or failing to boot due to flaws in the NUMA scheduler
There is a flaw in the NUMA initial placement algorithm that can cause all VMs powered on within a short window to be placed on the same NUMA node. When a mix of low-latency and standard VMs is present, the scheduler may underestimate the CPU utilization of the low-latency VMs, so the node appears lightly loaded. This underestimation prevents the standard VMs from being migrated to other, less-congested NUMA nodes.
This behavior will be resolved in an upcoming vSphere 9.x release. The updated NUMA placement in 9.x will consider the demand of all vCPUs, explicitly treating the demand of low-latency VMs as 100%. This ensures heavily loaded nodes are correctly identified, allowing the NUMA scheduler to quickly steer new VMs away from saturated nodes.
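To see why counting low-latency demand at 100% changes the placement decision, here is a minimal sketch. All numbers, the scoring rule, and the function names are illustrative assumptions, not the actual scheduler logic:

```python
def node_load(vms, count_low_latency_as_full=False):
    """Total demand on a node in vCPU-percent (one fully busy vCPU = 100).

    Each VM is (measured_utilization_pct, vcpu_count, is_low_latency).
    With count_low_latency_as_full=True, low-latency VMs are charged at
    100% regardless of their measured utilization (the 9.x behavior).
    """
    total = 0
    for measured_pct, vcpus, low_latency in vms:
        pct = 100 if (low_latency and count_low_latency_as_full) else measured_pct
        total += pct * vcpus
    return total

# Toy node: two low-latency VMs whose utilization is underestimated at 10%,
# plus one standard VM measured at 50%.
vms = [(10, 4, True), (10, 4, True), (50, 2, False)]

# Old behavior: the node looks almost idle, so new VMs keep landing here.
print(node_load(vms))                                   # 180 (1.8 busy vCPUs)
# Fixed behavior: low-latency demand counted at 100% flags the node as busy.
print(node_load(vms, count_low_latency_as_full=True))   # 900 (9 busy vCPUs)
```

With the old accounting the node scores well under any plausible load threshold; with the fixed accounting the same node is correctly seen as saturated.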
Workaround
Set the Numa.InitialPlacementLoadThreshold advanced option to 100 on the affected ESXi hosts:
esxcfg-advcfg -s 100 /Numa/InitialPlacementLoadThreshold
Applying via Host Profiles
This configuration can be applied uniformly across multiple hosts by setting the Numa.InitialPlacementLoadThreshold option to 100 within a Host Profile.
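Where Host Profiles are not in use, the per-host command can also be scripted. A minimal sketch, assuming SSH access as root and hypothetical hostnames; the dry-run default only prints the commands rather than executing them:

```python
import subprocess

THRESHOLD_OPTION = "/Numa/InitialPlacementLoadThreshold"

def build_command(host):
    """SSH command that sets the threshold to 100 on one ESXi host."""
    return ["ssh", f"root@{host}", "esxcfg-advcfg", "-s", "100", THRESHOLD_OPTION]

def apply_to_hosts(hosts, dry_run=True):
    """Print (dry run) or execute the setting command for each host."""
    for host in hosts:
        cmd = build_command(host)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

# Hostnames are placeholders -- substitute your affected ESXi hosts.
apply_to_hosts(["esxi01.example.com", "esxi02.example.com"])
```

Pass dry_run=False only after reviewing the printed commands; the setting takes effect for VMs powered on after the change.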
Verification
To verify the threshold value before or after remediation, run the following command on the ESXi host:
esxcfg-advcfg -g /Numa/InitialPlacementLoadThreshold