A host in the environment experienced a Purple Screen of Death (PSOD). Review of the PSOD showed that a CPU thread became unresponsive while holding a lock, and that multiple physical CPUs (PCPUs) did not respond to non-maskable interrupts (NMIs).
The host crashed unexpectedly and displayed a PSOD.
vSphere ESXi 7.X
vSphere ESXi 8.X
The PSOD resulted from hardware-level behavior: a physical CPU was overwhelmed by platform interrupts and could not release a lock it was holding. This is symptomatic of a known issue often referred to as an iLO interrupt storm, which is especially common on AMD EPYC-based servers.
This is a hardware-related issue. VMware recommends the following steps:
Contact the server hardware vendor (e.g., HPE) and provide them with full logs and PSOD screenshots or dumps.
Refer the hardware vendor to known issues such as:
HPE advisory a00143662en_us
Discuss firmware/BIOS or iLO updates that may help mitigate interrupt storms or improve CPU interrupt handling.
Broadcom does not have control over hardware interrupt behavior; this type of issue must be addressed at the firmware or hardware design level.
Ensure hosts are running the latest supported BIOS, iLO firmware, and ESXi version validated by the server vendor.
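When opening a case with the hardware vendor, it can help to record the exact ESXi version of the affected host. The following is a minimal sketch, assuming the `Key: Value` text printed by `esxcli system version get` has been saved from the host. The helper names `parse_esxcli_kv` and `is_in_scope` are hypothetical, and the sample text is an illustrative placeholder, not output captured from a real host.

```python
from typing import Dict


def parse_esxcli_kv(text: str) -> Dict[str, str]:
    """Parse 'Key: Value' lines into a dictionary.

    Assumes the layout printed by `esxcli system version get`;
    lines without a colon are ignored.
    """
    info: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def is_in_scope(version: str) -> bool:
    """Return True if the major version is 7 or 8, the releases this article lists."""
    major = version.split(".", 1)[0]
    return major in {"7", "8"}


# Illustrative placeholder text (hypothetical values, not from a real host).
SAMPLE = """\
   Product: VMware ESXi
   Version: 8.0.2
   Build: Releasebuild-00000000
   Update: 2
   Patch: 0
"""

info = parse_esxcli_kv(SAMPLE)
print(info["Version"], "in scope:", is_in_scope(info["Version"]))
```

The sketch only checks the ESXi major version against the releases listed in this article. BIOS and iLO firmware levels still need to be confirmed against the server vendor's own support matrix.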