The ESXi server crashes with a PSOD in response to a Non-Maskable Interrupt (NMI) triggered by an erroring PCI device.
The crash logs or the PSOD screen capture contain events similar to the following:
cpu0:2098206)ApeiHEST: 233: Invoked HestNMIHandler
cpu0:2098206)ApeiHEST: 259: Uncorrectable Errors
cpu0:2098206)ApeiHEST: 294: Error Event Severity: Fatal
cpu0:2098206)ALERT: ApeiHEST: 327: Fatal error from 0000:XX:00.0(PCI Express Endpoint), VID:####, DID:#### DevSts: 0xd, AERUeSts: 0x2000.
cpu0:2098206)NMI: 1031: ApeiHESTNmiHandler requested PSOD
This PSOD is the response to an NMI raised by the CPU to notify the VMkernel of a failing or erroring PCI device.
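The DevSts value in the ALERT can give a first hint about what the device reported. The sketch below assumes that DevSts follows the PCI Express Device Status register layout (bit 0 Correctable Error Detected, bit 1 Non-Fatal Error Detected, bit 2 Fatal Error Detected, bit 3 Unsupported Request Detected). That mapping is an assumption about the log field, not something stated by the log itself, and decode_devsts is a hypothetical helper name.

```shell
# Hypothetical helper: name the error bits set in a DevSts value.
# Assumes the value uses the PCI Express Device Status register layout.
decode_devsts() {
    val=$(( $1 ))    # accepts decimal or 0x-prefixed hexadecimal input
    [ $(( val & 0x1 )) -ne 0 ] && echo "Correctable Error Detected"
    [ $(( val & 0x2 )) -ne 0 ] && echo "Non-Fatal Error Detected"
    [ $(( val & 0x4 )) -ne 0 ] && echo "Fatal Error Detected"
    [ $(( val & 0x8 )) -ne 0 ] && echo "Unsupported Request Detected"
    return 0
}
```

Under that assumption, calling decode_devsts 0xd with the value from the example reports the correctable, fatal, and unsupported request bits. The decoded bits are only a hint; the hardware vendor's diagnostics remain authoritative.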
Engage the hardware vendor with a screenshot of the PSOD screen and the device details for further diagnostics and troubleshooting.
How to identify the failing device:
The PCI device address of the failing device is part of the ALERT shown on the PSOD screen or in the crash logs. From the example in the introduction:
ALERT: ApeiHEST: 327: Fatal error from 0000:XX:00.0(PCI Express Endpoint), VID:####, DID:####, DevSts: 0xd, AERUeSts: 0x2000.
In this example, the device address is 0000:XX:00.0.
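When the ALERT text is available as a saved log line rather than only as a screenshot, the address can be pulled out with standard text tools. This is a minimal sketch: extract_pci_addr is a hypothetical helper name, and the pattern assumes the address has the segment:bus:device.function form shown in the example, with hexadecimal digits in place of XX.

```shell
# Hypothetical helper: print the PCI address (segment:bus:device.function)
# that follows "Fatal error from" in ALERT text read from standard input.
extract_pci_addr() {
    sed -n 's/.*Fatal error from \([0-9A-Fa-f]\{4\}:[0-9A-Fa-f]\{2\}:[0-9A-Fa-f]\{2\}\.[0-7]\).*/\1/p'
}
```

For example, piping a copied ALERT line into extract_pci_addr prints only the address. Lines that do not contain a matching ALERT produce no output.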
Use this device address (ID) to locate the device in the vSphere UI by following the steps listed below.
Alternatively, run the lspci command in the ESXi Shell to view the device details.
Example:
# lspci | grep 0000:XX:00.0
Output:
0000:XX:00.0 <Device Class>: <Device Vendor> <Device Model>
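The lookup can also be wrapped in a small function so that the address is matched literally. This is a convenience sketch, not a VMware-provided tool: show_pci_device is a hypothetical name, and it relies only on the lspci and grep commands used in the example.

```shell
# Hypothetical helper: list the lspci entry for one PCI address.
# Usage: show_pci_device 0000:XX:00.0   (substitute the address from the ALERT)
show_pci_device() {
    addr="$1"
    if [ -z "$addr" ]; then
        echo "usage: show_pci_device <segment:bus:device.function>" >&2
        return 2
    fi
    # -F matches the address as a literal string, so "." is not a wildcard.
    # -i ignores case, because hexadecimal digits may appear in either case.
    lspci | grep -i -F "$addr"
}
```

The function returns a non-zero status when no device matches, which usually means the address was mistyped or the device is no longer enumerated.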