ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers

Article ID: 317547

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • ESXi hosts running 5.5 p10, 5.5 ep11, 6.0 p04, 6.0 U3, or 6.5 GA may fail with a purple diagnostic screen caused by a non-maskable interrupt (NMI) on HPE ProLiant Gen8 servers.
  • Intermittent purple diagnostic screens citing an NMI, non-maskable interrupt, or LINT1 interrupt, with a backtrace similar to:

    YYYY-MM-DD HH:MM:SS cpu0:33074)@BlueScreen: LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed. This may be a hardware problem; please contact your hardware vendor.
    YYYY-MM-DD HH:MM:SS cpu0:33074)Code start: 0x41800d200000 VMK uptime: 1:10:11:25.236
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b1b0:[0x41800d2780da]PanicvPanicInt@vmkernel#nover+0x37e stack: 0x4390c991b248
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b240:[0x41800d2783a5]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4390c991b2a0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b2a0:[0x41800d274373]NMICheckLint1Bottom@vmkernel#nover+0x53 stack: 0x4390c991b370
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b2b0:[0x41800d23307e]BH_DrainAndDisableInterrupts@vmkernel#nover+0xe2 stack: 0x0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b340:[0x41800d256e22]IDT_IntrHandler@vmkernel#nover+0x1c6 stack: 0x0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b370:[0x41800d2c8044]gate_entry_@vmkernel#nover+0x0 stack: 0x0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b430:[0x41800d5048aa]Power_HaltPCPU@vmkernel#nover+0x1ee stack: 0x417fcd483f20
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b480:[0x41800d411c48]CpuSchedIdleLoopInt@vmkernel#nover+0x2f8 stack: 0x117308c314611
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b500:[0x41800d4153a3]CpuSchedDispatch@vmkernel#nover+0x16b3 stack: 0x4394002a7100
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b620:[0x41800d415f68]CpuSchedWait@vmkernel#nover+0x240 stack: 0x0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b6a0:[0x41800d4162a5]CpuSchedTimedWaitInt@vmkernel#nover+0xc9 stack: 0x2001
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b720:[0x41800d416376]CpuSched_TimedWait@vmkernel#nover+0x36 stack: 0x430337ad30c0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991b740:[0x41800d219228]PageCacheAdjustSize@vmkernel#nover+0x344 stack: 0x0
    YYYY-MM-DD HH:MM:SS cpu0:33074)0x4390c991bfd0:[0x41800d416bfe]CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0
    YYYY-MM-DD HH:MM:SS cpu0:33074)base fs=0x0 gs=0x418040000000 Kgs=0x0
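To check whether a host has already logged this signature, the vmkernel logs can be searched from an SSH session. This is a minimal sketch; it assumes the default ESXi log locations and that logging has not been redirected to a remote syslog server.

    # Search the live vmkernel log for the NMI signature.
    grep -i "LINT1/NMI" /var/log/vmkernel.log
    # Also search any rotated, compressed vmkernel logs.
    zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep -i "LINT1/NMI"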



Environment

  • VMware vSphere ESXi 6.5
  • VMware vSphere ESXi 6.0
  • VMware vSphere ESXi 5.5

Cause

The issue is triggered by a change introduced in ESXi 5.5 p10, 5.5 ep11, 6.0 p04, 6.0 U3, and 6.5 GA, in which ESXi disables the Intel IOMMU's (also known as VT-d) interrupt remapper functionality. On HPE ProLiant Gen8 servers, this change causes PCI errors that lead the platform to generate an NMI, causing the ESXi host to fail with a purple diagnostic screen.
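Whether a given host is running with the interrupt remapper disabled can be confirmed from the iovDisableIR kernel setting (the same setting used in the Resolution section below):

    # A Runtime value of TRUE means the Intel IOMMU interrupt remapper
    # is currently disabled on this host.
    esxcli system settings kernel list -o iovDisableIR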
 
HPE has identified the cause of the issue on the HPE ProLiant DL560 Gen8 and HPE ProLiant DL380p Gen8 servers as the combination of high-performance, low-latency PCIe adapters installed in slot 3 and systems under heavy load. For more information, see HPE CUSTOMER ADVISORY.
 
Disclaimer: VMware is not responsible for the reliability of any data, opinions, advice, or statements made on third-party websites. Inclusion of such links does not imply that VMware endorses, recommends, or accepts any responsibility for the content of such sites.
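To confirm that a host matches the affected platform and release, the server model and ESXi build can be read directly on the host. A minimal sketch using standard ESXi commands:

    # Report the hardware vendor and product name (for example, ProLiant DL380p Gen8).
    esxcli hardware platform get
    # Report the ESXi version and build number.
    vmware -vl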
 
 
 

Resolution

This is a known issue affecting ESXi 5.5 p10, ESXi 5.5 ep11, ESXi 6.0 p04, ESXi 6.0 U3, and ESXi 6.5 GA on HPE ProLiant Gen8 servers. This information is also available for reference in the HPE advisory.
 
 
To resolve this issue on the HPE ProLiant DL560 Gen8 or HPE ProLiant DL380p Gen8 server while the IOMMU interrupt remapper is disabled, move the low-latency or high-performance PCIe card to slot 1, 2, 4, 5, or 6 (depending on the type of secondary riser board that may be installed).
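Before physically relocating the adapter, the PCIe devices visible to ESXi can be enumerated to identify the card currently in slot 3. This is a minimal sketch; the slot number ESXi reports may not match the physical slot label on the riser, so verify against the HPE slot documentation for the server model.

    # List PCIe devices along with the slot each one reports.
    esxcli hardware pci list | grep -E "Device Name|Physical Slot"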
 
Alternatively, to work around this issue, re-enable the Intel IOMMU interrupt remapper on the ESXi host:
  1. Connect to the ESXi host with an SSH session and root credentials.
  2. Run this command:

    esxcli system settings kernel set --setting=iovDisableIR -v FALSE
     
  3. Reboot the ESXi host.
  4. Ensure that the iovDisableIR setting is set to FALSE by running this command:

    esxcli system settings kernel list -o iovDisableIR

    For example:

    esxcli system settings kernel list -o iovDisableIR

    Name          Type  Description                                 Configured  Runtime  Default
    ------------  ----  ------------------------------------------  ----------  -------  -------
    iovDisableIR  Bool  Disable Interrupt Routing in the IOMMU...   FALSE       FALSE    TRUE
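For auditing several hosts, the runtime value can also be checked remotely over SSH. A minimal sketch: HOST is a placeholder, key-based root SSH access is assumed, and the awk test assumes the column layout shown in the example output above.

    # Exit non-zero if the interrupt remapper is still disabled at runtime
    # (Runtime column, second-to-last field, is TRUE).
    ssh root@HOST "esxcli system settings kernel list -o iovDisableIR" \
      | awk '/^iovDisableIR/ { if ($(NF-1) == "TRUE") exit 1 }'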