The ESXi host experiences a Purple Screen of Death (PSOD) with the error message: NMI IPI: Panic requested by another PCPU.
In the vmkernel dump logs, the backtrace points to the Mellanox nmlx5_core driver, specifically during lock handling or command completion. The following log entry is characteristic of this issue:
YYYY-MM-DDTHH:MM:SS cpu##:2141318)@BlueScreen: NMI IPI: Panic requested by another PCPU. PC 0x42001c76ec2f, SP 0x453d31a9bba8 (Src 0x1, CPU##)
YYYY-MM-DDTHH:MM:SS cpu##:2141318)Code start: 0x42001c600000 VMK uptime: #:##:##:##.###
YYYY-MM-DDTHH:MM:SS cpu##:2141318)Saved backtrace from: pcpu 207 Heartbeat NMI
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bba8:[0x42001c76ec2e]MCSUnlockWork@vmkernel#nover+0x2b stack: 0x42001d7b40c6
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bbb0:[0x42001d7ace25]nmlx_Complete@(nmlx5_core)#<None>+0x1a stack: 0x3a18
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bbc0:[0x42001d7b40c5]nmlx5_CompleteEnt@(nmlx5_core)#<None>+0x13e stack: 0x0
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bc00:[0x42001d7b4a82]nmlx5_CmdCompHandler@(nmlx5_core)#<None>+0x127 stack: 0x4525a89eaa80
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bc40:[0x42001d7b8195]nmlx5_MSIxISR@(nmlx5_core)#<None>+0x1fa stack: 0x73c06840
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bca0:[0x42001c75fa3b]IntrCookieBH@vmkernel#nover+0x170 stack: 0x453d31a9bcc0
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bd30:[0x42001c73fc65]BH_Check@vmkernel#nover+0x11e stack: 0x0
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bda0:[0x42001ccd302d]CpuSchedPreemptionPointInt@vmkernel#nover+0x22 stack: 0xd1ac0a
YYYY-MM-DDTHH:MM:SS cpu##:2141318)0x453d31a9bdb0:[0x42001ccd5a2d]CpuSched_SafePreemptionPoint@vmkernel#nover+0x16 stack: 0x7
YYYY-MM-DDTHH:MM:SS cpu##:2141318)base fs=0x0 gs=0x420073c00000 Kgs=0x0
YYYY-MM-DDTHH:MM:SS cpu##:2141318)1 other PCPU is in panic.
Note: All VMNICs using the nmlx5_core driver in this environment are confirmed to be on the VMware Compatibility Guide (VCG).
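To check an extracted log for this signature, a small shell helper along the following lines may be useful. It is only a sketch: the function name has_nmlx5_psod_signature is hypothetical, the log file path is supplied by the caller, and the three marker strings are taken directly from the backtrace above. A match suggests the same code path but does not by itself confirm the root cause.

```shell
# Hypothetical helper (not a VMware tool): returns success (0) only when the
# given log file contains all three markers seen in the backtrace above.
has_nmlx5_psod_signature() {
    log_file="$1"
    # -F treats each pattern as a fixed string, -q suppresses output.
    grep -qF 'NMI IPI: Panic requested by another PCPU' "$log_file" &&
    grep -qF 'MCSUnlockWork@vmkernel' "$log_file" &&
    grep -qF 'nmlx5_CmdCompHandler@(nmlx5_core)' "$log_file"
}
```

After defining the function, call it with the path to the extracted log and test its exit status, for example in an if statement that prints a message when the signature is found.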
VMware ESXi 8.0
The root cause is identified as a race condition within the nmlx5_core driver's lock handling mechanism.
A fix is planned for an upcoming patch release of ESXi 8.0 U3. This issue has already been resolved in ESXi 9.0 and later versions.
If you encounter this issue, collect a vm-support bundle and check the installed version of the nmlx5-core VIB using the following command:
esxcli software vib list | grep nmlx5-core
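When recording this information for a support case, it can also help to note which NICs are bound to the driver. The sketch below defines two hypothetical filter functions that read esxcli output on standard input. They assume that Name and Version are the first two columns of "esxcli software vib list" and that Driver is the third column of "esxcli network nic list"; verify both against the output on your host before relying on them.

```shell
# Hypothetical filter: reads `esxcli software vib list` output on stdin and
# prints "name version" for every VIB whose name starts with nmlx5.
# Assumes Name and Version are the first two whitespace-separated columns.
nmlx5_vib_versions() {
    awk '$1 ~ /^nmlx5/ { print $1, $2 }'
}

# Hypothetical filter: reads `esxcli network nic list` output on stdin and
# prints the names of NICs whose Driver column is nmlx5_core.
# Assumes Driver is the third whitespace-separated column.
nmlx5_nics() {
    awk '$3 == "nmlx5_core" { print $1 }'
}
```

On the host, these would be used as "esxcli software vib list | nmlx5_vib_versions" and "esxcli network nic list | nmlx5_nics".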