CPU Spin Dead Lock caused by interrupt storm on HPE ProLiant DL385 Gen11 iLO drivers

Article ID: 439213

Products

VMware vSphere ESXi

Issue/Introduction

The ESXi host fails with a PSOD (purple diagnostic screen). The backtrace looks similar to the following:


YYYY-MM-DDTHH:MM:SS.181Z cpu74:2097226)1 other PCPU is in panic.
YYYY-MM-DDTHH:MM:SS.072Z cpu74:2097226)NMI: 738: NMI IPI: PC 0x42003233526e, SP 0x453b0251b8e8 (Src 0x1, CPU74)
YYYY-MM-DDTHH:MM:SS.070Z cpu74:2097226)NMI: 738: NMI IPI: PC 0x42003116ec32, SP 0x453b0251bac8 (Src 0x1, CPU74)
YYYY-MM-DDTHH:MM:SS.872Z cpu204:2098892)Jumpstart plugin restore-nfs-volumes activation failed.
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)Backtrace for current CPU #74, worldID=2097226, fp=0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2d30:[0x42003117be80]PanicvPanicInt@vmkernel#nover+0x20c stack: 0x3734, 0x42003117be80, 0x0, 0x420000000001, 0x42003117be80
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2de0:[0x42003117c656]Panic_WithBacktrace@vmkernel#nover+0x57 stack: 0x452ac04a2e50, 0x452ac04a2e00, 0x453b0251f000, 0x452ac04a2eaf, 0x42003233526e
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2e50:[0x420031178561]NMI_Interrupt@vmkernel#nover+0x516 stack: 0xf6e5c8b2faa5c8b6, 0xcdd55a12c1955a16, 0xcedded30c29ded34, 0xf5ed7f90f9ad7f94, 0x4840c39f4400c39b
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2f10:[0x4200316a6404]IDTNMIWork@vmkernel#nover+0x95 stack: 0x0, 0x4200316a786d, 0x0, 0x4200316a10c7, 0x750
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2f30:[0x4200316a786c]Int2_NMI@vmkernel#nover+0x9 stack: 0x750, 0x750, 0x0, 0x8, 0x24
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2f40:[0x4200316a10c6]gate_entry@vmkernel#nover+0xa7 stack: 0x0, 0x8, 0x24, 0x453b0251b960, 0x33
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251b8e8:[0x42003233526e]ehci_filter@(vmkusb)#<None>+0x26 stack: 0x33, 0x4303c6f04eb0, 0x420031161591, 0x0, 0x4303c6e03a00
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251b8f0:[0x4200322b093a][email protected]#1+0xf stack: 0x4303c6f04eb0, 0x420031161591, 0x0, 0x4303c6e03a00, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251b910:[0x420031161590]IntrCookie_DoInterrupt@vmkernel#nover+0x5a1 stack: 0x0, 0x1980, 0x453b0251ba00, 0x100001980, 0x33
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251b9c0:[0x420031161693]IntrCookie_VmkernelInterrupt@vmkernel#nover+0x38 stack: 0x4d, 0x4200316a890b, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251b9e0:[0x4200316a890a]IDT_IntrHandler@vmkernel#nover+0x97 stack: 0x0, 0x4200316a10c7, 0x750, 0x750, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251ba00:[0x4200316a10c6]gate_entry@vmkernel#nover+0xa7 stack: 0x0, 0x0, 0x0, 0x0, 0x43316537fcf0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bac8:[0x42003116ec32]MCSUnlockWork@vmkernel#nover+0x2e stack: 0x42003216c0c6, 0x453b0251baf0, 0x433164e01480, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bad0:[0x420032164e25]nmlx_Complete@(nmlx5_core)#<None>+0x1a stack: 0x453b0251baf0, 0x433164e01480, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bae0:[0x42003216c0c5]nmlx5_CompleteEnt@(nmlx5_core)#<None>+0x13e stack: 0x0, 0x0, 0x0, 0x1, 0x43316537fcc0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bb20:[0x42003216ca82]nmlx5_CmdCompHandler@(nmlx5_core)#<None>+0x127 stack: 0x452318bba880, 0x433164e01480, 0x420032214c40, 0x4200321d854f, 0x80
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bb60:[0x420032170195]nmlx5_MSIxISR@(nmlx5_core)#<None>+0x1fa stack: 0x453b0251bb88, 0x0, 0x453b0251bb90, 0x4303c6f0f390, 0x18b
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bbc0:[0x42003115fa3b]IntrCookieBH@vmkernel#nover+0x170 stack: 0x4303c6f04ea0, 0x1, 0x4303c6f04ea0, 0x4303c6f0f320, 0x3a
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bc50:[0x42003113f98c]BH_DrainAndDisableInterrupts@vmkernel#nover+0x159 stack: 0x420052801570, 0x8d2639427aacd, 0x0, 0x100000000, 0x420052801040
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bcc0:[0x4200311616ff]IntrCookie_VmkernelInterrupt@vmkernel#nover+0xa4 stack: 0x4d, 0x4200316a890b, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bce0:[0x4200316a890a]IDT_IntrHandler@vmkernel#nover+0x97 stack: 0x0, 0x4200316a10c7, 0x750, 0x750, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bd00:[0x4200316a10c6]gate_entry@vmkernel#nover+0xa7 stack: 0x0, 0x0, 0x8898, 0x414, 0x4303c74cd260
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bdc8:[0x420031090ea7]Power_ArchPerformWait@vmkernel#nover+0x157 stack: 0x420052801880, 0x800000000, 0x100000414, 0x420052800000, 0x420052800000
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bdd0:[0x420031090f75]Power_ArchSetCState@vmkernel#nover+0xba stack: 0x800000000, 0x100000414, 0x420052800000, 0x420052800000, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251be20:[0x4200316da1ed]CpuSchedIdleLoopInt@vmkernel#nover+0x292 stack: 0x0, 0x7fffffffffffffff, 0x1, 0x7fffffffffffffff, 0xfffffffffffffff6
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251be90:[0x4200316dedc5]CpuSched_IdleLoop@vmkernel#nover+0x12 stack: 0x4a, 0x4200310804c8, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bea0:[0x42003115efaa]Init_APIdle@vmkernel#nover+0x3f stack: 0x0, 0x0, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251beb0:[0x4200310804c7]SMPAPIdle@vmkernel#nover+0x27c stack: 0x0, 0x0, 0x0, 0x0, 0x0
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)CPU model name: AMD EPYC 9554 64-Core Processor                , FMS: 19/11/1, uCodeRev: XXXXXXXXX
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)PRODUCTNAME:ProLiant DL385 Gen11, VENDORNAME:HPE, SERIAL_NUMBER:XXXXXXXXXXX, SERVER_UUID:3XXXXXXX0-3XX1-5XXX-32XX-3XXXXXXXXXXXX, VERSION:, SKU:XXXXXXXX1, FAMILY:ProLiant
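
To compare a backtrace from another host against the one above, it can help to reduce each frame to its symbol and module. The following is a minimal Python sketch, not a VMware or HPE tool. The function names (parse_frames, summarize) are hypothetical, and the choice of NMI_Interrupt and ehci_filter in vmkusb as markers is an assumption taken from this single trace, not a documented signature. The sample lines are copied from the trace above with the stack words omitted.

```python
import re
from collections import Counter

# One backtrace frame looks like:
#   0x<frame>:[0x<pc>]<symbol>@<module>#<version>+0x<offset>
FRAME_RE = re.compile(
    r"0x[0-9a-f]+:\[0x[0-9a-f]+\]"
    r"(?P<symbol>[^@\s]+)@(?P<module>[^#\s]+)#(?P<version>\S+?)\+0x[0-9a-f]+"
)


def parse_frames(log_text):
    """Return a list of (symbol, module) pairs for every frame line found."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            # Driver modules are printed in parentheses, e.g. "(vmkusb)".
            module = match.group("module").strip("()")
            frames.append((match.group("symbol"), module))
    return frames


def summarize(log_text):
    """Flag the markers seen in this article's trace (assumed, not official)."""
    frames = parse_frames(log_text)
    return {
        "nmi_panic": any(sym == "NMI_Interrupt" for sym, _ in frames),
        "usb_filter": any(
            sym == "ehci_filter" and mod == "vmkusb" for sym, mod in frames
        ),
        "modules": Counter(mod for _, mod in frames),
    }


SAMPLE = """\
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x452ac04a2e50:[0x420031178561]NMI_Interrupt@vmkernel#nover+0x516
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251b8e8:[0x42003233526e]ehci_filter@(vmkusb)#<None>+0x26
YYYY-MM-DDTHH:MM:SS.182Z cpu74:2097226)0x453b0251bb60:[0x420032170195]nmlx5_MSIxISR@(nmlx5_core)#<None>+0x1fa
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Lines that do not match the frame pattern are skipped, so the summary covers only the frames the pattern recognizes. A match on these markers is only a hint that the trace resembles this one; the HPE KB referenced under Cause remains the authority on whether a host is affected.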

Environment

ESXi 8.x

Cause

The memory dump shows symptoms of an interrupt storm originating from iLO on the HPE ProLiant DL385 Gen11 server.

This matches the issue described in the following HPE KB article:
https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00143662en_us

Resolution

To resolve the issue, update the iLO driver to version 10.9.1 or later.
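
When checking many hosts, the installed driver version can be compared against 10.9.1 programmatically. The sketch below is a hedged illustration only. It assumes the version string has already been collected from the host, and that the leading dotted number is what should be compared. The helper names and the example version strings are hypothetical and do not represent real HPE release numbers.

```python
import re

# Minimum driver version named in the Resolution section.
FIXED_VERSION = (10, 9, 1)


def leading_version(version_string):
    """Extract the leading dotted numeric part, e.g. '10.9.1-1OEM' -> (10, 9, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string.strip())
    if not match:
        raise ValueError(f"no numeric version found in {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(version_string):
    """Return True if the version is at or above FIXED_VERSION."""
    parts = leading_version(version_string)
    # Pad the shorter tuple with zeros so 10.9 compares as 10.9.0.
    width = max(len(parts), len(FIXED_VERSION))
    installed = parts + (0,) * (width - len(parts))
    required = FIXED_VERSION + (0,) * (width - len(FIXED_VERSION))
    return installed >= required


if __name__ == "__main__":
    # Hypothetical version strings, for illustration only.
    for candidate in ("10.8.0.6-1OEM", "10.9.1", "10.10.0"):
        print(candidate, has_fix(candidate))
```

Comparing integer tuples avoids the string-comparison pitfall where "10.10.0" would sort before "10.9.1".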