ESXi hosts that use the HP CRU driver fail with a purple diagnostic screen when ECC events occur

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

The ESXi host displays a purple diagnostic screen
The purple diagnostic screen contains backtraces similar to:

VMware ESXi 4.1.0 [Releasebuild-320137 X86_64]
@BlueScreen: PCPU 12 locked up. Failed to ack TLB invalidate (0 others locked up).
0:15:15:20.658 cpu16:5594)Code start: 0x418007200000 VMK uptime: 0:15:15:20.658
0:15:15:20.658 cpu16:5594)Saved backtrace from: pcpu 12 TLB NMI
0:15:15:20.658 cpu16:5594)0xffffffffa0ed0e11:[0x418007a15397]Unknown+0x0 stack: 0x0
0:15:15:20.659 cpu12:7532)0xffffffffa0ed0e11:[0x418007a15397]hpq_cru@vmkernel:nover+0x396 stack: 0x0
0:15:15:20.664 cpu16:5594)FSbase:0x0 GSbase:0x418044000000 kernelGSbase:0x190d4b90
0:15:15:20.657 cpu12:7532)NMI: 2020: NMI IPI recvd. We Halt. eip(base):ebp:cs [0x815397(0x418007200000):0xffffffffa0ed0e11:0x4010](Src0x2, CPU12)
0:15:15:20.664 cpu16:5594)Backtrace for current CPU #16, worldID=5594, ebp=0x417f82ed79f8
0:15:15:20.667 cpu16:5594)0x417f82ed79f8:[0x418007257da5]PanicLogBacktrace@vmkernel:nover+0x18 stack: 0x2032312055504350, 0x4
0:15:15:20.667 cpu16:5594)0x417f82ed7b38:[0x418007258087]PanicvPanicInt@vmkernel:nover+0x24e stack: 0x3000000010, 0x417f82ed7
0:15:15:20.668 cpu16:5594)0x417f82ed7c18:[0x418007258679]Panic_WithBacktrace@vmkernel:nover+0xa8 stack: 0x417f82ed7c58, 0x0,
0:15:15:20.668 cpu16:5594)0x417f82ed7cb8:[0x41800727d594]TLBDoInvalidate@vmkernel:nover+0x4db stack: 0x83e4cd00119f71, 0x3540
0:15:15:20.669 cpu16:5594)0x417f82ed7db8:[0x418007376ee7]UserMemUnmapStateCleanup@vmkernel:nover+0x11a stack: 0x417f82ed7e00,
0:15:15:20.670 cpu16:5594)0x417f82ed7e78:[0x418007377767]UserMemUnmap@vmkernel:nover+0x102 stack: 0x41009ac09950, 0x0, 0x417f
0:15:15:20.670 cpu16:5594)0x417f82ed7eb8:[0x41800737bd1c]UserMem_Unmap@vmkernel:nover+0xe3 stack: 0x417f82ed7f18, 0x418007366
0:15:15:20.671 cpu16:5594)0x417f82ed7ec8:[0x41800738f55f]LinuxMem_Munmap@vmkernel:nover+0x5a stack: 0x41009ac09950, 0x0, 0x5b
0:15:15:20.671 cpu16:5594)0x417f82ed7f18:[0x418007366425]User_LinuxSyscallHandler@vmkernel:nover+0xf8 stack: 0x190d37d8, 0x18
0:15:15:20.672 cpu16:5594)0x417f82ed7f28:[0x4180072db5d7]gate_entry@vmkernel:nover+0x46 stack: 0x0, 0x13b, 0x5b, 0x1a000, 0x1
The HP IML log or the System Event Log (SEL) may report an NMI error similar to:

An Unrecoverable System Error (NMI) has occurred (System error code 0x00000032, 0x10426844)
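
The two markers that identify this backtrace (the TLB invalidate lockup message and a frame inside the hpq_cru module) can be checked for mechanically. The following is a minimal sketch, assuming the purple screen text has been saved to a plain text file. The function name and the file path in the usage note are hypothetical. Because the article describes backtraces only as "similar to" the one above, a non-match does not rule out this issue.

```shell
#!/bin/sh
# Sketch: return 0 when a saved purple-screen log contains both markers
# seen in the backtrace above. The function name is illustrative only.
matches_hp_cru_psod() {
    log_file=$1
    # Marker 1: the lockup message from the @BlueScreen line.
    grep -q 'Failed to ack TLB invalidate' "$log_file" &&
    # Marker 2: a backtrace frame inside the HP CRU driver module.
    grep -q 'hpq_cru@vmkernel' "$log_file"
}
```

For example, `matches_hp_cru_psod /tmp/psod.txt && echo "signature found"` prints the message only when both markers are present in the hypothetical file /tmp/psod.txt.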

Environment

VMware ESXi 4.1.x Embedded
VMware ESXi 4.1.x Installable
VMware vSphere ESXi 5.0
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5

Cause

This issue occurs because of the interaction between the HP CRU driver and the system ROM BIOS while corrected ECC errors are being handled.

Resolution

This issue is resolved in BIOS versions released on 5/10/2011 or later. For more information, see HP Customer Advisory c03065184 and the BIOS System ROM dated 05/xx/2011 or later.

Note: The preceding link was correct as of October 22, 2013. If you find the link is broken, provide feedback and a VMware employee will update the link.
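
Whether an installed BIOS meets the date stated above comes down to a date comparison. The following is a minimal sketch, assuming the BIOS release date is already known in MM/DD/YYYY form (for example, from the BIOS setup screen). The function name is hypothetical, and the 05/10/2011 cutoff is taken from this article. The HP advisory remains the authority for a specific server model.

```shell
#!/bin/sh
# Sketch: return 0 when a BIOS release date (MM/DD/YYYY) is on or after
# 05/10/2011, the date given in this article. The name is illustrative only.
bios_date_has_fix() {
    # Reorder MM/DD/YYYY into YYYYMMDD so that a numeric comparison
    # orders the dates chronologically.
    ymd=$(printf '%s\n' "$1" | awk -F/ '{ printf "%04d%02d%02d", $3, $1, $2 }')
    [ "$ymd" -ge 20110510 ]
}
```

For example, `bios_date_has_fix 03/30/2011 || echo "BIOS update needed"` prints the message because the date is earlier than the cutoff.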

To work around this issue, disable the smx provider:

1. Connect to the ESXi host using the vSphere Client or vCenter Server.
2. Click Host > Configuration > Software > Advanced Settings > UserVars.
3. Change the UserVars.CIMoemProviderEnabled value from 1 to 0.
4. Run this command to restart sfcbd:

# /etc/init.d/sfcbd-watchdog restart
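
The same workaround can likely be applied entirely from the ESXi command line. The following is a minimal sketch, assuming that the esxcfg-advcfg utility exposes the UserVars.CIMoemProviderEnabled setting at the path /UserVars/CIMoemProviderEnabled. Verify this on your ESXi version before relying on it. The function name and the DRY_RUN variable are hypothetical helpers for this example.

```shell
#!/bin/sh
# Sketch: disable the OEM CIM providers and restart sfcbd from the shell.
# Assumption: /UserVars/CIMoemProviderEnabled is the esxcfg-advcfg path
# for the UserVars.CIMoemProviderEnabled advanced setting.
disable_oem_cim_providers() {
    # When DRY_RUN is set to any non-empty value, print each command
    # instead of running it.
    run=${DRY_RUN:+echo}
    $run esxcfg-advcfg -s 0 /UserVars/CIMoemProviderEnabled &&
    $run /etc/init.d/sfcbd-watchdog restart
}
```

Setting `DRY_RUN=1` before calling `disable_oem_cim_providers` prints the two commands so they can be reviewed first. Calling the function with DRY_RUN unset runs them on the host.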

Additional Information

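For the Japanese version of this article, see ECC イベント発生時に HP CRU ドライバを使用する ESXi ホストが失敗し、紫色の診断画面が表示される (ESXi hosts that use the HP CRU driver fail with a purple diagnostic screen when ECC events occur).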