ESXi host experiences a Purple Diagnostic Screen referencing __lpfc_sli_get_iocbq

Article ID: 317639


Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
On an ESXi host with Emulex FCoE HBAs installed and HPE Agentless Management Service (AMS) running, approximately 50 days after the driver is loaded, you experience these symptoms:
  • The ESXi host fails with a Purple Diagnostic Screen.
  • The backtrace contains entries similar to:

    2018-12-16T17:09:34.886Z cpu3:29864499)Backtrace for current CPU #3, worldID=29864499, fp=0x430843804c90
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199b998:[0x418032febcd5]__lpfc_sli_get_iocbq@(brcmfcoe)#<None>+0x51 stack: 0x4308438b6010, 0x430843804c90, 0x418032ff5d4c, 0x0, 0x4180328c2fc1
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199b9a0:[0x418032fec4d1]lpfc_sli_get_iocbq@(brcmfcoe)#<None>+0x1d stack: 0x430843804c90, 0x418032ff5d4c, 0x0, 0x4180328c2fc1, 0x202
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199b9c0:[0x418032ff5d4c]lpfc_sli4_handle_eqe@(brcmfcoe)#<None>+0x354 stack: 0x202, 0x2b9a, 0x0, 0x0, 0x2b9a
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199ba70:[0x418032ff6c49]lpfc_sli4_intr_bh_handler@(brcmfcoe)#<None>+0x89 stack: 0x430185f56dd8, 0x430185f56d70, 0x0, 0x4180326d1a44, 0x418040c00000
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199baa0:[0x4180326d1a44]IntrCookieBH@vmkernel#nover+0x1e0 stack: 0x0, 0x0, 0x4392d199bad0, 0x439100000001, 0x430185f56d70
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bb40:[0x4180326b1cb0]BH_DrainAndDisableInterrupts@vmkernel#nover+0x100 stack: 0x4392d199bc20, 0x8e010001b8, 0x0, 0x418040c004c8, 0xffffffffffffffff
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bbd0:[0x4180326d3852]IntrCookie_VmkernelInterrupt@vmkernel#nover+0xc6 stack: 0x8e, 0x0, 0x0, 0x41803272ef9d, 0x4392d199bcb0
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bc00:[0x41803272ef9d]IDT_IntrHandler@vmkernel#nover+0x9d stack: 0x0, 0x41803273d067, 0x0, 0x0, 0x0
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bc20:[0x41803273d067]gate_entry_@vmkernel#nover+0x0 stack: 0x0, 0x0, 0x0, 0x0, 0x0
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bce0:[0x41803268bb82]Power_ArchSetCState@vmkernel#nover+0x10a stack: 0x7fffffffffffffff, 0x418040c00000, 0x418040c00080, 0x4180328c6623, 0x0
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bd10:[0x4180328c6623]CpuSchedIdleLoopInt@vmkernel#nover+0x39b stack: 0x8, 0x418040c00120, 0x0, 0x439101388700, 0x439101386000
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bd80:[0x4180328c8eda]CpuSchedDispatch@vmkernel#nover+0x114a stack: 0x410000000001, 0x43949f227100, 0x418040c00108, 0x418040c00120, 0x4392d52a7100
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199beb0:[0x4180328ca152]CpuSchedWait@vmkernel#nover+0x27a stack: 0x1004392d19a7000, 0x3a167348c888d6, 0x800000000, 0x41002d80c7c0, 0x0
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bf30:[0x4180328ca86c]CpuSched_VcpuHalt@vmkernel#nover+0x104 stack: 0x100002001, 0xffffffe3, 0x7, 0x4392d52a7100, 0x401
    2018-12-16T17:09:34.886Z cpu3:29864499)0x4392d199bf80:[0x418032719e27]VMMVMKCall_Call@vmkernel#nover+0x157 stack: 0x4392d199bfec, 0x24600000000, 0x41803274b81b, 0xfffffffffc607c08, 0x0
    2018-12-16T17:09:34.887Z cpu3:29864499)0x4392d199bfe0:[0x41803274b8a2]VMKVMM_ArchEnterVMKernel@vmkernel#nover+0xe stack: 0x41803274b894, 0xfffffffffc4074e6, 0x0, 0x0, 0x0
    2018-12-16T17:09:34.921Z cpu3:29864499)VMware ESXi 6.5.0 [Releasebuild-10175896 x86_64]

     
  • In the vmkernel.log file of the ESXi host, prior to the PSOD, you see brcmfcoe warnings similar to:

    2018-12-16T17:09:34.805Z cpu44:68685)WARNING: brcmfcoe: lpfc_sli_issue_iocb_wait:10765: 0:0330 IOCB wake NOT set, Data x24 x0

    Note: The preceding log excerpts are only examples. The date, time, and environment-specific details may vary depending on your environment.
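To check whether a host is logging this warning, you can search the vmkernel log for the signature shown above. The snippet below is a minimal sketch that builds a sample log file for illustration; on a live ESXi host you would run the grep against /var/log/vmkernel.log instead.

```shell
# Create a sample log file containing the warning signature (illustrative
# only); on a live ESXi host, grep /var/log/vmkernel.log directly.
cat > sample-vmkernel.log <<'EOF'
2018-12-16T17:09:34.805Z cpu44:68685)WARNING: brcmfcoe: lpfc_sli_issue_iocb_wait:10765: 0:0330 IOCB wake NOT set, Data x24 x0
EOF

# Count occurrences of the warning; a growing count in the days before a
# crash is consistent with the resource leak described in the Cause section.
grep -c "IOCB wake NOT set" sample-vmkernel.log
```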


Cause

The HPE AMS tool periodically sends management commands to the HBA driver. In certain situations, the driver does not receive a response to a management command and consequently fails to clean up the command resources allocated for processing that response.

Over time, these leaked resources exhaust the driver's pool of management command resources. Once the pool is empty, a subsequent management command can return a NULL resource address, and dereferencing that NULL pointer causes the ESXi host to fail with a Purple Diagnostic Screen.

Resolution

This issue is resolved in the brcmfcoe 12.0.1278.0 async driver and later versions.

Notes:
  • Upgrading to these versions of ESXi upgrades the brcmfcoe driver version to 12.0.1278.0 or later.
  • This is a known issue affecting VMware ESXi 6.0.x. Currently, there is no resolution for this version of ESXi.
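To verify which driver version is installed on a host, you can list the brcmfcoe VIB with esxcli. The comparison below is a portable sketch: the `installed` value is a placeholder you would replace with your host's actual output, and `sort -V` requires GNU sort (if the host's busybox sort lacks -V, run the comparison on a workstation).

```shell
# On the ESXi host, the installed driver version can be listed with:
#   esxcli software vib list | grep -i brcmfcoe
#
# Portable sketch comparing a captured version string against the fixed
# release. "installed" is a placeholder; substitute your host's value.
installed="12.0.1278.0"
fixed="12.0.1278.0"

# sort -V orders version strings numerically; if the fixed version sorts
# first (or the two are equal), the installed driver contains the fix.
lowest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -n 1)
if [ "$lowest" = "$fixed" ]; then
    echo "brcmfcoe $installed contains the fix"
else
    echo "brcmfcoe $installed predates $fixed; upgrade the driver"
fi
```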