PSOD and VMFS corruption can occur when using HPE Rebranded Qlogic QLE269x series HBAs on HPE/Hitachi Servers

Article ID: 318506

Products

VMware vSphere ESXi

Issue/Introduction

This article addresses an ESXi host PSOD and potential VMFS corruption that can occur when using HPE rebranded QLogic QLE269x series HBAs.

Symptoms:
When using HPE rebranded QLogic QLE269x HBAs, the ESXi host can fail with a PSOD, and VMFS corruption can potentially occur. This applies to the HPE rebranded SN1100Q and SN1600Q HBAs.

The most common indicator of an issue with the QLE269x chipset is the following message in /var/log/vmkernel.log, usually repeated numerous times leading up to the PSOD:
WARNING: qlnativefc: vmhba2(12:0.0): Invalid ISP SCSI completion handle(281) req=1
WARNING: qlnativefc: vmhba2(12:0.0): Invalid ISP SCSI completion handle(282) req=1
WARNING: qlnativefc: vmhba2(12:0.0): Invalid ISP SCSI completion handle(284) req=1
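
As a quick check, this signature can be searched for from the ESXi shell, for example (a minimal illustration; rotated vmkernel logs under /var/run/log/ may also need to be checked):

# grep "Invalid ISP SCSI completion handle" /var/log/vmkernel.log
# zcat /var/run/log/vmkernel.*.gz | grep "Invalid ISP SCSI completion handle"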


Below are additional qlnativefc driver messages that have been observed around the time of the PSOD:
qlnativefc: vmhba3(12:0.1): Inside qlnativefcAbortIsp
qlnativefc: vmhba3(12:0.1): Performing ISP error recovery - ha= 0x4308e6398b50.

qlnativefc: vmhba2(12:0.0): Inconsistent NVRAM detected: checksum=0x8c54d9ed id=^@ version=0x0.
qlnativefc: vmhba2(12:0.0): Falling back to functioning (yet invalid -- WWPN) defaults.
qlnativefc: vmhba2(12:0.0): 83XX: F/W Error Reported: Check if reset required.
qlnativefc: vmhba2(12:0.0): Heartbeat Failure encountered, chip reset required.


When VMFS corruption occurs as a result, the following messages may be observed immediately before the PSOD:
WARNING: Fil3: 7920: Found invalid object on 5c80ba44-31b46bc0-d894-xxxxxxxxxxxx <FD c49 r5> expected <FD c21 r107>

Other datastore-related messages may be observed after reboot, depending on whether the corruption prevents the VMFS volume from mounting.

The PSOD backtrace varies across environments; the following three examples are the most commonly observed:
Example 1

2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b3d0:[0x41801e8edaf1]PanicvPanicInt@vmkernel#nover+0x545 stack: 0x41801e8edaf1, 0x0, 0x4391ae49b478, 0x0, 0x1f48f6cc0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b470:[0x41801e8edb7d]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4391ae49b4d0, 0x4391ae49b490, 0x2000418000000000, 0x41801ec3e2f1, 0x1280
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b4d0:[0x41801e9466a5]DLM_malloc@vmkernel#nover+0x1469 stack: 0x1, 0x18, 0x8, 0x41801eb22010, 0x18
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b520:[0x41801e943a8c]Heap_AlignWithTimeoutAndRA@vmkernel#nover+0xc4 stack: 0x800000000, 0x0, 0x18, 0x4306f4695030, 0x0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b5a0:[0x41801eb21f57]SCSIVsiDeviceWorldListIsDup@vmkernel#nover+0x4f stack: 0x0, 0x41801eb22010, 0x46, 0x4306f48f6cc0, 0x4391ae49b5d0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b5c0:[0x41801eb22010]SCSIVsiDeviceWorldList@vmkernel#nover+0x80 stack: 0x4391ae49b5d0, 0x4391ae49b5d0, 0x0, 0x0, 0x5ca
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b600:[0x41801e80229f]VSI_GetListInfo@vmkernel#nover+0x253 stack: 0x4391ae49b6b0, 0x4391ae49b740, 0x417fde843242, 0xeb19a8, 0x8d
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b680:[0x41801ef1ec03]UWVMKSyscallUnpackVSI_GetList@(user)#<None>+0x24f stack: 0x4391ae49bf30, 0xd99d7b78a037a0d4, 0x8de47592fb, 0x2c000eb5ca0, 0x300000001
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49bef0:[0x41801ef124c0]User_UWVMKSyscallHandler@(user)#<None>+0xa4 stack: 0x0, 0x0, 0x0, 0x41801e910025, 0x0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49bf20:[0x41801e910025]User_UWVMKSyscallHandler@vmkernel#nover+0x1d stack: 0x0, 0x13b, 0x0, 0x4ad, 0x5ca
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49bf30:[0x41801e93d067]gate_entry_@vmkernel#nover+0x0 stack: 0x0, 0x4ad, 0x5ca, 0x0, 0x2896448
2019-11-07T22:22:13.524Z cpu6:67017)ESC[45mESC[33;1mVMware ESXi 6.5.0 [Releasebuild-13004031 x86_64]ESC[0m
PANIC bora/vmkernel/main/dlmalloc.c:4736 - Corruption in dlmalloc


Example 2
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bd30:[0x418022545702]DLM_malloc@vmkernel#nover+0x4c6 stack: 0x4306748bd580, 0xff, 0x8, 0x418022734afd, 0xff
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bd80:[0x418022543a8c]Heap_AlignWithTimeoutAndRA@vmkernel#nover+0xc4 stack: 0x888902a80, 0x418000000000, 0xff, 0x430674618030, 0x430674810ed0
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79be00:[0x4180227326de]SCSI_VaaiCacheUpdate@vmkernel#nover+0x372 stack: 0x43914a79be60, 0x417fe277f010, 0x0, 0x418022734eb7, 0xf000f0300000000
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bf30:[0x418022734afd]SCSIDeviceReclaimByFilters@vmkernel#nover+0x21 stack: 0x4306746a9d00, 0x4180224ca3ed, 0x430198ebb050, 0x25, 0x430198ebb050
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bf50:[0x4180224ca3ed]helpFunc@vmkernel#nover+0x3c5 stack: 0x430198ebb050, 0x0, 0x0, 0x0, 0x0
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bfe0:[0x4180226cb675]CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2019-10-07T09:32:34.269Z cpu17:65871)ESC[45mESC[33;1mVMware ESXi 6.5.0 [Releasebuild-11925212 x86_64]ESC[0m
#GP Exception 13 in world 65871:SCSI periodi @ 0x418022545702


Example 3
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9ba30:[0x418032553fdf]DLM_free@vmkernel#nover+0x323 stack: 0x430c7f9e3790, 0x418032551501, 0x430c7fa800a0, 0x57ab5420a5c18, 0x1ff
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9ba50:[0x418032551500]Heap_Free@vmkernel#nover+0x115 stack: 0x1ff, 0x430c7f9e3790, 0x1, 0x0, 0x92a000
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9baa0:[0x41803341b24c]FSAts_Lock@esx#nover+0xf1 stack: 0x0, 0x417fd26044c0, 0x430c00000000, 0x200000000001, 0x4
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bb20:[0x41803340e92f]FS3_DiskBufferLock@esx#nover+0xa0 stack: 0x0, 0x92a000, 0x0, 0x430c7f94c380, 0x430c7f94c380
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bb80:[0x41803340eb47]FS3_DiskLockWithPrefetch@esx#nover+0x150 stack: 0x0, 0x430c7f9cc5e0, 0x4180334319e0, 0x418032515653, 0x1848920
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bc60:[0x4180334adb4e]Res3LockClusterVMFS6@esx#nover+0x9f stack: 0x0, 0x0, 0x430c7e1dbff0, 0x430c7f94a370, 0x0
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bcd0:[0x4180334bbc4b]Res6OnDiskLockRC@esx#nover+0x1d0 stack: 0x430c7f94a3a8, 0x100000000000000, 0x7d000000005, 0x5, 0x430c7fa63f60
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bd50:[0x4180334c9ecb]UnmapLockCommon@esx#nover+0x4d4 stack: 0x430c7f7d0b50, 0x5, 0x2ff87396e329, 0xffff00000000, 0x0
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9be30:[0x4180334cc165]UnmapAsyncClusterProcessing@esx#nover+0x2ce stack: 0x4519c00f0000, 0x41803251c74a, 0x0, 0x837bea000001cc01, 0x0
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bf30:[0x4180324eb1f2]HelperQueueFunc@vmkernel#nover+0x30f stack: 0x430c7e1e2e28, 0x430c7e1e2e18, 0x430c7e1e2e50, 0x451a18ca3000, 0x451a0f89bf60
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bfe0:[0x41803270e322]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2019-09-14T03:36:03.922Z cpu2:2456914)ESC[45mESC[33;1mVMware ESXi 6.7.0 [Releasebuild-13981272 x86_64]ESC[0m
#GP Exception 13 in world 2456914:Unmap Helper @ 0x418032553fdf


This issue is observed on both ESXi 6.5 and ESXi 6.7. It is a card firmware issue, not a driver version issue.
It has mostly been observed on HPE DL360, DL380, and DL580 servers, or the equivalent Hitachi rebranded HPE servers (HA8000V).
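
To confirm whether a host has adapters driven by the qlnativefc driver, and to collect driver and firmware details for comparison against the HPE Advisory, commands along the following lines can be run from the ESXi shell (a hedged sketch; output formats vary by release, and the vmkmgmt_keyval dump is simply one common place where the QLogic adapter firmware version is reported):

# esxcli storage core adapter list
# esxcli software vib list | grep qlnativefc
# /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a | less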

Note: The preceding log excerpts are examples only. Dates, times, and environment-specific values will vary depending on your environment.

Environment

VMware vSphere ESXi 6.7
VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5

Cause

The HBA firmware incorrectly replays stale (previously completed) I/O requests, including DMA transfers into memory locations that may already have been freed. This causes memory heap corruption, which can be propagated to disk, and can also result in lost writes.

As a result, a number of SCSI read, write, and ATS failures and retries are logged in vmkernel.log leading up to the crash.

Resolution

HPE has identified the root cause of this issue and has created a fix for their rebranded cards. Refer to the HPE Advisory.

For HPE branded adapters listed in the HPE Advisory running ESXi 6.7, refer to the HPE firmware release notes.

For HPE branded adapters listed in the HPE Advisory running ESXi 6.5, refer to the HPE firmware release notes.

For VMFS metadata consistency to be fixed, the volume should be mounted on at least one ESXi 6.7 U2 host, as VOMA in that release allows you to check and fix inconsistencies in VMFS metadata, LVM metadata, and the partition table. For more information, refer to the ESXi 6.7 U2 release notes.

Refer to the VOMA guide for more information.
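
As an illustration only, a metadata check and repair run from an ESXi 6.7 U2 host would look similar to the following (the naa identifier and partition number are placeholders; the datastore must be offline with no running virtual machines, and the exact options supported should be confirmed against the VOMA guide for your build):

# voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1
# voma -m vmfs -f fix -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1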

Workaround:
There is no workaround for this issue.

Additional Information

VMware Skyline Health Diagnostics for vSphere - FAQ

Impact/Risks:
If datastores are affected, they must be repaired using the vSphere On-disk Metadata Analyzer (VOMA).