This article addresses an ESXi host PSOD (Purple Screen of Death) and potential VMFS corruption caused by HPE-rebranded QLogic QLE269x HBAs.
Symptoms:
When using HPE-rebranded QLogic QLE269x HBAs, an ESXi host may fail with a PSOD and may additionally experience VMFS corruption. This applies to the HPE-rebranded SN1100Q and SN1600Q HBAs.
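To confirm whether a host carries one of these adapters, the installed HBAs and their driver bindings can be listed from the ESXi shell. This is a minimal sketch; adapter names such as vmhba2 are examples and will differ per host:

# List storage adapters with their bound driver and PCI description
esxcfg-scsidevs -a

# Cross-check the underlying PCIe device for any QLogic adapter
lspci | grep -i qlogic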
The most common indicator of an issue with the QLE269x chipset is the following message in /var/log/vmkernel.log, usually repeated numerous times leading up to the PSOD:
WARNING: qlnativefc: vmhba2(12:0.0): Invalid ISP SCSI completion handle(281) req=1
WARNING: qlnativefc: vmhba2(12:0.0): Invalid ISP SCSI completion handle(282) req=1
WARNING: qlnativefc: vmhba2(12:0.0): Invalid ISP SCSI completion handle(284) req=1
Below are additional qlnativefc driver messages that have been observed around the time of the PSOD:
qlnativefc: vmhba3(12:0.1): Inside qlnativefcAbortIsp
qlnativefc: vmhba3(12:0.1): Performing ISP error recovery - ha= 0x4308e6398b50.
qlnativefc: vmhba2(12:0.0): Inconsistent NVRAM detected: checksum=0x8c54d9ed id=^@ version=0x0.
qlnativefc: vmhba2(12:0.0): Falling back to functioning (yet invalid -- WWPN) defaults.
qlnativefc: vmhba2(12:0.0): 83XX: F/W Error Reported: Check if reset required.
qlnativefc: vmhba2(12:0.0): Heartbeat Failure encountered, chip reset required.
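As a quick triage step, the indicator messages above can be searched for across the live and rotated vmkernel logs. A sketch, assuming the default ESXi log locations:

# Count occurrences of the primary indicator in the current vmkernel log
grep -c "Invalid ISP SCSI completion handle" /var/log/vmkernel.log

# Include rotated logs in case the host has already rebooted after the PSOD
zcat /var/run/log/vmkernel.*.gz 2>/dev/null | grep -E "Invalid ISP SCSI completion handle|Heartbeat Failure|Inconsistent NVRAM"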
When VMFS corruption occurs as a result, the following messages may be observed immediately before the PSOD:
WARNING: Fil3: 7920: Found invalid object on 5c80ba44-31b46bc0-d894-xxxxxxxxxxxx <FD c49 r5> expected <FD c21 r107>
Other datastore-related messages may be observed after reboot, depending on whether the corruption prevents the VMFS volume from mounting.
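If corruption is suspected, VMFS metadata can be checked with VOMA once the datastore is quiesced (no running VMs, ideally unmounted). A sketch; the naa device path below is a placeholder for the extent backing the affected datastore:

# Identify the device and partition backing the suspect datastore
esxcli storage vmfs extent list

# Run a read-only metadata check against that extent (placeholder device shown)
voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1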
The PSOD backtrace has varied across environments; the following three examples are the most commonly observed:
Example 1
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b3d0:[0x41801e8edaf1]PanicvPanicInt@vmkernel#nover+0x545 stack: 0x41801e8edaf1, 0x0, 0x4391ae49b478, 0x0, 0x1f48f6cc0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b470:[0x41801e8edb7d]Panic_NoSave@vmkernel#nover+0x4d stack: 0x4391ae49b4d0, 0x4391ae49b490, 0x2000418000000000, 0x41801ec3e2f1, 0x1280
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b4d0:[0x41801e9466a5]DLM_malloc@vmkernel#nover+0x1469 stack: 0x1, 0x18, 0x8, 0x41801eb22010, 0x18
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b520:[0x41801e943a8c]Heap_AlignWithTimeoutAndRA@vmkernel#nover+0xc4 stack: 0x800000000, 0x0, 0x18, 0x4306f4695030, 0x0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b5a0:[0x41801eb21f57]SCSIVsiDeviceWorldListIsDup@vmkernel#nover+0x4f stack: 0x0, 0x41801eb22010, 0x46, 0x4306f48f6cc0, 0x4391ae49b5d0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b5c0:[0x41801eb22010]SCSIVsiDeviceWorldList@vmkernel#nover+0x80 stack: 0x4391ae49b5d0, 0x4391ae49b5d0, 0x0, 0x0, 0x5ca
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b600:[0x41801e80229f]VSI_GetListInfo@vmkernel#nover+0x253 stack: 0x4391ae49b6b0, 0x4391ae49b740, 0x417fde843242, 0xeb19a8, 0x8d
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49b680:[0x41801ef1ec03]UWVMKSyscallUnpackVSI_GetList@(user)#<None>+0x24f stack: 0x4391ae49bf30, 0xd99d7b78a037a0d4, 0x8de47592fb, 0x2c000eb5ca0, 0x300000001
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49bef0:[0x41801ef124c0]User_UWVMKSyscallHandler@(user)#<None>+0xa4 stack: 0x0, 0x0, 0x0, 0x41801e910025, 0x0
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49bf20:[0x41801e910025]User_UWVMKSyscallHandler@vmkernel#nover+0x1d stack: 0x0, 0x13b, 0x0, 0x4ad, 0x5ca
2019-11-07T22:22:13.500Z cpu6:67017)0x4391ae49bf30:[0x41801e93d067]gate_entry_@vmkernel#nover+0x0 stack: 0x0, 0x4ad, 0x5ca, 0x0, 0x2896448
2019-11-07T22:22:13.524Z cpu6:67017)VMware ESXi 6.5.0 [Releasebuild-13004031 x86_64]
PANIC bora/vmkernel/main/dlmalloc.c:4736 - Corruption in dlmalloc
Example 2
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bd30:[0x418022545702]DLM_malloc@vmkernel#nover+0x4c6 stack: 0x4306748bd580, 0xff, 0x8, 0x418022734afd, 0xff
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bd80:[0x418022543a8c]Heap_AlignWithTimeoutAndRA@vmkernel#nover+0xc4 stack: 0x888902a80, 0x418000000000, 0xff, 0x430674618030, 0x430674810ed0
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79be00:[0x4180227326de]SCSI_VaaiCacheUpdate@vmkernel#nover+0x372 stack: 0x43914a79be60, 0x417fe277f010, 0x0, 0x418022734eb7, 0xf000f0300000000
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bf30:[0x418022734afd]SCSIDeviceReclaimByFilters@vmkernel#nover+0x21 stack: 0x4306746a9d00, 0x4180224ca3ed, 0x430198ebb050, 0x25, 0x430198ebb050
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bf50:[0x4180224ca3ed]helpFunc@vmkernel#nover+0x3c5 stack: 0x430198ebb050, 0x0, 0x0, 0x0, 0x0
2019-10-07T09:32:34.245Z cpu17:65871)0x43914a79bfe0:[0x4180226cb675]CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2019-10-07T09:32:34.269Z cpu17:65871)VMware ESXi 6.5.0 [Releasebuild-11925212 x86_64]
#GP Exception 13 in world 65871:SCSI periodi @ 0x418022545702
Example 3
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9ba30:[0x418032553fdf]DLM_free@vmkernel#nover+0x323 stack: 0x430c7f9e3790, 0x418032551501, 0x430c7fa800a0, 0x57ab5420a5c18, 0x1ff
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9ba50:[0x418032551500]Heap_Free@vmkernel#nover+0x115 stack: 0x1ff, 0x430c7f9e3790, 0x1, 0x0, 0x92a000
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9baa0:[0x41803341b24c]FSAts_Lock@esx#nover+0xf1 stack: 0x0, 0x417fd26044c0, 0x430c00000000, 0x200000000001, 0x4
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bb20:[0x41803340e92f]FS3_DiskBufferLock@esx#nover+0xa0 stack: 0x0, 0x92a000, 0x0, 0x430c7f94c380, 0x430c7f94c380
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bb80:[0x41803340eb47]FS3_DiskLockWithPrefetch@esx#nover+0x150 stack: 0x0, 0x430c7f9cc5e0, 0x4180334319e0, 0x418032515653, 0x1848920
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bc60:[0x4180334adb4e]Res3LockClusterVMFS6@esx#nover+0x9f stack: 0x0, 0x0, 0x430c7e1dbff0, 0x430c7f94a370, 0x0
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bcd0:[0x4180334bbc4b]Res6OnDiskLockRC@esx#nover+0x1d0 stack: 0x430c7f94a3a8, 0x100000000000000, 0x7d000000005, 0x5, 0x430c7fa63f60
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bd50:[0x4180334c9ecb]UnmapLockCommon@esx#nover+0x4d4 stack: 0x430c7f7d0b50, 0x5, 0x2ff87396e329, 0xffff00000000, 0x0
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9be30:[0x4180334cc165]UnmapAsyncClusterProcessing@esx#nover+0x2ce stack: 0x4519c00f0000, 0x41803251c74a, 0x0, 0x837bea000001cc01, 0x0
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bf30:[0x4180324eb1f2]HelperQueueFunc@vmkernel#nover+0x30f stack: 0x430c7e1e2e28, 0x430c7e1e2e18, 0x430c7e1e2e50, 0x451a18ca3000, 0x451a0f89bf60
2019-09-14T03:36:03.898Z cpu2:2456914)0x451a18c9bfe0:[0x41803270e322]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2019-09-14T03:36:03.922Z cpu2:2456914)VMware ESXi 6.7.0 [Releasebuild-13981272 x86_64]
#GP Exception 13 in world 2456914:Unmap Helper @ 0x418032553fdf
This issue is observed on both ESXi 6.5 and ESXi 6.7. It is a card firmware issue, not a driver version issue.
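Because the fault lies in the adapter firmware, the running firmware version is the key data point to collect, while the driver VIB version mainly serves to rule the driver out. One way to gather both on ESXi is sketched below; it is assumed that the vmkmgmt_keyval dump includes the qlnativefc firmware details, and the output format can vary by driver release:

# Dump driver key-value pairs; qlnativefc entries report the adapter firmware version
/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a

# Record the installed qlnativefc driver version for reference
esxcli software vib list | grep qlnativefc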
This issue has been observed mostly on HPE ProLiant DL360, DL380, and DL580 servers and their Hitachi-rebranded equivalents (HA8000V).
Note: The preceding log excerpts are examples only. Dates, times, and environment-specific values will vary depending on your environment.