ESXi crashes with PSOD due to LUN corruption

Article ID: 396448

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

The ESXi host repeatedly boots into a purple diagnostic screen (PSOD).

Environment

VMware vSphere ESXi 6.5.x
VMware vSphere ESXi 6.7.x
VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

ESXi can encounter a PSOD if there is underlying corruption on a VMFS volume.

The PSOD backtrace may show the following entries:

0x123a1a45b678:[0x1234567a89b1c]Res6NewLinkCacheEntry@esx#nover+0x1b6 stack: 0x123cbf79c68
0x123a1a45b678:[0x1234567a89b1c]Res6_OnDiskLockRC@esx#nover+0x3b7 stack: 0x7d01
0x123a1a45b678:[0x1234567a89b1c]Res6OnDiskLockRC@esx#nover+0x49 stack: 0x123ca787120
0x123a1a45b678:[0x1234567a89b1c]Res3PopulateClusterCacheFromAddrVecVMFS6@esx#nover+0x18d stack: 0x1235cbf63e00
0x123a1a45b678:[0x1234567a89b1c]Res3TraverseAddrsVMFS6@esx#nover+0x14c stack: 0x5cbf58800
0x123a1a45b678:[0x1234567a89b1c]Res3FreeVMFS6@esx#nover+0xe8 stack: 0x0
0x123a1a45b678:[0x1234567a89b1c]Fil3DeallocateFileTxnVMFS6@esx#nover+0x52d stack: 0x1235cbf62e00
0x123a1a45b678:[0x1234567a89b1c]Fil3RemoveHelperVMFS6@esx#nover+0xda stack: 0x1235ca7887b0
0x123a1a45b678:[0x1234567a89b1c]Fil3RemoveVMFS6@esx#nover+0x1c09 stack: 0x1235cbf58800
0x123a1a45b678:[0x1234567a89b1c]Fil3_Unlink@esx#nover+0x114 stack: 0x123c8601fea8
0x123a1a45b678:[0x1234567a89b1c]FSSVec_Unlink@vmkernel#nover+0x20 stack: 0x123a2739bdc0
0x123a1a45b678:[0x1234567a89b1c]FSS_Unlink@vmkernel#nover+0xaa stack: 0x0
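
When comparing a captured backtrace against the one above, it can help to pull out just the function names. The following is a minimal Python sketch, not a VMware tool. The frame regular expression is inferred only from the sample frames in this article, and the prefix list (Res3, Res6, Fil3, FSS) is an assumption based on the same sample, so treat a match as a hint rather than a diagnosis.

```python
import re

# Frame layout inferred from the sample backtrace in this article, e.g.
# 0x...:[0x...]Fil3_Unlink@esx#nover+0x114 stack: 0x...
FRAME_RE = re.compile(
    r"^0x[0-9a-f]+:\[0x[0-9a-f]+\]"
    r"(?P<func>\w+)@(?P<module>\w+)#\w+\+0x[0-9a-f]+"
)

# Assumption: prefixes seen in the sample trace (VMFS resource and file code paths).
VMFS_PREFIXES = ("Res3", "Res6", "Fil3", "FSS")


def parse_frames(text):
    """Return (function, module) pairs for every line that looks like a frame."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("func"), match.group("module")))
    return frames


def vmfs_frames(frames):
    """Return the function names that start with one of the assumed VMFS prefixes."""
    return [func for func, _ in frames if func.startswith(VMFS_PREFIXES)]
```

Passing the text of a backtrace to parse_frames and then to vmfs_frames gives a short list of names that can be compared by eye with the entries listed above.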

/var/log/vmkernel.log may show entries such as the following:

[YYYY-MM-DDTHH:MM:SS] cpu51:1234567)qedf:vmhba65:qedfc_scsi_completion:1808:Error: dropped Frame xid[0x5dd] lba=0x0 lbc=0x0 cmd a3:a:0:0:0 data returned 32 required data 4 fw_resid 3032
[YYYY-MM-DDTHH:MM:SS] cpu26:1234567)NMP: nmp_ThrottleLogForDevice:3867: Cmd 0xa3 (0x45ca504bd288, 0) to dev "naa.12345680000198765432101234567890" on path "vmhba65:C0:T4:L17" Failed:
[YYYY-MM-DDTHH:MM:SS] cpu26:1234567)NMP: nmp_ThrottleLogForDevice:3875: H:0x2 D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x453a1299bb98 CmdSN 0x0
[YYYY-MM-DDTHH:MM:SS] cpu34:1234567)WARNING: ScsiScan: 456: Path 'vmhba65:C0:T4:L17': Possible LUN change?changed from supporting to not supporting VPD Serial ID page
[YYYY-MM-DDTHH:MM:SS] cpu34:1234567)ALERT: NMP: vmk_NmpVerifyPathUID:1344: UID of a device (path vmhba65:C0:T4:L17) has changed from naa.12345680000198765432101234567890 to . Critical error if data LUN.
[YYYY-MM-DDTHH:MM:SS] cpu65:1234567)qedf:vmhba65:qedfc_scsi_completion:1808:Error: dropped Frame xid[0x1d2] lba=0x0 lbc=0x0 cmd a3:a:0:0:0 data returned 15 required data 4 fw_resid 3049
[YYYY-MM-DDTHH:MM:SS] cpu88:1234567)WARNING: Res3: 7066: Volume 12345678-ab8af5e8-8c92-12345678 ("Datastore_Name") might be damaged on the disk. Resource cluster metadata corruption has been detected.
[YYYY-MM-DDTHH:MM:SS] cpu88:1234567)WARNING: FS3: 608: VMFS volume Datastore_Name/12345678-ab8af5e8-8c92-12345678 on naa.12345680000198765432101234567890:1 has been detected corrupted
[YYYY-MM-DDTHH:MM:SS] cpu88:1234567)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
[YYYY-MM-DDTHH:MM:SS] cpu88:1234567)FS3: 634: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.12345680000198765432101234567890:1 -D X`
[YYYY-MM-DDTHH:MM:SS] cpu88:1234567)FS3: 641: where X is the dump file name on a DIFFERENT volume
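
To find which datastore and device these messages refer to in a long log, the relevant lines can be filtered out of a copy of vmkernel.log. The sketch below is illustrative only: the three regular expressions are derived from the sample messages above, the labels such as "volume_corrupted" are hypothetical names chosen here, and message wording may differ between ESXi builds.

```python
import re

# Patterns derived from the sample vmkernel.log lines in this article.
# The "kind" labels are hypothetical names used only by this sketch.
PATTERNS = [
    ("uid_changed", re.compile(
        r"UID of a device \(path (?P<path>[^)]+)\) has changed from (?P<device>\S+) to")),
    ("metadata_corruption", re.compile(
        r'Res3: \d+: Volume (?P<uuid>\S+) \("(?P<name>[^"]+)"\) might be damaged on the disk')),
    ("volume_corrupted", re.compile(
        r"FS3: \d+: VMFS volume (?P<name>[^/]+)/(?P<uuid>\S+) on (?P<device>\S+) "
        r"has been detected corrupted")),
]


def scan(lines):
    """Return one dict per log line that matches a known corruption message."""
    findings = []
    for line in lines:
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                findings.append({"kind": kind, **match.groupdict()})
                break
    return findings


def scan_file(path):
    """Scan a copy of vmkernel.log; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan(handle)
```

Each finding carries the datastore name, volume UUID, or device identifier captured from the line, which is the information the FS3 message asks to be reported along with the list of hosts attached to the LUN.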

Resolution

To resolve the issue, do one of the following:

Option 1: Attempt a VOMA fix. Contact Broadcom Support for assistance; see "Creating and managing Broadcom support cases".

Option 2: Unpresent the affected LUN from the hosts at the storage array. Once the errors have stopped, rescan storage on every host that had access to the LUN.

Option 3: Engage your storage vendor to investigate the cause of the corruption.
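
The commands behind these options can be assembled ahead of time so that the device path is typed only once. The Python sketch below only builds command strings and does not execute anything. The helper names are hypothetical. The dump form is the one printed in the vmkernel.log message above, while the read-only "check" mode and the esxcli rescan are standard commands that should be confirmed against the documentation for your ESXi build. VOMA is generally run against a datastore with no running virtual machines, and a repair should be attempted only under the guidance of Broadcom Support.

```python
import shlex

DISKS_DIR = "/vmfs/devices/disks"


def voma_check_command(device_partition):
    """Read-only VOMA metadata check for a device partition such as 'naa.xxx:1'."""
    return ["voma", "-m", "vmfs", "-f", "check",
            "-d", f"{DISKS_DIR}/{device_partition}"]


def voma_dump_command(device_partition, dump_file):
    """Metadata dump, in the form suggested by the FS3 log message.

    dump_file must be on a DIFFERENT volume than the one being dumped.
    """
    return ["voma", "-m", "vmfs", "-f", "dump",
            "-d", f"{DISKS_DIR}/{device_partition}", "-D", dump_file]


def rescan_command():
    """Rescan all storage adapters on a host after the LUN is unpresented."""
    return ["esxcli", "storage", "core", "adapter", "rescan", "--all"]


def as_shell(argv):
    """Quote an argument list so it can be pasted into the ESXi shell."""
    return shlex.join(argv)
```

For example, as_shell(voma_check_command("naa.12345680000198765432101234567890:1")) yields a single line to paste into an SSH session on the host, and as_shell(rescan_command()) yields the rescan line to run on each host that had access to the LUN.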

Additional Information