The purpose of this article is to describe a change in the automatic mounting behavior for heavily corrupted VMFS6 datastores in ESXi 8.0 and later.
In ESXi versions earlier than ESXi 8.0, all datastores are mounted automatically during system boot. If the resource cluster metadata of a datastore is severely damaged, the ESXi host might crash while loading the metadata, leaving the host in an unusable state. Although this issue can affect both VMFS5 and VMFS6 datastores, the mounting behavior change applies only to VMFS6 datastores.
With this change, heavily corrupted VMFS6 datastores are not mounted automatically during the boot process. This allows the ESXi host to complete booting and remain available for maintenance or for accessing the remaining healthy datastores. The change aims to improve system availability and helps customers when severe corruption occurs for any reason.
2022-06-21T19:20:41.664Z cpu9:1001390796)WARNING: Res3: StatVMFS6:7090: Volume 5fac7e74-b081eb40-a85d-246e96689e22 ("datastore5") might be damaged on the disk. Resource cluster metadata corruption has been detected.
2022-06-21T19:20:41.664Z cpu9:1001390796)WARNING: FS3: LogCorruption:628: VMFS volume datastore5/5fac7e74-b081eb40-a85d-246e96689e22 on naa.55cd2e414e0855f6:1 has been detected corrupted
2022-06-21T19:20:41.664Z cpu9:1001390796)FS3: LogCorruption:631: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
2022-06-21T19:20:41.664Z cpu9:1001390796)FS3: LogCorruption:654: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.55cd2e414e0855f6:1 -D X`
2022-06-21T19:20:41.664Z cpu9:1001390796)FS3: LogCorruption:657: where X is the dump file name on a DIFFERENT volume
2022-06-21T19:20:41.665Z cpu9:1001390796)FS3: ReportCorruption:703: On disk volume descriptor verified
0x45391639a940:[0x420018995265]PanicvPanicInt@vmkernel#nover+0x8d stack: 0x0
0x45391639a9d0:[0x4200189951d5]Panic_NoSave@vmkernel#nover+0x56 stack: 0x45391639aa30
0x45391639aa30:[0x4200189960b0]Panic_OnAssertAt@vmkernel#nover+0xbd stack: 0x420019499f0c
0x45391639aab0:[0x420018a15f87]Int6_UD2Assert@vmkernel#nover+0x6c stack: 0x0
0x45391639aad0:[0x420018a0c076]gate_entry@vmkernel#nover+0x77 stack: 0x8001003d
0x45391639ab90:[0x42001a609811]Res3StatVMFS6@esx#nover+0x187d stack: 0xcaaf1a0000
0x45391639aea0:[0x42001a5c4174]Vol3ExtendJournalVMFS6@esx#nover+0xf9 stack: 0x3
0x45391639af30:[0x42001a5c4585]Vol3JournalExtendManager@esx#nover+0x72 stack: 0x2f
0x45391639af50:[0x4200189c7d47]vmkWorldFunc@vmkernel#nover+0x64 stack: 0x4200189c7d42
0x45391639afb0:[0x420018e9ee59]CpuSched_StartWorld@vmkernel#nover+0xfe stack: 0x45390c09f000
0x45391639afe0:[0x420018f0ef00]CpuSched_ArchStartWorld@vmkernel#nover+0x21 stack: 0x0
0x45391639b000:[0x42001893c3fb]Debug_IsInitialized@vmkernel#nover+0x24 stack: 0x0
lastClrIntrRA:0x4200189a7698
2022-06-21T19:20:42.797Z cpu9:1001390796)base fs=0x0 gs=0x420042400000 Kgs=0x0
Note: The above stack is shown only as an example. The issue is not restricted to this stack.
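As a rough illustration of how the corruption warning in the log excerpt above could be located automatically, the following Python sketch scans log lines for the "has been detected corrupted" message and extracts the datastore label, volume UUID, and device. The function name, the tuple type, and the regular expression are hypothetical and are derived only from the single sample message shown in this article; other ESXi builds may word the message differently.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class CorruptVolume(NamedTuple):
    """One corrupted VMFS volume reported in the log (hypothetical helper type)."""
    label: str
    uuid: str
    device: str


# Pattern modeled on the sample "FS3: LogCorruption" line in this article.
# Assumption: the message format is the same on the host being examined.
_CORRUPT_RE = re.compile(
    r"VMFS volume (?P<label>.+)/"
    r"(?P<uuid>[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12})"
    r" on (?P<device>\S+) has been detected corrupted"
)


def find_corrupt_volumes(lines: Iterable[str]) -> Iterator[CorruptVolume]:
    """Yield each distinct corrupted volume mentioned in the given log lines."""
    seen = set()
    for line in lines:
        match = _CORRUPT_RE.search(line)
        if match is None:
            continue
        volume = CorruptVolume(match["label"], match["uuid"], match["device"])
        if volume not in seen:  # the same warning may repeat across boots
            seen.add(volume)
            yield volume
```

The function accepts any iterable of strings, so it can be fed an open log file object or lines copied from a support bundle. Which log file contains the message on a given host is not stated in this article and would need to be confirmed separately.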
VMFS metadata corruption can happen for various reasons. The goal is to stop mounting a heavily corrupted VMFS datastore so that the ESXi host does not enter a reboot loop.
The change is integrated in ESXi 8.0.
Depending on the nature of the corruption, the VOMA tool might be able to fix it so that the datastore can be mounted again.
Please contact the VMware support team to perform a metadata check. For information on how to use the VOMA tool, refer to the KB article: Using vSphere On-disk Metadata Analyzer.
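For orientation only, the sketch below assembles a VOMA command line for a given device, following the shape of the `voma -m vmfs -f dump -d ... -D X` command printed in the log excerpt. The helper names are hypothetical, and the `check` function value is an assumption based on common VOMA usage rather than on anything stated in this article; confirm the exact options against the VOMA KB article or with VMware support before running anything against a datastore.

```python
import shlex
from typing import List, Optional


def voma_command(device: str, function: str = "check",
                 dump_file: Optional[str] = None) -> List[str]:
    """Build an argument list for a VOMA invocation (illustrative helper only).

    device    -- device and partition, e.g. "naa.55cd2e414e0855f6:1"
    function  -- "check" (assumed read-only metadata check) or "dump"
                 (the form shown in the log message in this article)
    dump_file -- required for "dump"; per the log message, it must be
                 located on a DIFFERENT volume than the one being dumped
    """
    if function not in ("check", "dump"):
        raise ValueError("this sketch only covers 'check' and 'dump'")
    if function == "dump" and not dump_file:
        raise ValueError("'dump' needs a dump file on a different volume")

    args = ["voma", "-m", "vmfs", "-f", function,
            "-d", "/vmfs/devices/disks/" + device]
    if function == "dump":
        args += ["-D", dump_file]
    return args


def as_shell_line(args: List[str]) -> str:
    """Render the argument list as a single, safely quoted shell command."""
    return " ".join(shlex.quote(arg) for arg in args)
```

The sketch only builds the command text and does not execute it, so the result can be reviewed with support before it is run on the host.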
Note: Metadata inconsistencies in the VMFS volume might or might not be fixable, depending on the nature and extent of the corruption. This varies from case to case, and the customer may decide to restore the datastore from backup.
In such a scenario, the storage device that hosts the datastore with heavily corrupted resource cluster metadata might need to be physically detached to allow the system to boot for maintenance.