ESXI version 8.0, Prevent Auto-Mounting Datastores With Severely Corrupted Resource Clusters

Products

VMware vSphere ESXi

Issue/Introduction

The purpose of this article is to describe a change automatic mounting behavior for heavily corrupted VMFS-6 datastores post ESXi 8.0

In ESXi versions earlier than ESXi 8.0, all the datastores will be mounted automatically during system bootup. If the resource cluster meta data of one datastore is severely damaged, the ESXi server might crash while loading the meta data and the ESXi server might be left in unusable state. Though this issue could happen for both VMFS5 and VMFS6 datastores, the mounting behavior change only applies for VMFS6 datastores.

With this change, heavily corrupted VMFS6 datastores will not be mounted automatically during boot process, which makes ESXi server being able to complete booting and be available for maintenance or accessing remaining healthy datastores. This change aims at improving system availability and helps customer when severe corruption happens because of any reasons.

Symptoms:

Customer may find some VMFS datastores not getting auto-mounted
ESXi vmkernel log might show resource cluster meta data corruption messages as following :

YYYY-MM-DDT##:##:##.###Z cpu9:1001390796)WARNING: Res3: StatVMFS6:7090: Volume  __DATASTORE_UUID__("__DATASTORE_NAME__") might be damaged on the disk. Resource cluster metadata corruption has been detected.
YYYY-MM-DDT##:##:##.###Z cpu9:1001390796)WARNING: FS3: LogCorruption:628: VMFS volume __DATASTORE_NAME__/__DATASTORE_UUID__ on naa.XXXXXXXXXX:1 has been detected corrupted
YYYY-MM-DDT##:##:##.###Z cpu9:1001390796)FS3: LogCorruption:631: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
YYYY-MM-DDT##:##:##.###Z cpu9:1001390796)FS3: LogCorruption:654: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.XXXXXXXXXX:1 -D X`
YYYY-MM-DDT##:##:##.###Z cpu9:1001390796)FS3: LogCorruption:657: where X is the dump file name on a DIFFERENT volume
YYYY-MM-DDT##:##:##.###Z cpu9:1001390796)FS3: ReportCorruption:703: On disk volume descriptor verified

Prior to ESXi 8.0 the ESXi server might crash after logging the corruption. One example of the system panic stack is:

0x45391639a940:[0x420018995265]PanicvPanicInt@vmkernel#nover+0x8d stack: 0x0
0x45391639a9d0:[0x4200189951d5]Panic_NoSave@vmkernel#nover+0x56 stack: 0x45391639aa30
0x45391639aa30:[0x4200189960b0]Panic_OnAssertAt@vmkernel#nover+0xbd stack: 0x420019499f0c
0x45391639aab0:[0x420018a15f87]Int6_UD2Assert@vmkernel#nover+0x6c stack: 0x0
0x45391639aad0:[0x420018a0c076]gate_entry@vmkernel#nover+0x77 stack: 0x8001003d
0x45391639ab90:[0x42001a609811]Res3StatVMFS6@esx#nover+0x187d stack: 0xcaaf1a0000
0x45391639aea0:[0x42001a5c4174]Vol3ExtendJournalVMFS6@esx#nover+0xf9 stack: 0x3
0x45391639af30:[0x42001a5c4585]Vol3JournalExtendManager@esx#nover+0x72 stack: 0x2f
0x45391639af50:[0x4200189c7d47]vmkWorldFunc@vmkernel#nover+0x64 stack: 0x4200189c7d42
0x45391639afb0:[0x420018e9ee59]CpuSched_StartWorld@vmkernel#nover+0xfe stack: 0x45390c09f000
0x45391639afe0:[0x420018f0ef00]CpuSched_ArchStartWorld@vmkernel#nover+0x21 stack: 0x0
0x45391639b000:[0x42001893c3fb]Debug_IsInitialized@vmkernel#nover+0x24 stack: 0x0
lastClrIntrRA:0x4200189a7698 2022-06-21T19:20:42.797Z cpu9:1001390796)base fs=0x0 gs=0x420042400000 Kgs=0x0

Note: The above task is for the example. The issue is not restricted to the above stack.

Environment

VMware vSphere ESXi 8.0.0

Cause

VMFS meta data corruption can happen because of various reasons. The goal is to stop mounting a heavily corrupted VMFS datastore to avoid ESXi host getting into reboot loop.

Resolution

The change is integrated in ESXi 8.0

Based on the nature of corruption we can use VOMA tool to fix the corruption and mount the datastore again.
Please contact VMware support team to perform metadata check, refer to KB: Using vSphere On-disk Metadata Analyzer, on how to use VOMA tool.

Note: Metadata inconsistency in the VMFS volume might be fixed or not depending on the nature and extent of corruption. It depends on case to case basis and customer may decide to restore the datastore from backup.

Workaround:

In such scenario, the storage device which hosts the resource cluster heavily corrupted datastore might need to be detached physically to allow system bootup for maintenance.

Additional Information

Impact/Risks:

The change avoid mounting the corrupted datastore.
It helps ESXi host complete it's booting process.