After upgrading to ESXi 7.0U2, corruption can occur on VMFS datastores if the ESXi hosts sharing those LUNs had their boot devices cloned

Article ID: 318630

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
In /var/log/vmkernel.log, the following events are observed:

[YYYY-MM-DDTHH:MM:SS] cpu2:2100630) [HB state abcdef04 offset 3444736 gen 145 stampUS 807518598 uuid 60bf9bed-3092e266-2e35-0025b5920a03 jrnl <FB 6210136> drv 14.81 lockImpl 4 ip 10.8.68.207]
[YYYY-MM-DDTHH:MM:SS] cpu2:2100630)FS3J: 4381: Replaying journal at <type 1 addr 6210136>, gen 145
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)HBX: 4720: 3 stale HB slot(s) owned by me have been garbage collected on vol 'DatastoreA'
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)WARNING: FS3: 608: VMFS volume DatastoreA/59aa5959-aabbccdd-5959-59aa5959aa59 on naa.60060060660########60606:1 has been detected corrupted
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 638: and upload the dump by `dd if=/vmfs/devices/disks/naa.xxxxxxxxxxxxxxx:1 of=X bs=1M count=1200 conv=notrunc`
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 641: where X is the dump file name on a DIFFERENT volume
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 319: FS3RCMeta 3881 200 1 67 0
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 326: 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 332: 0 0 0
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)FS3: 346: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[YYYY-MM-DDTHH:MM:SS] cpu18:2100630)WARNING: FS3J: 2240: Error freeing journal block <FBA tbz 0 cow 0 blk 776267> (returned 0) for 59de4594-ceca2bd1-1832-0025b5920a02: Invalid metadata
2021-06-08T16:35:36.802Z cpu18:2100630)WARNING: HBX: 3820: Cannot free journal <type 1 addr 6210136> on vol 'DatastoreA'
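To check a host for this symptom, the corruption warning can be searched for directly. The following is a minimal sketch, assuming the message wording matches the excerpt above; the function name find_vmfs_corruption is hypothetical and only illustrative.

```shell
# Print every vmkernel log line that reports a VMFS volume as corrupted.
# Usage: find_vmfs_corruption [logfile]
# The log path defaults to /var/log/vmkernel.log, as named in this article.
find_vmfs_corruption() {
    logfile="${1:-/var/log/vmkernel.log}"
    # Pattern is taken from the "FS3: 608" warning shown in the excerpt;
    # the line number after "FS3:" is matched loosely in case it differs.
    grep -E 'WARNING: FS3: [0-9]+: VMFS volume .* has been detected corrupted' "$logfile"
}
```

Running find_vmfs_corruption with no argument on each host that shares the LUN prints any matching lines. The function returns a non-zero status when nothing matches.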

Environment

VMware vSphere ESXi 7.0.x

Cause

When an ESXi boot device is cloned, the System Universally Unique Identifier (UUID) is cloned with it. VMFS uses this identifier for heartbeat and journal operations, so if multiple hosts share the same UUID, a split-brain situation can result in which the ESXi hosts attempt to access each other's metadata regions on VMFS.

The most common form of cloned ESXi boot devices is cloned boot LUNs for rapid deployments.
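One way to check whether an environment is exposed is to compare the System UUID of every host that shares a LUN. The following is a minimal sketch, assuming the UUIDs have already been collected (for example with `esxcli system uuid get` on each host) into a text file with one "hostname uuid" pair per line. That inventory format and the function name find_duplicate_uuids are assumptions made for illustration, not part of any VMware tooling.

```shell
# find_duplicate_uuids FILE
# FILE is a hypothetical inventory: one "hostname system-uuid" pair per line.
# Prints each UUID that is used by more than one host, followed by those hosts.
find_duplicate_uuids() {
    awk '
        NF >= 2 {
            # Append this host to the list kept for its UUID.
            if ($2 in hosts) hosts[$2] = hosts[$2] " " $1
            else             hosts[$2] = $1
            count[$2]++
        }
        END {
            for (u in count)
                if (count[u] > 1) print u ": " hosts[u]
        }
    ' "$1"
}
```

Any UUID printed by find_duplicate_uuids identifies a group of hosts that were most likely deployed from the same cloned boot device.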

Resolution

Cloning ESXi boot devices is not supported. Although cloning may have worked in earlier versions of ESXi, from ESXi 7.0 U2 onward there are additional dependencies on the System UUID being unique. See Statement about supportability of cloning ESXi boot devices for deployments, https://knowledge.broadcom.com/external/article?legacyId=84280

Workaround:
If ESXi hosts in the environment have cloned boot devices, there is a four-step process to change the System UUID on each server so that it is unique. This process works only on hosts that have not yet been upgraded to ESXi 7.0 U2. If hosts have already been upgraded to 7.0 U2, the only supported solution is to rebuild those hosts.

Note: This will not work on the original host with the correct MAC address in the UUID.

1. There is an advanced ESXi setting called FollowHardwareMac that automatically updates the VMkernel's MAC address whenever the network adapter's MAC address changes. To enable it, run the following ESXCLI command:

$ esxcli system settings advanced set -o /Net/FollowHardwareMac -i 1

2. Next, delete the existing System UUID entry in /etc/vmware/esx.conf. This ensures that a new System UUID is generated when the system boots. To do so, open esx.conf, delete the entire /system/uuid line, and save the file.

3. To ensure that this change persists, run the following command:

$ /sbin/auto-backup.sh

4. Reboot the ESXi host to generate the new System UUID. Verify that the System UUID has actually changed from the original.
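Step 2 can also be done without an interactive editor. The following is a minimal sketch, assuming the entry is a single line beginning with /system/uuid, as the step describes; the function name strip_system_uuid and the .bak backup copy are illustrative additions, not part of the documented procedure.

```shell
# strip_system_uuid [CONF]
# Copies CONF to CONF.bak, then rewrites CONF without any line that
# begins with /system/uuid. CONF defaults to /etc/vmware/esx.conf.
strip_system_uuid() {
    conf="${1:-/etc/vmware/esx.conf}"
    # Keep a copy so the original file can be restored if needed.
    cp "$conf" "$conf.bak" || return 1
    # Delete only the /system/uuid entry; all other settings are preserved.
    sed '/^\/system\/uuid/d' "$conf.bak" > "$conf"
}
```

After running strip_system_uuid, continue with steps 3 and 4 as written. Recording the output of `esxcli system uuid get` before the change and again after the reboot is one way to confirm that the System UUID has changed.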

Note: All datastores affected by corruption must be reformatted to clear the corruption. Do this only AFTER changing the UUID on ALL ESXi hosts; otherwise, the corruption will continue.
