VMFS extent offline, causing VMs and hostd to go unresponsive

Article ID: 323044


Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

One or more of the following circumstances may apply to your situation:

 

  • I/O to Virtual Disk files (= VMDK) residing on the affected Datastore/Disk might slow down or fail. This can cause Virtual Machines that have VMDKs residing on that device to become unresponsive or even fail.

  • In the vSphere Client (= vCenter), the related Datastore displays 0 B capacity and 0 B free space.
  • The hostd process on one or more ESXi Host(s) may become unresponsive, resulting in the ESXi Host showing as "not responding" in the vSphere Client.
    This problem does not occur when a non-head extent of the spanned VMFS Datastore fails along with the head extent. In that case, the entire Datastore becomes inaccessible and no longer allows I/O.
    In contrast, when only a non-head extent fails but the head extent remains accessible, the Datastore heartbeat appears normal and I/O between the Host and the Datastore continues. However, any I/O that depends on the failed non-head extent starts failing. Other I/O transactions might accumulate while waiting for the failing I/O to complete, causing the Host to enter the not-responding state (Reference).

  • If the affected VMFS Datastore was expanded by adding an extent and that extent is offline, VMFS expansion will fail.

  • A Storage Adapter rescan fails with the error "An error occurred while communicating with remote host".

 

  • /var/run/log/vmkwarning.log shows the following:

YYYY-MM-DDTHH:MM.SSSZ Wa(180) vmkwarning: cpu##:###### opID=#####)WARNING: LVM: 17711: An attached device went offline. ##############:1 file system [VOLUME-NAME, VOLUME-UUID]

 

  • /var/run/log/vobd.log shows the following:

YYYY-MM-DDTHH:MM.SSSZ In(14) vobd[2098027]:  [vmfsCorrelator] 13900876943838us: [vob.vmfs.extent.offline] An attached device went offline. ##############:1 file system [VOLUME-NAME, VOLUME-UUID]
YYYY-MM-DDTHH:MM.SSSZ In(14) vobd[2098027]:  [vmfsCorrelator] 13900800936747us: [esx.problem.vmfs.extent.offline] An attached device ##############:1 may be offline. The file system [VOLUME-NAME, VOLUME-UUID] is now in a degraded state. While the datastore is still available, parts of data that reside on the extent that went offline might be inaccessible.

  • Various Host logs show messages referring to "Address temporarily unmapped". To search for these messages, see the illustrative commands after the examples below.

Examples:

YYYY-MM-DDTHH:MM.SSSZ In(182) vmkernel: cpu47:2812483)Fil6: 4289: 'DATASTORE-NAME': Fil6 file IO (<FD c52 r1>) : Address temporarily unmapped

YYYY-MM-DDTHH:MM.SSSZ Db(167) Hostd[2099105]: --> Failed to copy source (/vmfs/volumes/##############/##########/########.vmdk) to destination (/vmfs/volumes/##############/##########/########.vmdk): Address temporarily unmapped.

YYYY-MM-DDTHH:MM.SSSZ Wa(180) vmkwarning: cpu21:3221033)WARNING: SVM: 2891: scsi##### Failed SVMFDSIoctlMoveData: Address temporarily unmapped
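
To confirm these symptoms on a Host, the log files can be searched from the ESXi Shell. The following commands are an illustrative sketch (log file names as referenced in this article; adjust the search strings as needed):

grep -i "extent.offline" /var/run/log/vobd.log
grep -i "went offline" /var/run/log/vmkwarning.log
grep -i "Address temporarily unmapped" /var/run/log/vmkernel.log /var/run/log/hostd.log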

 

Environment

ESXi 8.x
ESXi 7.x

Cause

One or more Storage devices backing the affected Datastore might be offline or not functioning properly.

The esx.problem.vmfs.extent.offline message is received when an ESXi Host loses connection to a Storage device that backs a VMFS Datastore or any of its Extents.

This loss of connection can happen when a switch or cable connecting the device to the ESXi Host is disconnected, or when the device is reformatted for use by another Volume.
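
To check connectivity from the ESXi Host's perspective, the paths to a suspect device can be listed. This is an illustrative check using a standard esxcli command; replace ################# with a device identifier obtained in step 1 of the Resolution below:

esxcli storage core path list -d #################

Paths reported in the dead state indicate a connectivity problem between the Host and the device.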

Resolution

Identify the Storage device(s) backing the affected VMFS Datastore and restore connectivity to these Storage device(s) by following the steps below:
 
Note:
If the Storage device(s) has been reformatted and reassigned to another Volume, the corresponding portion of the original Volume will be permanently lost and cannot be recovered.

 

1.) Run the following command to determine which Storage devices back the affected Datastore (= VMFS volume):

vmkfstools -Ph /vmfs/volumes/<datastore>

 
Example: 
 
[root@xxxxx:~] vmkfstools -Ph /vmfs/volumes/<datastore>
VMFS-6.82 (Raw Major Version: 24) file system spanning 4 partitions.
File system label (if any): datastore
Mode: public
Capacity 399.8 GB, 156.2 GB available, file block size 1 MB, max supported file size 64 TB
Disk Block Size: 512/16384/0
UUID: ########-########-####-############
Partitions spanned (on "lvm"):
        #################:1 ----------> First Partition: Head extent.
        #################:1
        #################:1
        #################:1
Is Native Snapshot Capable: NO
 
 
Note: In this scenario, the Datastore is configured with multiple Extents. As a result, multiple Storage devices (= #################) appear under "Partitions spanned (on 'lvm')", indicating that multiple devices are backing this Datastore.
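
Alternatively, all VMFS extents and their backing devices can be listed in one step and matched against the affected Datastore by name or UUID. This is an illustrative alternative using a standard esxcli command:

esxcli storage vmfs extent list

The output maps each Volume Name and VMFS UUID to its Extent Number and backing Device Name.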

2.) Verify whether this is a local Datastore or a non-local one (= located on a SAN array) by running the following command:

esxcli storage core device list

Look in the output for the Storage device(s) (#################) listed under "Partitions spanned (on 'lvm')" in step 1.

Example:

#################:
   Display Name: Local Make Disk (#################)
   Has Settable Display Name: true
   Size: 3662830
   Device Type: Direct-Access
   Multipath Plugin: HPP
   Devfs Path: /vmfs/devices/disks/#################
   Vendor: "Vendor Name"
   Model: ######
   Revision: ######
   SCSI Level: 6
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: true

Check for "Is Local:" from the above output:
Is Local: true    -------> This is a local Storage device (= affected Datastore is local to the Host)
Is Local: false   -------> This is a Storage device which is not local to the Host. The Storage device is located on SAN (= SAN LUN / external to the Host). The affected Datastore is located on SAN.
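
As a shortcut, the device listing can be filtered to just the fields of interest. This is an illustrative sketch combining a standard esxcli command with grep:

esxcli storage core device list -d ################# | grep -E "Display Name|Status|Is Local"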
 
 

3.) If one or more of the Storage device(s) (#################) listed under "Partitions spanned (on 'lvm')" show Is Local: true, identify their physical location:

esxcli storage core device physical get -d #################
Physical Location: enclosure ### slot ###

 

4.) Check the health of the Storage device(s) (#################) listed under "Partitions spanned (on 'lvm')".
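
For local devices that expose SMART data, health attributes can be read from the ESXi Shell. This is an illustrative check (availability of SMART data depends on the device and driver):

esxcli storage core device smart get -d #################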

5.) Involve your hardware Storage Vendor to identify and fix any Storage device issues, or have them check the health of the Storage device(s).

6.) If the issue continues after resolving the underlying Storage problem, restart the affected ESXi Host(s).
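
One way to restart a Host from the ESXi Shell is sketched below. This is illustrative: the Host must be in maintenance mode, so migrate or power off its Virtual Machines first, and the reason string is only an example:

esxcli system maintenanceMode set -e true
esxcli system shutdown reboot -r "VMFS extent recovery"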