VMFS extent offline, causing VMs and hostd to go unresponsive

Article ID: 323044


Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

One or more of the following circumstances may apply to your situation:

 

  • I/O to Virtual Disk files (= VMDK) residing on the affected Datastore/Disk might slow down or fail. This can cause Virtual Machines that have VMDKs residing on that device to become unresponsive or even fail.

  • In the vSphere Client (= vCenter), the related Datastore displays 0 B capacity and 0 B free space.
  • The hostd process on one or more ESXi Host(s) may become unresponsive, resulting in the ESXi Host showing as "not responding" in the vSphere Client.
    This problem does not occur when a non-head extent of the spanned VMFS Datastore fails along with the head extent. In that case, the entire Datastore becomes inaccessible and no longer allows I/O.
    In contrast, when only a non-head extent fails but the head extent remains accessible, the Datastore heartbeat appears normal and I/O between the Host and the Datastore continues. However, any I/O that depends on the failed non-head extent starts failing. Other I/O transactions might accumulate while waiting for the failing I/O to complete, causing the Host to enter the not-responding state (Reference).

  • If the affected VMFS Datastore was expanded by adding an extent and that extent is offline, VMFS expansion will fail.

  • A Storage Adapter rescan fails with the error "An error occurred while communicating with remote host".

 

  • /var/run/log/vmkwarning.log shows the following:

YYYY-MM-DDTHH:MM.SSSZ Wa(180) vmkwarning: cpu##:###### opID=#####)WARNING: LVM: 17711: An attached device went offline. ##############:1 file system [VOLUME-NAME, VOLUME-UUID]

 

  • /var/run/log/vobd.log shows the following:

YYYY-MM-DDTHH:MM.SSSZ In(14) vobd[2098027]:  [vmfsCorrelator] 13900876943838us: [vob.vmfs.extent.offline] An attached device went offline. ##############:1 file system [VOLUME-NAME, VOLUME-UUID]
YYYY-MM-DDTHH:MM.SSSZ In(14) vobd[2098027]:  [vmfsCorrelator] 13900800936747us: [esx.problem.vmfs.extent.offline] An attached device ##############:1 may be offline. The file system [VOLUME-NAME, VOLUME-UUID] is now in a degraded state. While the datastore is still available, parts of data that reside on the extent that went offline might be inaccessible.

  • Various Host logs show messages referring to "Address temporarily unmapped". To search for these messages, see the illustrative commands after the examples below.

Examples:

YYYY-MM-DDTHH:MM.SSSZ In(182) vmkernel: cpu47:2812483)Fil6: 4289: 'DATASTORE-NAME': Fil6 file IO (<FD c52 r1>) : Address temporarily unmapped

YYYY-MM-DDTHH:MM.SSSZ Db(167) Hostd[2099105]: --> Failed to copy source (/vmfs/volumes/##############/##########/########.vmdk) to destination (/vmfs/volumes/##############/##########/########.vmdk): Address temporarily unmapped.

YYYY-MM-DDTHH:MM.SSSZ Wa(180) vmkwarning: cpu21:3221033)WARNING: SVM: 2891: scsi##### Failed SVMFDSIoctlMoveData: Address temporarily unmapped
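
To confirm these symptoms on a Host, the log files can be searched from the ESXi Shell. The following commands are an illustrative sketch (log file names as referenced in this article; adjust the search strings as needed):

grep -i "extent.offline" /var/run/log/vobd.log
grep -i "went offline" /var/run/log/vmkwarning.log
grep -i "Address temporarily unmapped" /var/run/log/vmkernel.log /var/run/log/hostd.log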

 

Environment

ESXi 8.x
ESXi 7.x

Cause

One or more Storage devices backing the affected Datastore might be offline or not functioning properly.

The esx.problem.vmfs.extent.offline message is received when an ESXi Host loses connection to a Storage device that backs a VMFS Datastore or any of its Extents.

This loss of connection can happen when a switch or cable connecting the device to the ESXi Host is disconnected, or when the device is reformatted for use by another Volume.
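
To check connectivity from the ESXi Host's perspective, the paths to a suspect device can be listed. This is an illustrative check using a standard esxcli command; replace ################# with a device identifier obtained in step 1 of the Resolution below:

esxcli storage core path list -d #################

Paths reported in the dead state indicate a connectivity problem between the Host and the device.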

Resolution

Identify the Storage device(s) backing the affected VMFS Datastore and restore connectivity to these Storage device(s) by following the steps below:
 
Note:
If the Storage device(s) has been reformatted and reassigned to another Volume, the corresponding portion of the original Volume will be permanently lost and cannot be recovered.

 

1.) Run the following command to determine which Storage devices back the affected Datastore (= VMFS volume):

vmkfstools -Ph /vmfs/volumes/<datastore>

 
Example: 
 
[root@xxxxx:~] vmkfstools -Ph /vmfs/volumes/<datastore>
VMFS-6.82 (Raw Major Version: 24) file system spanning 4 partitions.
File system label (if any): datastore
Mode: public
Capacity 399.8 GB, 156.2 GB available, file block size 1 MB, max supported file size 64 TB
Disk Block Size: 512/16384/0
UUID: ########-########-####-############
Partitions spanned (on "lvm"):
        #################:1 ----------> First Partition: Head extent.
        #################:1
        #################:1
        #################:1
Is Native Snapshot Capable: NO
 
 
Note: In this scenario, the Datastore is configured with multiple Extents. As a result, multiple Storage devices (= #################) appear under "Partitions spanned (on 'lvm')", indicating that multiple devices are backing this Datastore.
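
Alternatively, all VMFS extents and their backing devices can be listed in one step and matched against the affected Datastore by name or UUID. This is an illustrative alternative using a standard esxcli command:

esxcli storage vmfs extent list

The output maps each Volume Name and VMFS UUID to its Extent Number and backing Device Name.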

2.) Verify whether this is a local Datastore or a non-local one (= located on a SAN array) by running the following command:

esxcli storage core device list

Look in the output for the Storage device(s) (#################) listed under "Partitions spanned (on 'lvm')" in step 1.

Example:

#################:
   Display Name: Local Make Disk (#################)
   Has Settable Display Name: true
   Size: 3662830
   Device Type: Direct-Access
   Multipath Plugin: HPP
   Devfs Path: /vmfs/devices/disks/#################
   Vendor: "Vendor Name"
   Model: ######
   Revision: ######
   SCSI Level: 6
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: true

Check for "Is Local:" from the above output:
Is Local: true    -------> This is a local Storage device (= affected Datastore is local to the Host)
Is Local: false   -------> This is a Storage device which is not local to the Host. The Storage device is located on SAN (= SAN LUN / external to the Host). The affected Datastore is located on SAN.
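
As a shortcut, the device listing can be filtered to just the fields of interest. This is an illustrative sketch combining a standard esxcli command with grep:

esxcli storage core device list -d ################# | grep -E "Display Name|Status|Is Local"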
 
 

3.) If one or more of the Storage device(s) (#################) listed under "Partitions spanned (on 'lvm')" show Is Local: true, identify their physical location:

esxcli storage core device physical get -d #################
Physical Location: enclosure ### slot ###

 

4.) Check the health of the Storage device(s) (#################) listed under "Partitions spanned (on 'lvm')".
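
For local devices that expose SMART data, health attributes can be read from the ESXi Shell. This is an illustrative check (availability of SMART data depends on the device and driver):

esxcli storage core device smart get -d #################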

5.) Involve your hardware Storage Vendor to identify and fix any Storage device issues, or have them check the health of the Storage device(s).

6.) If the issue continues after resolving the underlying Storage problem, restart the affected ESXi Host(s).
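
One way to restart a Host from the ESXi Shell is sketched below. This is illustrative: the Host must be in maintenance mode, so migrate or power off its Virtual Machines first, and the reason string is only an example:

esxcli system maintenanceMode set -e true
esxcli system shutdown reboot -r "VMFS extent recovery"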