Understanding lost access to volume messages in ESXi

Products

VMware vSphere ESXi 8.x

Issue/Introduction

This article explains the Lost access to volume messages reported by ESXi.

VMFS datastores are monitored through heartbeats, which each host issues as write operations to the VMFS volume approximately once every 3 seconds. Each ESXi host accessing a VMFS datastore expects these heartbeat write I/O operations to complete within an 8-second window. If a heartbeat I/O does not complete within 8 seconds, the I/O is timed out and a subsequent heartbeat I/O is issued. If no heartbeat I/O completes within a total of 16 seconds, the datastore is marked offline and hostd logs a Lost access to volume message to reflect this.

After a VMFS datastore is marked offline, ESXi issues heartbeat I/O to the datastore approximately once every second until connectivity is restored. When a heartbeat I/O completes, the datastore is marked online again and host I/O is allowed to continue.
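
The timing described above can be sketched as a small model. This is an illustration only, not ESXi code: the function name heartbeat_state and the state labels are hypothetical, the two thresholds are taken from the text, and treating a stall of exactly 8 or 16 seconds as past the window is an assumption.

```python
# Thresholds taken from the article text; everything else is illustrative.
HEARTBEAT_TIMEOUT_S = 8    # window for a single heartbeat I/O
LOST_ACCESS_AFTER_S = 16   # total window before the datastore is marked offline


def heartbeat_state(stall_seconds):
    """Return the expected datastore state after heartbeat I/O has
    been stalled for stall_seconds (hypothetical helper)."""
    if stall_seconds < HEARTBEAT_TIMEOUT_S:
        return "online"
    if stall_seconds < LOST_ACCESS_AFTER_S:
        # The first heartbeat I/O has timed out and a second one was issued.
        return "heartbeat timed out, retrying"
    # Lost access to volume is logged; ESXi retries about once per second.
    return "lost access"
```

In this model, a storage stall shorter than 16 seconds produces heartbeat timeouts in the logs but no Lost access to volume event.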

Symptoms:

Virtual machines display as inaccessible.

In the /var/log/hostd.log file, find entries similar to:

yyyy-mm-ddThh:mm:ss [4F1E1B70 info 'Vimsvc.ha-eventmgr'] Event 205 : Lost access to volume 54f89e21-########-####-##########98 (example datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
yyyy-mm-ddThh:mm:ss [4F480B70 info 'Vimsvc.ha-eventmgr'] Event 210 : Successfully restored access to volume 54f89e21-########-####-##########98 (example datastore) following connectivity issues

In the /var/log/vobd.log file, find entries similar to:

yyyy-mm-ddThh:mm:ss: [vmfsCorrelator] 115715089142us: [esx.problem.vmfs.heartbeat.timedout] 54f89e21-########-####-##########98 example datastore
yyyy-mm-ddThh:mm:ss: [vmfsCorrelator] 115740470730us: [esx.problem.vmfs.heartbeat.recovered] 54f89e21-########-####-##########98 example datastore

In the /var/log/vmkernel.log file, find entries similar to:

yyyy-mm-ddThh:mm:ss cpu10:36273)HBX: 2832: Waiting for timed out [HB state abcdef02 offset 3444736 gen 549 stampUS 115704005679 uuid 5592d754-21d7d8a7-0a7e-##########98 jrnl <FB 779600> drv 14.60] on vol 'example datastore'
yyyy-mm-ddThh:mm:ss cpu26:32873)HBX: 258: Reclaimed heartbeat for volume 54f89e21-########-####-##########98 (example datastore): [Timeout] Offset 3444736

In vCenter Server, find events similar to:

Lost access to volume 54f89e21-########-####-##########98 (example datastore) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values vary depending on your environment.
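
The vobd.log entries shown above can be paired to estimate how long each loss of access lasted. The following Python sketch is not a VMware tool. It assumes that the number ending in us is a monotonically increasing microsecond counter and that each timedout event is followed by a recovered event for the same volume UUID. The function name heartbeat_outages and the regular expression are hypothetical and are based only on the excerpt format in this article.

```python
import re

# Pattern derived from the vobd.log excerpts in this article (an assumption).
EVENT_RE = re.compile(
    r"\[vmfsCorrelator\]\s+(?P<us>\d+)us:\s+"
    r"\[esx\.problem\.vmfs\.heartbeat\.(?P<kind>timedout|recovered)\]\s+"
    r"(?P<uuid>\S+)\s*(?P<name>.*)"
)


def heartbeat_outages(lines):
    """Pair timedout/recovered events per volume.

    Returns a list of (uuid, datastore_name, seconds) tuples.
    """
    pending = {}   # volume UUID -> microsecond stamp of the first timeout
    outages = []
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        uuid, stamp = m["uuid"], int(m["us"])
        if m["kind"] == "timedout":
            pending.setdefault(uuid, stamp)   # keep the earliest timeout
        elif uuid in pending:
            start = pending.pop(uuid)
            outages.append((uuid, m["name"].strip(), (stamp - start) / 1_000_000))
    return outages


# Usage with the two example lines from this article.
sample = [
    "yyyy-mm-ddThh:mm:ss: [vmfsCorrelator] 115715089142us: "
    "[esx.problem.vmfs.heartbeat.timedout] "
    "54f89e21-########-####-##########98 example datastore",
    "yyyy-mm-ddThh:mm:ss: [vmfsCorrelator] 115740470730us: "
    "[esx.problem.vmfs.heartbeat.recovered] "
    "54f89e21-########-####-##########98 example datastore",
]
for uuid, name, seconds in heartbeat_outages(sample):
    print(f"{name}: {seconds:.1f} s between timeout and recovery")
```

The interval is measured from the timedout event to the recovered event, so it does not include the time that elapsed before the first heartbeat timeout was reported.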

Environment

VMware vSphere ESXi 8.x

Resolution

To determine why the heartbeat I/O operations did not complete in time:

Note the date/time when the lost access to volume message was reported and check the ESXi host logs for related information.
Verify that there are no connectivity issues between the ESXi host and the storage device.

Troubleshooting network connectivity issues depends on how the storage is connected. For more information, see Troubleshooting LUN connectivity issues on ESXi hosts.

If Lost access to volume warnings are reported repeatedly at certain times or at regular intervals, also check whether any I/O-intensive scheduled tasks are degrading I/O performance to the point that datastore heartbeat I/Os time out, for example:

VM backups
VDI desktop deployments or deletions
Trim operations (Linux) or disk optimization (Windows), which generate UNMAP commands
Storage array background operations
Tasks implemented via API, scripts, or cron jobs

Changing the scheduling or distribution of such tasks may prevent loss of access to volumes.
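
To see whether the events cluster around a scheduled task, the event timestamps can be grouped by hour of day. The following Python sketch assumes that the timestamps have already been collected from the host logs and that they begin with a yyyy-mm-ddThh:mm:ss prefix. The function name outages_by_hour and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def outages_by_hour(timestamps):
    """Count Lost access to volume events per hour of day (0-23)."""
    hours = Counter()
    for ts in timestamps:
        # Only the first 19 characters are parsed, so fractional
        # seconds and a trailing "Z" are ignored.
        when = datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")
        hours[when.hour] += 1
    return hours


# Hypothetical timestamps, for illustration only.
events = [
    "2024-01-01T02:00:05Z",
    "2024-01-02T02:01:10Z",
    "2024-01-02T14:30:00Z",
]
for hour, count in outages_by_hour(events).most_common():
    print(f"{hour:02d}:00  {count} event(s)")
```

A pronounced peak in one hour suggests comparing that time against backup windows, array maintenance jobs, and other scheduled tasks.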

Additional Information

While a volume is in the lost access state, host I/O is blocked until a heartbeat I/O completes. After the first heartbeat timeout is generated, ESXi issues subsequent heartbeat reclaim operations to the datastore, approximately once every second, until the heartbeat is recovered. Until the heartbeat is reclaimed, VMFS fails all I/O operations from virtual machines residing on the impacted datastore with a DEVICE BUSY status. A guest operating system should remain online as long as it can tolerate the long latency of these I/O operations to the VMDK.

For more information, see: