vSAN Cluster Hosts Repeatedly Losing and Recovering Access to Storage Volumes

Products

VMware vSAN

Issue/Introduction

Multiple ESXi hosts within a vSAN cluster intermittently log events indicating they have lost and subsequently recovered access to storage volumes.

The vSphere Client and vCenter events display repeated connectivity warnings:

Lost access to volume 5acfc497-633bf9de-b5c0-############ (97c4cf5a-dd53-07b1-d372-############) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Information 05/23/2025, 5:19:38 AM Host1
Successfully restored access to volume 5acfc497-633bf9de-b5c0-############ (97c4cf5a-dd53-07b1-d372-############) following connectivity issues. Information 05/23/2025, 5:19:40 AM Host1

The vmkernel.log on the affected hosts records high CMMDS round-trip (RT) latency and VMFS heartbeat timeouts:

2025-05-18T11:53:01.588Z CMMDS: LeaderUpdateMeanRTLatency:12333: Throttled: 529c7cd4-6a43-ab4c-85b8-############: High RT latency. Node 00000000-0000-0000-0000-############, RT latency 5382(ms). Mean RT latency 337(ms)
2025-05-18T11:53:16.641Z HBX: 5765: Reclaiming HB at 3645440 on vol '6283e45a-0cf0-c643-6ca0-############' replayHostHB: 0 replayHostHBgen: 0 replayHostUUID: (00000000-00000000-0000-000000000000).
2025-05-18T11:53:16.643Z HBX: 294: '6283e45a-0cf0-c643-6ca0-############': HB at offset 3645440 - Reclaimed heartbeat [Timeout]:

The vobd.log confirms the heartbeat timeout issues across multiple nodes:

2025-05-18T11:53:05.437Z: [vob.vmfs.heartbeat.timedout] 5ae48362-736aefde-ea80-############ 6283e45a-0cf0-c643-6ca0-############
2025-05-18T11:53:05.437Z: [esx.problem.vmfs.heartbeat.timedout] 5ae48362-736aefde-ea80-############ 6283e45a-0cf0-c643-6ca0-############
2025-05-18T11:53:16.643Z: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume 5ae48362-736aefde-ea80-############ (6283e45a-0cf0-c643-6ca0-############)
2025-05-18T11:53:16.644Z: [esx.problem.vmfs.heartbeat.recovered] 5ae48362-736aefde-ea80-############ 6283e45a-0cf0-c643-6ca0-############

Note: The preceding log excerpts are only examples. Dates, times, host names, and identifiers will differ in your environment.
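To gauge how long each heartbeat outage lasted, the timedout and recovered events in vobd.log can be paired per volume. The following is a minimal Python sketch, not an official tool. It assumes the line layout shown in the excerpts above; the function name heartbeat_outages and both regular expressions are illustrative, and the patterns may need adjusting for other ESXi builds.

```python
import re
from datetime import datetime

# Assumed layout, taken from the vobd.log excerpts in this article:
#   <timestamp>Z: [vob.vmfs.heartbeat.timedout] <volume-uuid> <label>
#   <timestamp>Z: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume <volume-uuid> (<label>)
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{1,6})Z: "
    r"\[vob\.vmfs\.heartbeat\.(?P<kind>timedout|recovered)\](?P<rest>.*)$"
)
# First token shaped like 8-8-4-<tail>, treated here as the volume identifier.
VOLUME_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-\S+")


def heartbeat_outages(lines):
    """Return a list of (volume, seconds) for each timeout/recovery pair."""
    pending = {}   # volume -> time of the first unmatched timeout
    outages = []
    for line in lines:
        event = EVENT_RE.match(line.strip())
        if not event:
            continue  # skips unrelated lines, including the esx.problem.* duplicates
        volume = VOLUME_RE.search(event.group("rest"))
        if not volume:
            continue
        vol = volume.group(0)
        ts = datetime.strptime(event.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        if event.group("kind") == "timedout":
            pending.setdefault(vol, ts)
        elif vol in pending:
            start = pending.pop(vol)
            outages.append((vol, (ts - start).total_seconds()))
    return outages
```

For example, passing the lines of a copied vobd.log (open("vobd.log") works as the argument) returns one entry per outage. Outages that cluster at the same times on several hosts are consistent with a shared network fault rather than a single-host problem.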

Environment

VMware vSAN (All Versions)

Cause

Instability in the physical network infrastructure within the datacenter causes packet loss, frame length errors, and CRC errors. As a result, vSAN and storage heartbeat traffic between the hosts is dropped, which produces the high latency and datastore access timeouts logged by ESXi.

Resolution

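Verify the presence of hardware-level network errors on each affected ESXi host by checking the NIC statistics over SSH. Replace vmnic# with the name of each physical uplink that carries vSAN traffic:
esxcli network nic stats get -n vmnic#
Run the command several times and check for incrementing values in the receive error, transmit error, length error, CRC error, or over error counters. A single reading shows only the totals since the counters were last reset, not whether errors are still occurring.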
Engage the local datacenter networking team to investigate the physical network path. This includes inspecting Top-of-Rack (ToR) switches, physical cables, SFPs, and switchport configurations for faults or congestion.
Once the networking team identifies and resolves the physical infrastructure fault, verify that the NIC error counters have stopped incrementing.
Monitor the vSphere Client to ensure no further "Lost access to volume" events are generated.
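Checking whether counters are still incrementing means comparing two readings taken some time apart. The following is a minimal Python sketch of that comparison, not an official tool. It assumes the output of esxcli network nic stats get has been saved to text and consists of "Counter name: value" lines; the function names parse_nic_stats and incrementing_errors are illustrative, and the counter names in the usage note are placeholders for whatever your build prints.

```python
def parse_nic_stats(text):
    """Parse 'Counter name: 123' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        # Header lines without a colon or without a numeric value are skipped.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def incrementing_errors(before, after, keywords=("error",)):
    """Return {counter: increase} for matching counters that grew between readings."""
    grown = {}
    for name, new_value in after.items():
        old_value = before.get(name)
        if old_value is None:
            continue  # counter missing from the first reading
        if any(k in name.lower() for k in keywords) and new_value > old_value:
            grown[name] = new_value - old_value
    return grown
```

Usage: capture the command output twice, a few minutes apart, then call incrementing_errors(parse_nic_stats(first), parse_nic_stats(second)). An empty result after the network fix suggests the error counters have stopped rising. Passing keywords=("error", "dropped") also includes drop counters, if the output contains them.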

Additional Information

For more information regarding vSAN network requirements and troubleshooting physical network statistics, refer to the Broadcom TechDocs: Troubleshooting the vSAN Network