vSphere HA Agent State Flapping (Election/Connected/Unreachable) in vSphere 8.x vSAN Cluster

Article ID: 429158

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

In a vSphere 8.0 U3 cluster with vSAN enabled, the vSphere HA (High Availability) state on multiple ESXi hosts flaps continuously through the following cycle: Uninitialized --> Election --> Connected --> Agent Unreachable --> Election (repeats).

  • Other clusters managed by the same vCenter Server have a stable vSphere HA status.

  • The issue persists after performing "Reconfigure for vSphere HA" on the ESXi hosts and after toggling vSphere HA (Disable/Enable) at the cluster level.

  • Network validation confirms MTU 9000 is consistent, and Port 8182 (HA Agent) connectivity is open and reachable between hosts.

  • vSAN health score is below 60%.

  • The /var/run/log/vobd.log on multiple ESXi hosts shows repeated vSAN checksum error corrections:

vobd[2098148]:  [vSANCorrelator] ####us: [vob.vsan.dom.singlediskerrorfixed] vSAN detected and fixed a medium or checksum error for component #####-####-######-######## on disk #####-####-######-########.
vobd[2098148]:  [vSANCorrelator] ####us: [vob.vsan.dom.singlediskerrorfixed] vSAN detected and fixed a medium or checksum error for component #####-####-######-######## on disk #####-####-######-########.
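
To see whether the corrections cluster on particular disks, the events above can be tallied per disk. The following is a minimal sketch, not a supported tool: it assumes a copy of vobd.log is available as text and that the event lines follow the format shown above. The function name count_fixed_errors_by_disk and the regular expression are illustrative, not part of any VMware utility.

```python
import re
from collections import Counter

# Matches the vob.vsan.dom.singlediskerrorfixed events shown in the article
# and captures the component and disk identifiers. The pattern is an
# assumption based on the sample lines, not an official log grammar.
EVENT_RE = re.compile(
    r"\[vob\.vsan\.dom\.singlediskerrorfixed\].*?"
    r"for component (?P<component>\S+) on disk (?P<disk>[^\s.]+)"
)


def count_fixed_errors_by_disk(lines):
    """Return a Counter mapping disk identifier -> number of fixed-error events."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group("disk")] += 1
    return counts


# Example usage against a local copy of the log (the path is an assumption):
# with open("vobd.log", errors="replace") as fh:
#     for disk, n in count_fixed_errors_by_disk(fh).most_common():
#         print(disk, n)
```

A disk that accounts for most of the events is a reasonable item to raise with Broadcom Support alongside the vSAN health results.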
 

Environment

VMware vSphere 8.X

vSAN 8.X

Cause

This issue occurs due to the architectural dependency of vSphere HA on the vSAN network stack. When vSphere HA is enabled on a vSAN cluster, HA inter-agent traffic relies on the vSAN logical network.

Techdoc for reference: Using vSAN and vSphere HA

The presence of inaccessible objects and active checksum errors indicates that the vSAN Distributed Object Manager (DOM) and the underlying storage network may be experiencing contention. Because HA heartbeats travel over the same vSAN network, the resulting latency or packet loss can lead the FDM (Fault Domain Manager) agent to incorrectly conclude that the host is isolated or partitioned, which produces the Election/Connected/Unreachable loop.

Resolution

Engage Broadcom Support for assistance with removing the inaccessible vSAN objects. Once the objects are removed, verify that the vSAN health score improves (e.g., above 90%).

Monitor the vSphere HA state for the cluster; it should stabilize to "Connected" without further intervention. If the state remains unstable, disable and re-enable vSphere HA on the cluster.
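
One way to judge whether the state has stabilized is to sample it at a fixed interval and check for transitions. The sketch below is illustrative only: it assumes you already have a list of sampled state labels for one host, collected by whatever means you prefer. The function name summarize_ha_states, the default label "Connected", and the run length are assumptions; adjust the labels to match what your environment reports (the primary host, for example, shows a different label than the others).

```python
def summarize_ha_states(samples, stable_states=("Connected",), min_stable_run=10):
    """Summarize a list of sampled HA agent state labels for one host.

    Returns (transitions, is_stable):
      transitions - number of state changes between consecutive samples
      is_stable   - True when the last `min_stable_run` samples are all
                    in `stable_states`
    """
    # Count every change between one sample and the next.
    transitions = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)

    # Require a full-length run of stable samples at the end of the window.
    tail = samples[-min_stable_run:]
    is_stable = len(tail) == min_stable_run and all(s in stable_states for s in tail)
    return transitions, is_stable
```

A host that is still flapping shows a growing transition count and never accumulates a full stable run; a recovered host shows a small transition count followed by an unbroken run of the stable label.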