In a VMware Cloud Foundation or standalone vSphere environment, the vCenter Server High Availability (VCHA) cluster experiences frequent, unplanned failovers. While the vCenter services are operational, the Active and Passive nodes continuously swap roles, leading to intermittent management unavailability.
The vCenter Server management interface (FQDN) is intermittently unreachable.
The vCenter HA monitoring page shows frequent state changes.
Log files (/var/log/vmware/vcha/vcha.log) contain entries such as:
Slave timed out
Lost master
Startup Timeout
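To gauge how often these signatures occur, the log can be searched from the VCSA shell. The following is a minimal sketch rather than an official tool: the function name `count_vcha_events` is hypothetical, and it only counts lines containing the three strings listed above.

```shell
# Count heartbeat-failure signatures in a VCHA log file.
# $1 = path to the log file, for example /var/log/vmware/vcha/vcha.log
count_vcha_events() {
    log_file="$1"
    for pattern in "Slave timed out" "Lost master" "Startup Timeout"; do
        # grep -c prints the number of matching lines (0 if there are none);
        # -F treats the pattern as a fixed string, not a regular expression.
        count=$(grep -c -F "$pattern" "$log_file")
        printf '%s: %s\n' "$pattern" "$count"
    done
}

# Example (run on the Active node):
#   count_vcha_events /var/log/vmware/vcha/vcha.log
```

A count that rises between two runs suggests the heartbeat problem is still occurring.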
VMware vCenter Server 7.x
VMware vCenter Server 8.x
Verify Network Latency: Log in to the VCSA shell of the Active node and ping the Passive and Witness node heartbeat IPs:
# Replace <Peer_IP> with the heartbeat IP of the Passive or Witness node
ping -I eth1 -c 20 <Peer_IP>
Ensure Round Trip Time (RTT) is consistently below 10 ms.
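The RTT check can be automated by parsing the ping summary. This is a sketch under stated assumptions: it expects the Linux iputils summary line (`rtt min/avg/max/mdev = ...`), the helper names `max_rtt_ms` and `rtt_within_limit` are hypothetical, and it tests the maximum RTT because the value should stay below 10 ms consistently.

```shell
# Read ping output on stdin and print the maximum RTT in milliseconds.
# Assumes a summary line of the form:
#   rtt min/avg/max/mdev = 0.245/0.310/0.402/0.050 ms
max_rtt_ms() {
    # Splitting on "/" puts min in $4 (with a prefix), avg in $5, max in $6.
    awk -F'/' '/^rtt/ { print $6 }'
}

# Succeed if the observed RTT is below the limit.
# $1 = RTT in ms, $2 = limit in ms (defaults to 10)
rtt_within_limit() {
    [ -n "$1" ] || return 2   # no RTT parsed: treat as an error, not a pass
    awk -v rtt="$1" -v limit="${2:-10}" 'BEGIN { exit !(rtt + 0 < limit + 0) }'
}

# Example (replace <Peer_IP> before running):
#   rtt=$(ping -I eth1 -c 20 <Peer_IP> | max_rtt_ms)
#   rtt_within_limit "$rtt" && echo "RTT OK: ${rtt} ms" || echo "RTT too high: ${rtt} ms"
```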
Check for Packet Loss: Run a prolonged ping to identify intermittent drops:
ping -c 100 -I eth1 <Peer_IP>
Any packet loss on the heartbeat network can cause missed heartbeats and trigger a failover.
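For repeated runs, the loss percentage can be pulled out of the ping summary instead of being read by eye. The sketch below assumes the iputils summary wording (`N% packet loss`); the helper names `packet_loss_pct` and `has_packet_loss` are hypothetical.

```shell
# Read ping output on stdin and print the packet-loss percentage.
# Assumes a summary line such as:
#   100 packets transmitted, 98 received, 2% packet loss, time 99123ms
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Succeed if the given loss percentage is greater than zero.
has_packet_loss() {
    [ -n "$1" ] || return 2   # nothing parsed: report an error
    awk -v loss="$1" 'BEGIN { exit !(loss + 0 > 0) }'
}

# Example (replace <Peer_IP> before running):
#   loss=$(ping -c 100 -I eth1 <Peer_IP> | packet_loss_pct)
#   has_packet_loss "$loss" && echo "Packet loss detected: ${loss}%"
```

Running the check several times at different hours helps catch intermittent drops that a single 100-packet sample can miss.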
Validate Network Configuration:
Ensure the VCHA NICs (typically eth1) are on a dedicated, isolated VLAN.
Confirm that the MTU settings are consistent across the virtual switches and physical switches.
Ensure no firewall or physical security appliance is inspecting and delaying heartbeat traffic.
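One way to test MTU consistency end to end is to send pings that fill the configured MTU with fragmentation prohibited. This is a sketch, not a documented VCHA procedure: it assumes IPv4 and the iputils `ping` options `-M do` and `-s`, the default MTU of 1500 and interface `eth1` are assumptions to adjust for your environment, and the function names are hypothetical.

```shell
# An ICMP echo over IPv4 carries 28 bytes of headers
# (20-byte IP header + 8-byte ICMP header), so payload = MTU - 28.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send non-fragmentable pings sized to fill the given MTU.
# $1 = peer heartbeat IP, $2 = MTU (defaults to 1500), $3 = interface (defaults to eth1)
check_path_mtu() {
    peer="$1"
    mtu="${2:-1500}"
    iface="${3:-eth1}"
    # -M do prohibits fragmentation, so a smaller MTU anywhere on the path
    # makes the ping fail instead of being silently fragmented.
    ping -M do -s "$(icmp_payload_for_mtu "$mtu")" -c 3 -I "$iface" "$peer"
}

# Example (replace <Peer_IP> before running):
#   check_path_mtu <Peer_IP> 1500 eth1
```

If the full-size ping fails while a default-size ping succeeds, a device on the path is likely configured with a smaller MTU.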
Stabilize via Maintenance Mode (Workaround): If the network cannot be fixed immediately, put VCHA into Maintenance Mode to prevent further automated failovers:
In the vSphere Client, select the vCenter Server object and go to Configure > Settings > vCenter HA.
Click Edit and select Maintenance Mode. This keeps replication active but prevents automatic failover.