vCenter Server High Availability (VCHA) nodes flapping between Active and Passive states due to network instability

Article ID: 430364

Products

VMware vCenter Server

Issue/Introduction

In a VMware Cloud Foundation or standalone vSphere environment, the vCenter Server High Availability (VCHA) cluster experiences frequent, unplanned failovers. While the vCenter services are operational, the Active and Passive nodes continuously swap roles, leading to intermittent management unavailability.

  • vCenter Server management interface (FQDN) is intermittently unreachable.

  • The vCenter HA monitoring page shows frequent state changes.

  • Log files (/var/log/vmware/vcha/vcha.log) contain entries such as:

    • Slave timed out

    • Lost master

    • Startup Timeout
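To gauge how often these messages occur, the log can be searched for the three strings listed above. This is a minimal sketch: the function name `count_vcha_heartbeat_errors` is hypothetical, and it assumes the messages appear verbatim, one per log line.

```shell
#!/bin/sh
# Hypothetical helper: count log lines containing any of the three
# heartbeat-related messages listed in this article.
# Argument 1 (optional): log path, defaulting to the path given above.
# Note: grep -c prints 0 and returns exit status 1 when nothing matches.
count_vcha_heartbeat_errors() {
    grep -E -c 'Slave timed out|Lost master|Startup Timeout' \
        "${1:-/var/log/vmware/vcha/vcha.log}"
}
```

Running `count_vcha_heartbeat_errors` on each node and comparing the counts over time can help confirm whether the failovers line up with heartbeat timeouts.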

Environment

VMware vCenter Server 7.x
VMware vCenter Server 8.x

Cause

  • The issue is caused by network instability or high latency on the dedicated VCHA Heartbeat network.
  • VCHA relies on a heartbeat mechanism between the Active, Passive, and Witness nodes.
  • If the Passive node fails to receive a heartbeat within the timeout period (typically because of packet loss or round-trip latency above 10 ms), it assumes the Active node is down and attempts to take over the Active role. If connectivity is then restored and lost again, the nodes keep swapping roles, which is the "flapping" behavior.

 

Resolution

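  1. Verify Network Latency: Log in to the VCSA shell of the Active node and ping the heartbeat IPs of the Passive and Witness nodes. Use ping rather than vmkping, which is an ESXi host command and is not available in the VCSA shell:

    # Replace <Peer_IP> with the heartbeat IP of the Passive or Witness node
    ping -I eth1 -c 10 <Peer_IP>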
    

    Ensure Round Trip Time (RTT) is consistently below 10 ms.

  2. Check for Packet Loss: Run a prolonged ping to identify intermittent drops:

    ping -I eth1 -c 100 <Peer_IP>
    

    Any non-zero packet loss can trigger a failover.

  3. Validate Network Configuration:

    1. Ensure the VCHA NICs (typically eth1) are on a dedicated, isolated VLAN.

    2. Confirm that the MTU settings are consistent across the virtual switches and physical switches.

    3. Ensure no firewall or physical security appliance is inspecting and delaying heartbeat traffic.

  4. Stabilize via Maintenance Mode (Workaround): If the network cannot be fixed immediately, put VCHA into Maintenance Mode to prevent further automated failovers:

    1. Go to vCenter > Configure > vCenter HA.

    2. Click Edit and select Maintenance Mode. This keeps replication active but prevents automatic failover.
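The latency, packet-loss, and MTU checks in steps 1 to 3 can be partly scripted. The following is a minimal sketch, not a supported tool: the function names are hypothetical, it assumes the VCSA shell provides awk and an iputils-style ping whose summary lines contain "packet loss" and "min/avg/max", and it uses the 10 ms threshold from this article as the default limit.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and report whether the
# summary shows any packet loss or an average RTT at or above the limit.
# Argument 1 (optional): RTT limit in ms (default 10, per this article).
# Exit status: 0 = OK, 1 = FAIL, 2 = no ping summary found.
# Example: ping -I eth1 -c 100 <Peer_IP> | check_heartbeat_ping
check_heartbeat_ping() {
    awk -v limit="${1:-10}" '
        /packet loss/ {
            # Pick the field that ends in "%", e.g. "0%".
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { loss = $i; sub(/%/, "", loss) }
        }
        /min\/avg\/max/ {
            # Line looks like: rtt min/avg/max/mdev = a/b/c/d ms
            split($0, halves, " = ")
            split(halves[2], rtt, "/")
            avg = rtt[2]
        }
        END {
            if (loss == "") { print "UNKNOWN: no ping summary found"; exit 2 }
            # With 100% loss, ping prints no RTT line at all.
            if (avg == "") avg = "n/a"
            bad = (loss + 0 > 0) || (avg == "n/a") || (avg + 0 >= limit + 0)
            printf "%s loss=%s%% avg_rtt=%sms\n", (bad ? "FAIL" : "OK"), loss, avg
            exit bad
        }'
}

# Hypothetical helper: largest ICMP payload that fits in one unfragmented
# IPv4 packet for a given MTU (20-byte IP header + 8-byte ICMP header).
# Example: ping -I eth1 -M do -s "$(icmp_payload_for_mtu 1500)" -c 5 <Peer_IP>
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}
```

If the don't-fragment ping (`-M do`) fails at the payload size computed for the configured MTU while smaller payloads succeed, a device on the heartbeat path is likely using a smaller MTU.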

 

Additional Information

Refer to Deploying vCenter High Availability with network addresses in separate subnets - vSphere 6.5.