Understanding causes of failover in a VCHA system

Article ID: 318740

Products

VMware vCenter Server

Issue/Introduction

  • This article describes some of the most common reasons for a vCenter HA (VCHA) failover, how to detect each cause, and suggested remediation measures.
  • VCHA creates a three-node cluster consisting of an Active node (N1), a Passive node (N2), and a Witness node (N3). A vCenter Server failover can be initiated in different ways depending on its cause; the sections below cover several common causes and the corresponding remediation steps.

 


Environment

  • VMware vCenter Server 6.x
  • VMware vCenter Server 7.x
  • VMware vCenter Server 8.x

 

Resolution

Critical Service Crash

  • When a critical service crashes on N1, node N2 assumes the active role while N1 recovers from the crash.
  • In this case, inspecting the vmon logs on N1 around the time of the failover shows entries like the following (a log-search sketch follows the excerpts below):
 
vmon*.log (in /var/log/vmware/vmon directory):
YYYY-MM-DDTHH:MM:SS.657Z| host-9956| D008: <vpxd> Pid 18292 exited with exit code 1
YYYY-MM-DDTHH:MM:SS.657Z| host-9956| W003: <vpxd> Service exited. Exit code 1 
YYYY-MM-DDTHH:MM:SS.657Z| host-9956| W003: <vpxd> Service exited unexpectedly. Crash count 0. Taking configured recovery action. 
YYYY-MM-DDTHH:MM:SS.657Z| host-9956| I005: SOCKET creating new socket, connecting to /storage/vmware-vmon/vchalistener
YYYY-MM-DDTHH:MM:SS.657Z| host-9956| I005: <vpxd> Initiated VCHA failover for service.
  • The above indicates that vpxd crashed and triggered a failover.
  • At the same time, N2's vcha logs show it picking up the active role.
vcha*.log on N2 (/var/log/vmware/vcha directory):
YYYY-MM-DDTHH:MM:SS.125Z info vcha[63815] [Originator@6876 sub=ThreadPool] Entering worker thread loop
YYYY-MM-DDTHH:MM:SS.157Z info vcha[03992] [Originator@6876 sub=VchaUtil] Successful /usr/lib/vmware-vcha/scripts/configureVmonSystemdUnit.py
YYYY-MM-DDTHH:MM:SS.157Z info vcha[03992] [Originator@6876 sub=Agent] Successfully picked up primary role
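  • To confirm this signature quickly, the vmon logs can be scanned for the service-exit and failover-initiation messages shown above. Below is a minimal Python sketch; it assumes the default /var/log/vmware/vmon location from the excerpt and relies only on the marker strings quoted in this article:

import glob

# Default vmon log location on the vCenter Server Appliance (per the excerpt above).
LOG_GLOB = "/var/log/vmware/vmon/vmon*.log"

# Marker strings taken from the log excerpt in this article.
MARKERS = (
    "Service exited unexpectedly",
    "Initiated VCHA failover for service",
)

def find_crash_failovers(log_glob=LOG_GLOB):
    """Return (file, line) pairs for vmon messages that indicate a
    service crash followed by a VCHA failover."""
    hits = []
    for path in sorted(glob.glob(log_glob)):
        with open(path, errors="replace") as fh:
            for line in fh:
                if any(marker in line for marker in MARKERS):
                    hits.append((path, line.rstrip()))
    return hits

if __name__ == "__main__":
    for path, line in find_crash_failovers():
        print(f"{path}: {line}")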

Remediation

  • It is possible that the service crash represents a bug, and the support team will need to investigate its cause. To maximize availability of vCenter Server, VCHA fails over to the passive node whenever a critical service crashes.
  • It is recommended to apply the latest available vCenter Server patches to take advantage of the stability fixes delivered in patch and update releases.
  • If services crash often and cause frequent failovers, and you want to avoid the automatic failover, you can place VCHA in maintenance mode. In maintenance mode automatic failovers do not occur after a crash, but you can still choose to fail over to the passive node manually (see the hedged API sketch after this list).
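  • As a hedged illustration of the maintenance-mode option, the sketch below switches the VCHA cluster mode through pyVmomi. It assumes pyVmomi is installed and that the vCenter HA management object (content.failoverClusterManager, vSphere 6.5 and later) and its SetClusterMode_Task method are available in your environment; the host name and credentials are placeholders. Treat this as a sketch, not a definitive procedure, and prefer the vCenter HA settings in the vSphere Client if in doubt:

import ssl
from pyVim.connect import SmartConnect, Disconnect

def set_vcha_maintenance(host, user, password):
    """Place the VCHA cluster in maintenance mode (no automatic failover)."""
    ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
    si = SmartConnect(host=host, user=user, pwd=password, sslContext=ctx)
    try:
        # Assumption: ServiceContent exposes the vCenter HA manager as
        # failoverClusterManager on vSphere 6.5 and later.
        fcm = si.content.failoverClusterManager
        # "maintenance" disables automatic failover; manual failover remains possible.
        return fcm.SetClusterMode_Task("maintenance")
    finally:
        Disconnect(si)

if __name__ == "__main__":
    # Placeholder values; replace with your vCenter FQDN and credentials.
    set_vcha_maintenance("vcenter.example.com", "administrator@vsphere.local", "********")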

Network isolation of the active node

  • VCHA requires a highly reliable network with less than 10 ms of latency between its active, passive, and witness nodes. If any node becomes isolated, the VCHA system becomes unhealthy. Specifically, if the active node (N1) becomes isolated, the passive node (N2) takes over as the active node.
  • In this case, if you inspect the VCHA logs of node N2 (which picks up the active role), you will see lines such as the following (vcha*.log on N2):
YYYY-MM-DDTHH:MM:SS.965Z info vcha[04102] [Originator@6876 sub=Election opID=SWI-1] Election State Change Slave => Startup : Lost master
YYYY-MM-DDTHH:MM:SS.965Z info vcha[04102] [Originator@6876 sub=Cluster opID=SWI-1] Change election state to Startup:0
YYYY-MM-DDTHH:MM:SS.965Z verbose vcha[04108] [Originator@6876 sub=Cluster opID=SWI-41a7] Transitioned from election state Slave to Startup
YYYY-MM-DDTHH:MM:SS.965Z info vcha[04108] [Originator@6876 sub=Cluster opID=SWI-41a7] curState 1 lastState 4
YYYY-MM-DDTHH:MM:SS.965Z info vcha[04304] [Originator@6876 sub=Invt opID=SWI-35702e2f] Cluster state changed to Startup
.
.
YYYY-MM-DDTHH:MM:SS.971Z info vcha[04102] [Originator@6876 sub=Election opID=SWI-1] Election State Change Candidate => Master : Master selected
YYYY-MM-DDTHH:MM:SS.971Z info vcha[04102] [Originator@6876 sub=Cluster opID=SWI-1] Change election state to Master:64758875541
YYYY-MM-DDTHH:MM:SS.971Z verbose vcha[04108] [Originator@6876 sub=Cluster opID=SWI-41a7] Transitioned from election state Candidate to Master
YYYY-MM-DDTHH:MM:SS.971Z info vcha[04108] [Originator@6876 sub=Cluster opID=SWI-41a7] curState 6 lastState 5
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04108] [Originator@6876 sub=Cluster opID=SWI-41a7] Am now master
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04101] [Originator@6876 sub=Invt opID=SWI-6c4f3258] Cluster state changed to Master
YYYY-MM-DDTHH:MM:SS.972Z error vcha[04101] [Originator@6876 sub=Invt opID=SWI-6c4f3258] Could not get product version, treating as CSD unsupported host
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04101] [Originator@6876 sub=ClusterMgr opID=SWI-437266f5] MASTER 192.168.111.152
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04101] [Originator@6876 sub=ClusterMgr opID=SWI-437266f5] Quorum: NO
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04101] [Originator@6876 sub=ClusterMgr opID=SWI-437266f5] Node is isolated
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04101] [Originator@6876 sub=ClusterMgr opID=SWI-437266f5] Notify isolation
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04101] [Originator@6876 sub=Agent opID=SWI-437266f5] This node is isolated
YYYY-MM-DDTHH:MM:SS.972Z info vcha[04113] [Originator@6876 sub=Agent] Processing event kIsolated
YYYY-MM-DDTHH:MM:SS.019Z verbose vcha[04118] [Originator@6876 sub=Cluster] FDM_SM_SLAVE_MSG with id 192.168.111.153 (192.168.111.153)
YYYY-MM-DDTHH:MM:SS.019Z verbose vcha[04118] [Originator@6876 sub=Cluster] No UUID for host 192.168.111.153
YYYY-MM-DDTHH:MM:SS.019Z verbose vcha[04108] [Originator@6876 sub=Cluster opID=SWI-41a7] ClusterManagerImpl::AddSlave: Adding slave @ 192.168.111.153
YYYY-MM-DDTHH:MM:SS.020Z verbose vcha[04305] [Originator@6876 sub=Invt opID=SWI-353f7788] Added host 192.168.111.153 to inventory
YYYY-MM-DDTHH:MM:SS.020Z info vcha[04112] [Originator@6876 sub=ClusterMgr opID=SWI-59c09d07] Slave id is : 192.168.111.153 0
YYYY-MM-DDTHH:MM:SS.020Z info vcha[04112] [Originator@6876 sub=ClusterMgr opID=SWI-59c09d07] MASTER 192.168.111.152
YYYY-MM-DDTHH:MM:SS.020Z info vcha[04112] [Originator@6876 sub=ClusterMgr opID=SWI-59c09d07] Quorum: YES
YYYY-MM-DDTHH:MM:SS.020Z info vcha[04305] [Originator@6876 sub=Invt opID=SWI-353f7788] Host 192.168.111.153 changed state: Live
YYYY-MM-DDTHH:MM:SS.020Z info vcha[04112] [Originator@6876 sub=ClusterMgr opID=SWI-59c09d07] Notify Change to master
YYYY-MM-DDTHH:MM:SS.020Z info vcha[04112] [Originator@6876 sub=Agent opID=SWI-59c09d07] Triggering takeover on secondary...
.
.
YYYY-MM-DDTHH:MM:SS.228Z info vcha[04113] [Originator@6876 sub=Agent] Successfully picked up primary role
YYYY-MM-DDTHH:MM:SS.228Z info vcha[04113] [Originator@6876 sub=Agent] Processing event kWitnessJoin
YYYY-MM-DDTHH:MM:SS.228Z verbose vcha[04113] [Originator@6876 sub=Agent] Witness joined
YYYY-MM-DDTHH:MM:SS.029Z info vcha[04101] [Originator@6876 sub=VchaUtil] Successful /usr/lib/
  • At the same time, the VCHA logs of node N1 (the node that was isolated) contain lines similar to the following, indicating that N1 lost contact with its peers and was isolated while N2 took over as primary; a correlation sketch follows this excerpt (vcha*.log on N1):
YYYY-MM-DDTHH:MM:SS.183Z error vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Timeout for slave @ 192.168.111.152
YYYY-MM-DDTHH:MM:SS.183Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Marking slave 192.168.111.152 as unreachable
YYYY-MM-DDTHH:MM:SS.183Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Beginning ICMP pings every 1000000 microseconds to 192.168.111.152
YYYY-MM-DDTHH:MM:SS.183Z error vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Timeout for slave @ 192.168.111.153
YYYY-MM-DDTHH:MM:SS.183Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Marking slave 192.168.111.153 as unreachable
YYYY-MM-DDTHH:MM:SS.183Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Beginning ICMP pings every 1000000 microseconds to 192.168.111.153
YYYY-MM-DDTHH:MM:SS.183Z verbose vcha[43268] [Originator@6876 sub=Invt opID=SWI-618fb492] Get 0 VMs on host 192.168.111.152 in 1 state
YYYY-MM-DDTHH:MM:SS.183Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Slave id is : 192.168.111.152 1
YYYY-MM-DDTHH:MM:SS.183Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Slave id is : 192.168.111.153 1
YYYY-MM-DDTHH:MM:SS.183Z info vcha[43268] [Originator@6876 sub=Invt opID=SWI-618fb492] Host 192.168.111.152 changed state: FDMUnreachable
YYYY-MM-DDTHH:MM:SS.183Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] MASTER 192.168.111.151
YYYY-MM-DDTHH:MM:SS.183Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Quorum: NO
YYYY-MM-DDTHH:MM:SS.183Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Node is isolated
YYYY-MM-DDTHH:MM:SS.183Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Notify isolation
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=Agent opID=SWI-21980f48] This node is isolated
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[43268] [Originator@6876 sub=Invt opID=SWI-618fb492] Get 0 VMs on host 192.168.111.153 in 1 state
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43268] [Originator@6876 sub=Invt opID=SWI-618fb492] Host 192.168.111.153 changed state: FDMUnreachable
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Slave id is : 192.168.111.152 1
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Slave id is : 192.168.111.153 1
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40291] [Originator@6876 sub=Agent] Processing event kIsolated
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] MASTER 192.168.111.151
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Quorum: NO
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Node is isolated
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=ClusterMgr opID=SWI-21980f48] Notify isolation
YYYY-MM-DDTHH:MM:SS.184Z info vcha[40302] [Originator@6876 sub=Agent opID=SWI-21980f48] This node is isolated
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Waited 5 seconds for icmp ping reply for host 192.168.111.152
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] No datastore heartbeat checking => Dead
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] Waited 5 seconds for icmp ping reply for host 192.168.111.153
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[40297] [Originator@6876 sub=Cluster opID=SWI-41a7] No datastore heartbeat checking => Dead
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[43239] [Originator@6876 sub=Invt opID=SWI-6135133] Get 0 VMs on host 192.168.111.152 in 2 state
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43239] [Originator@6876 sub=Invt opID=SWI-6135133] Host 192.168.111.152 changed state: Dead
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Slave id is : 192.168.111.152 2
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Slave id is : 192.168.111.153 2
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] MASTER 192.168.111.151
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Quorum: NO
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Node is isolated
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Notify isolation
YYYY-MM-DDTHH:MM:SS.184Z verbose vcha[43239] [Originator@6876 sub=Invt opID=SWI-6135133] Get 0 VMs on host 192.168.111.153 in 2 state
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=Agent opID=SWI-5e37ee62] This node is isolated
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43239] [Originator@6876 sub=Invt opID=SWI-6135133] Host 192.168.111.153 changed state: Dead
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Slave id is : 192.168.111.152 2
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Slave id is : 192.168.111.153 2
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] MASTER 192.168.111.151
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Quorum: NO
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Node is isolated
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=ClusterMgr opID=SWI-5e37ee62] Notify isolation
YYYY-MM-DDTHH:MM:SS.184Z info vcha[43294] [Originator@6876 sub=Agent opID=SWI-5e37ee62] This node is isolated
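  • A quick way to correlate these events is to extract the isolation and quorum markers, with their timestamps, from the vcha logs on each node and compare the timelines. Below is a minimal Python sketch; it assumes the default /var/log/vmware/vcha location and relies only on marker strings quoted in the excerpts above:

import glob

# Default vcha log location on the appliance (per the excerpts above).
LOG_GLOB = "/var/log/vmware/vcha/vcha*.log"

# Marker strings taken from the excerpts in this article.
MARKERS = (
    "Node is isolated",
    "Quorum: NO",
    "Marking slave",
    "Successfully picked up primary role",
)

def isolation_timeline(log_glob=LOG_GLOB):
    """Print matching vcha log lines in file order so the isolation and
    takeover sequence can be read as a timeline."""
    for path in sorted(glob.glob(log_glob)):
        with open(path, errors="replace") as fh:
            for line in fh:
                if any(marker in line for marker in MARKERS):
                    print(f"{path}: {line.rstrip()}")

if __name__ == "__main__":
    isolation_timeline()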

Remediation

  • The way to fix such a condition is to ensure that the network used for VCHA is reliable and has less than 10 ms of latency; a latency-check sketch follows this list. An unreliable network causes the system to react to network outages in order to maximize the availability of vCenter Server.
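  • The sketch below samples the round-trip time from one VCHA node to its peers using the system ping utility and flags averages above the 10 ms requirement. The peer addresses shown are placeholders taken from the example logs above; replace them with the VCHA network addresses of your passive and witness nodes:

import re
import subprocess

# Placeholder peer addresses (from the example logs above); replace with the
# VCHA network IPs of the passive and witness nodes.
PEERS = ["192.168.111.152", "192.168.111.153"]
THRESHOLD_MS = 10.0  # VCHA expects less than 10 ms of round-trip latency

def avg_rtt_ms(peer, count=10):
    """Return the average round-trip time to a peer in milliseconds,
    or None if the peer does not respond."""
    result = subprocess.run(["ping", "-c", str(count), peer],
                            capture_output=True, text=True)
    if result.returncode != 0:
        return None
    # Linux ping summary line: "rtt min/avg/max/mdev = 0.20/0.31/0.45/0.05 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    for peer in PEERS:
        rtt = avg_rtt_ms(peer)
        if rtt is None:
            print(f"{peer}: no ping reply")
        else:
            status = "OK" if rtt < THRESHOLD_MS else "above the 10 ms requirement"
            print(f"{peer}: average rtt {rtt:.2f} ms ({status})")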

Network isolation of both the active and passive nodes

  • If both the active and passive nodes experience a network outage, both nodes become isolated and the current active node shuts itself down.

Remediation

  • The remediation for network isolation is the same in both cases: ensure that the VCHA network is reliable and meets the latency requirement.

Loss of active node

  • Due to underlying hardware failures, the active node (N1) can be lost (for example, if the ESXi host it runs on crashes). In that case, N2 tries to take over as the active node.
  • If you inspect the VCHA logs of node N2, you will see entries indicating this takeover (vcha*.log); a sketch for timing the takeover follows the excerpt below:
YYYY-MM-DDTHH:MM:SS.190Z info vcha[02683] [Originator@6876 sub=Election opID=SWI-1] Election State Change Slave => Startup : Lost master
YYYY-MM-DDTHH:MM:SS.190Z info vcha[02683] [Originator@6876 sub=Cluster opID=SWI-1] Change election state to Startup:0
YYYY-MM-DDTHH:MM:SS.190Z verbose vcha[02693] [Originator@6876 sub=Cluster opID=SWI-41a7] Transitioned from election state Slave to Startup
YYYY-MM-DDTHH:MM:SS.190Z info vcha[02693] [Originator@6876 sub=Cluster opID=SWI-41a7] curState 1 lastState 4
YYYY-MM-DDTHH:MM:SS.191Z info vcha[03480] [Originator@6876 sub=Invt opID=SWI-fc4111] Cluster state changed to Startup
YYYY-MM-DDTHH:MM:SS.191Z info vcha[02700] [Originator@6876 sub=ClusterMgr opID=SWI-3113c398] MASTER
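  • To quantify the takeover in this scenario, the elapsed time between N2 losing the master and N2 picking up the primary role can be read out of its vcha logs. Below is a minimal Python sketch; it assumes the default /var/log/vmware/vcha location and relies only on the marker strings and timestamp format shown in the excerpts in this article:

import glob
import re
from datetime import datetime

# Default vcha log location on the appliance (per the excerpt above).
LOG_GLOB = "/var/log/vmware/vcha/vcha*.log"

# ISO-8601 timestamp at the start of each vcha log line.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z)")

def takeover_duration(log_glob=LOG_GLOB):
    """Return the elapsed time between the first 'Lost master' event and the
    last 'Successfully picked up primary role' event, or None if not found."""
    lost = picked_up = None
    for path in sorted(glob.glob(log_glob)):
        with open(path, errors="replace") as fh:
            for line in fh:
                ts_match = TS_RE.match(line)
                if not ts_match:
                    continue
                ts = datetime.strptime(ts_match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
                if "Lost master" in line and lost is None:
                    lost = ts
                elif "Successfully picked up primary role" in line:
                    picked_up = ts
    return picked_up - lost if (lost and picked_up) else None

if __name__ == "__main__":
    delta = takeover_duration()
    print(f"Takeover took {delta}" if delta is not None else "Takeover markers not found")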

Remediation

  • In such cases, the logs of node N1 need to be examined, along with the logs of the ESXi host on which N1 was running, to accurately determine the cause of the failure.