[Alarm 'Host connection failure' on <hostname>.example.com triggered by event 'Host <hostname>.example.com in <Datacenter> is not responding']
The log (/var/run/log/fdm.log) on the impacted host shows "isolated true", but within a few seconds the status changes to "isolated false":
YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is true
YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Cluster opID=clusterElection.cpp] Connected to master @ host-<moid>
YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is false
YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is true
YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Cluster opID=clusterElection.cpp] Connected to master @ host-<moid>
YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is false
This issue is observed when the vmnic used for vSphere HA heartbeats flaps frequently, so that the connection is dropped and restored within a few seconds. In a vSAN-enabled cluster, vSphere HA uses the storage network for HA heartbeats.
This is expected behaviour of vSphere HA when the NIC flaps frequently. When a host becomes isolated, it waits 30 seconds before initiating the isolation actions. If the FDM heartbeat is restored within those 30 seconds, vSphere HA does not fail over the VMs.
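To confirm that each isolation episode ended before the 30-second wait expired, the "Host isolated is true/false" entries can be paired and timed. The following is a minimal Python sketch, not a VMware tool. It assumes the timestamps are ISO 8601 UTC (optionally with fractional seconds) as in the log excerpt above, and the function names isolation_windows and summarize are hypothetical helpers introduced only for illustration.

```python
import re
from datetime import datetime, timedelta

# Wait time before isolation actions start, as stated in this article.
ISOLATION_DELAY = timedelta(seconds=30)

# Assumed line shape: "<ISO 8601 UTC timestamp>Z ... Host isolated is true|false"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z\s"
    r".*Host isolated is (?P<state>true|false)\s*$"
)


def isolation_windows(lines):
    """Return (start, end, duration) for every isolated true -> false pair."""
    windows = []
    started = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # unrelated entries, e.g. "Connected to master"
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        if match.group("state") == "true":
            if started is None:
                started = ts  # first "true" opens the window
        elif started is not None:
            windows.append((started, ts, ts - started))
            started = None
    return windows


def summarize(lines):
    """Print one line per isolation episode, compared with the 30-second wait."""
    for start, end, duration in isolation_windows(lines):
        if duration < ISOLATION_DELAY:
            verdict = "restored before the 30-second wait expired"
        else:
            verdict = "lasted 30 seconds or longer"
        print(f"{start.isoformat()} -> {end.isoformat()}: "
              f"{int(duration.total_seconds())}s, {verdict}")
```

To use it, pass the lines of a copy of fdm.log to summarize(), for example summarize(open("fdm.log")), where the local file name is an assumption. Episodes reported as restored before the wait expired match the flapping pattern described above.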