vSphere HA did not failover the VMs when the ESXi host was isolated from the network due to frequent NIC flapping
Article ID: 388068

Products

VMware vSphere ESXi

Issue/Introduction

  • The ESXi host stops responding in vCenter Server and all of its VMs show as disconnected.

    [Alarm 'Host connection failure' on <hostname>.example.com triggered by event 'Host <hostname>.example.com in <Datacenter> is not responding']

  • Although the ESXi host and its virtual machines are not responding, vSphere HA does not fail over the VMs to another host until the failed ESXi host is forcibly restarted.
  • The vSphere HA log (/var/run/log/fdm.log) on the impacted host reports "Host isolated is true", but within a few seconds the status changes back to "Host isolated is false":

    YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is true
    YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Cluster opID=clusterElection.cpp] Connected to master @ host-<moid>
    YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is false

    YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is true
    YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Cluster opID=clusterElection.cpp] Connected to master @ host-<moid>
    YYYY-MM-DDTHH:MM:SSZ In(166) Fdm[pid]: [Originator@6876 sub=Policy opID=clusterManager.cpp] Host isolated is false
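
To confirm this pattern across a longer log, the true/false transitions can be paired and timed. The following is a minimal Python sketch, not a VMware tool. It assumes every relevant line starts with an ISO 8601 UTC timestamp ending in "Z" and contains the "Host isolated is true/false" text exactly as in the excerpt above; the function names are illustrative.

```python
import re
from datetime import datetime

# Assumption: lines start with an ISO 8601 UTC timestamp ending in "Z"
# and carry the "Host isolated is ..." message seen in the excerpt.
LINE_RE = re.compile(r"^\s*(\S+Z)\s.*Host isolated is (true|false)")


def parse_ts(text):
    # Older Python versions reject a trailing "Z", so swap it for an offset.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def isolation_episodes(lines):
    """Return the duration in seconds of each true -> false episode."""
    durations = []
    started = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # unrelated log line
        ts = parse_ts(match.group(1))
        if match.group(2) == "true":
            if started is None:
                started = ts  # isolation episode begins
        elif started is not None:
            durations.append((ts - started).total_seconds())
            started = None  # episode ended, wait for the next one
    return durations
```

Calling `isolation_episodes(open("fdm.log"))` on a copy of the log returns one duration per episode. Many episodes that each last only a few seconds are consistent with the flapping described in this article.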

Cause

This issue occurs when the vmnic used for vSphere HA heartbeats flaps frequently, so that the connection is lost and restored within a few seconds. In a vSAN-enabled cluster, vSphere HA uses the vSAN storage network for HA heartbeats.

Resolution

This is expected behaviour for vSphere HA when the NIC flaps frequently. When a host becomes isolated, it waits 30 seconds before initiating the isolation response. If the NIC keeps flapping and the FDM heartbeat is restored within those 30 seconds, the host leaves the isolated state before the isolation response runs, so vSphere HA does not fail over the VMs.
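
The timing rule can be illustrated with a small Python model. This is a simplified sketch of the 30-second delay described in this article, not FDM's actual implementation. The function name and the representation of outages as (start, end) pairs in seconds are assumptions made for illustration.

```python
ISOLATION_DELAY_S = 30  # delay before isolation actions, as stated in this article


def isolation_response_fires(outages, delay=ISOLATION_DELAY_S):
    """Return True if any single outage lasts at least `delay` seconds.

    outages: list of (start, end) times in seconds during which the
    host is isolated. Each outage is judged on its own because the wait
    starts over whenever the heartbeat is restored.
    """
    return any(end - start >= delay for start, end in outages)


# Flapping NIC: three short outages, none reaching 30 seconds.
flapping = [(0, 8), (20, 27), (40, 52)]
# Hard failure: one continuous 45-second outage.
hard_failure = [(0, 45)]

print(isolation_response_fires(flapping))      # False
print(isolation_response_fires(hard_failure))  # True
```

In this model the flapping host accumulates 27 seconds of total outage without ever triggering the isolation response, while a single continuous outage of 30 seconds or more does trigger it.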