vSphere HA Failover Triggered by Sustained Physical NIC Flapping During Network Maintenance

Article ID: 438264

Products

VMware vSphere ESXi

Issue/Introduction

Unexpected vSphere HA failover events occur for multiple Virtual Machines (VMs).
ESXi hosts intermittently transition through states: Live > Unreachable > Dead > FDMUnreachable.
VMs are migrated to other hosts or rebooted automatically by vSphere HA.
vCenter Server reports "Host connection failure" or "Host is not responding" alarms.
The HA failover event may have been triggered by a reboot of a leaf or core physical switch.
The network maintenance may also have caused missed storage heartbeats and APD (All Paths Down) conditions on the vSAN datastore.

Environment

VMware ESXi

Cause

This issue occurs when physical network infrastructure changes (such as a switch reboot, or firewall or DNS maintenance) cause physical NIC link flaps across multiple hosts in the cluster. If a host's management uplink (for example, vmnic0) stays in a Link Down state for a prolonged period (typically more than 30 seconds), the vSphere HA Master host declares that host Dead and initiates failover actions for its protected VMs.

Resolution

1. Verify Physical Link Flaps

Review the vmkernel.log on the affected ESXi hosts to identify the exact timestamps of link state changes:

YYYY-MM-DDTHH:MM:SSZ cpu<ID>:netschedHClk: NetSchedHClkNotify: vmnic0: link down notification
YYYY-MM-DDTHH:MM:SSZ cpu<ID>:netschedHClk: NetSchedHClkNotify: vmnic0: link up notification
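
To find which flaps lasted long enough to matter, the down/up pairs can be matched programmatically. The following Python sketch is illustrative only: it assumes log lines shaped like the sanitized samples above (real vmkernel.log lines may carry additional fields), and the function name link_outages and its threshold_sec parameter are hypothetical, not part of any VMware tool.

```python
import re
from datetime import datetime

# Matches the timestamp, NIC name, and link state from lines shaped like the
# samples above. Fractional seconds, if present, are ignored.
LINK_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z\s.*?"
    r"(vmnic\d+): link (down|up) notification"
)


def link_outages(lines, threshold_sec=30):
    """Return (nic, down_time, up_time, seconds) for outages longer than threshold_sec."""
    down_since = {}  # NIC name -> time of the first unmatched 'link down'
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
        nic, state = m.group(2), m.group(3)
        if state == "down":
            # Keep the earliest 'down' if several arrive before an 'up'.
            down_since.setdefault(nic, ts)
        elif nic in down_since:
            start = down_since.pop(nic)
            seconds = (ts - start).total_seconds()
            if seconds > threshold_sec:
                outages.append((nic, start, ts, seconds))
    return outages
```

Calling link_outages(open("vmkernel.log")) on a copy of the log would list only the outages longer than 30 seconds. A NIC that is still down at the end of the file is not reported by this sketch.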

2. Analyze vSphere HA (FDM) Logs

Examine the fdm.log on the Master host to correlate the link loss with the host state change to Dead:

YYYY-MM-DDTHH:MM:SSZ info fdm[PID] [Originator@6876 sub=Invt] Host host-<ID> changed state: Dead
YYYY-MM-DDTHH:MM:SSZ verbose fdm[PID] [Originator@6876 sub=Placement] Issue failover start event for <#> Vms
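
The correlation described in this step can be sketched in Python as follows. This is a minimal example under stated assumptions: the fdm.log lines are shaped like the samples above, both logs use UTC timestamps, and the names dead_events, dead_during_outage, and slack_sec are hypothetical helpers introduced here for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp and host ID from "changed state: Dead" lines shaped
# like the sample above. Fractional seconds, if present, are ignored.
DEAD_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z\s.*"
    r"Host (host-\S+) changed state: Dead"
)


def dead_events(fdm_lines):
    """Return a list of (timestamp, host_id) for each Dead state change."""
    events = []
    for line in fdm_lines:
        m = DEAD_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
            events.append((ts, m.group(2)))
    return events


def dead_during_outage(events, outage_start, outage_end, slack_sec=60):
    """Return the events that fall inside the link outage window.

    slack_sec extends the end of the window to tolerate small clock
    differences between hosts; the value is an arbitrary assumption.
    """
    window_end = outage_end + timedelta(seconds=slack_sec)
    return [(ts, host) for ts, host in events if outage_start <= ts <= window_end]
```

A Dead event that falls inside a link outage window from the affected host's vmkernel.log supports the conclusion that the flap, and not a genuine host failure, triggered the failover.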

3. Review VM Restart Events

Confirm the placement and restart of VMs on surviving hosts:

YYYY-MM-DDTHH:MM:SSZ verbose fdm[PID] [Originator@6876 sub=Execution] Place /vmfs/volumes/<UUID>/<VM_NAME>/<VM_NAME>.vmx on host-<ID>
YYYY-MM-DDTHH:MM:SSZ verbose fdm[PID] [Originator@6876 sub=FDM] New event: EventEx=com.vmware.vc.ha.VmRestartedByHAEvent vm=/vmfs/volumes/<VOLUME>/<VM_NAME>.vmx host=host-<ID>
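
To summarize which VMs were restarted and where, the restart events can be grouped by destination host. The Python sketch below is illustrative and assumes event lines shaped like the VmRestartedByHAEvent sample above; the function name restarts_by_host is hypothetical.

```python
import re
from collections import defaultdict

# Captures the .vmx path and destination host from restart events shaped like
# the sample above. The lazy match on vm= tolerates spaces in VM names.
RESTART_RE = re.compile(
    r"EventEx=com\.vmware\.vc\.ha\.VmRestartedByHAEvent\s+vm=(.+?)\s+host=(\S+)"
)


def restarts_by_host(fdm_lines):
    """Return {host_id: [vmx_path, ...]} for every HA restart event found."""
    result = defaultdict(list)
    for line in fdm_lines:
        m = RESTART_RE.search(line)
        if m:
            result[m.group(2)].append(m.group(1))
    return dict(result)
```

Comparing the total number of restarted VMs with the count in the "Issue failover start event" line is one way to check that every protected VM was restarted.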

Workaround/Prevention:

Network Redundancy: Ensure each host has redundant physical uplinks for the Management Network, ideally connected to separate physical switches. If one link flaps, the other link can continue to carry the HA heartbeat.
Maintenance Coordination: Before performing network or firewall changes, consider placing the affected ESXi hosts into Maintenance Mode or temporarily disabling vSphere HA on the cluster to prevent unnecessary failover events during brief, expected disconnections.
Advanced Settings: Review das.config.fdm.isolationPolicyDelaySec (default 30 seconds), which sets how long a host waits after determining that it is isolated before it executes the isolation response. If your network environment is prone to longer flaps, this value can be increased, though doing so also delays the HA response during genuine failures.

Additional Information

vSphere HA did not failover the VMs when the ESXi host was isolated from the network due to frequent NIC flapping

Advanced configuration options for VMware High Availability in vSphere 5.x, 6.x, 7.x and 8.x
