VMware ESXi
This issue occurs when changes to the physical network infrastructure (such as firewall or DNS maintenance) cause physical NIC link flaps on multiple hosts in the cluster. If a host's management uplink (e.g., vmnic0) stays in a Link Down state for a prolonged period (typically more than 30 seconds), the vSphere HA Master host declares that host Dead and initiates failover of its protected VMs.
Review the vmkernel.log on the affected ESXi hosts to identify the exact timestamps of link state changes:
YYYY-MM-DDTHH:MM:SSZ cpu<ID>:netschedHClk: NetSchedHClkNotify: vmnic0: link down notification
YYYY-MM-DDTHH:MM:SSZ cpu<ID>:netschedHClk: NetSchedHClkNotify: vmnic0: link up notification
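To measure how long each uplink stayed down, the down/up notifications can be paired programmatically. The following is a minimal Python sketch, assuming the log lines follow the format shown above (with optional fractional seconds in the timestamp). The function names and the sample lines are hypothetical and for illustration only; they are not taken from a real host.

```python
import re
from datetime import datetime

# Matches the link notification lines shown above; fractional seconds are optional.
LINK_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?)Z\s.*?"
    r"(vmnic\d+): link (down|up) notification"
)

def parse_ts(text):
    """Parse a UTC timestamp without the trailing 'Z'."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in text else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(text, fmt)

def link_outages(lines):
    """Return (nic, down_time, up_time, seconds) for each completed down/up pair."""
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts, nic, state = parse_ts(m.group(1)), m.group(2), m.group(3)
        if state == "down":
            # Keep the earliest down notification if several arrive in a row.
            down_since.setdefault(nic, ts)
        elif nic in down_since:
            start = down_since.pop(nic)
            outages.append((nic, start, ts, (ts - start).total_seconds()))
    return outages

# Illustrative sample lines (made-up timestamps and CPU ID).
sample = [
    "2024-01-01T10:00:00Z cpu3:netschedHClk: NetSchedHClkNotify: vmnic0: link down notification",
    "2024-01-01T10:00:45Z cpu3:netschedHClk: NetSchedHClkNotify: vmnic0: link up notification",
]

for nic, down, up, seconds in link_outages(sample):
    print(f"{nic}: down {down} -> up {up} ({seconds:.0f}s)")
```

Outages longer than roughly 30 seconds on the management uplink are the ones worth comparing against the HA logs.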
Examine the fdm.log on the Master host to correlate the link loss with the host state change to Dead:
YYYY-MM-DDTHH:MM:SSZ info fdm[PID] [Originator@6876 sub=Invt] Host host-<ID> changed state: Dead
YYYY-MM-DDTHH:MM:SSZ verbose fdm[PID] [Originator@6876 sub=Placement] Issue failover start event for <#> Vms
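The correlation between a link outage and the Dead declaration can also be checked with a short script. This is a sketch only, assuming the fdm.log line format shown above and that the clocks on the affected host and the Master host are synchronized (for example via NTP). The function name, host ID, PID, and timestamps in the sample are hypothetical.

```python
import re
from datetime import datetime

# Matches the "changed state: Dead" lines shown above.
DEAD_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?)Z\s.*"
    r"Host (host-\S+) changed state: Dead"
)

def parse_ts(text):
    """Parse a UTC timestamp without the trailing 'Z'."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in text else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(text, fmt)

def dead_events_in_window(fdm_lines, link_down, link_up):
    """Return (host, time, seconds_after_link_down) for Dead events inside the window."""
    hits = []
    for line in fdm_lines:
        m = DEAD_RE.search(line)
        if not m:
            continue
        ts = parse_ts(m.group(1))
        if link_down <= ts <= link_up:
            hits.append((m.group(2), ts, (ts - link_down).total_seconds()))
    return hits

# Illustrative input: a made-up outage window and a made-up fdm.log line.
window_start = parse_ts("2024-01-01T10:00:00")
window_end = parse_ts("2024-01-01T10:00:45")
fdm_sample = [
    "2024-01-01T10:00:32Z info fdm[2100000] [Originator@6876 sub=Invt] Host host-42 changed state: Dead",
]

for host, ts, delay in dead_events_in_window(fdm_sample, window_start, window_end):
    print(f"{host} declared Dead {delay:.0f}s after link down")
```

A Dead event that falls inside the outage window supports the link flap as the trigger; one outside every window points to a different cause.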
Confirm the placement and restart of VMs on surviving hosts:
YYYY-MM-DDTHH:MM:SSZ verbose fdm[PID] [Originator@6876 sub=Execution] Place /vmfs/volumes/<UUID>/<VM_NAME>/<VM_NAME>.vmx on host-<ID>
YYYY-MM-DDTHH:MM:SSZ verbose fdm[PID] [Originator@6876 sub=FDM] New event: EventEx=com.vmware.vc.ha.VmRestartedByHAEvent vm=/vmfs/volumes/<VOLUME>/<VM_NAME>.vmx host=host-<ID>
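When many VMs were restarted, it can help to summarize the restart events by destination host. The sketch below assumes the VmRestartedByHAEvent line format shown above; the function name and the sample line (datastore, VM name, host ID, PID) are hypothetical.

```python
import re
from collections import defaultdict

# Captures the .vmx path and the destination host from restart events.
RESTART_RE = re.compile(
    r"EventEx=com\.vmware\.vc\.ha\.VmRestartedByHAEvent vm=(.+?) host=(host-\S+)"
)

def restarted_vms_by_host(fdm_lines):
    """Group restarted VM paths by the host they were restarted on."""
    by_host = defaultdict(list)
    for line in fdm_lines:
        m = RESTART_RE.search(line)
        if m:
            by_host[m.group(2)].append(m.group(1))
    return dict(by_host)

# Illustrative sample line (made-up values).
restart_sample = [
    "2024-01-01T10:01:10Z verbose fdm[2100000] [Originator@6876 sub=FDM] New event: "
    "EventEx=com.vmware.vc.ha.VmRestartedByHAEvent "
    "vm=/vmfs/volumes/datastore1/app01/app01.vmx host=host-7",
]

for host, vms in restarted_vms_by_host(restart_sample).items():
    print(f"{host}: {len(vms)} VM(s) restarted")
    for vm in vms:
        print(f"  {vm}")
```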
Workaround/Prevention: