In the event of a host failure in a vSphere HA enabled cluster, virtual machines should fail over to another healthy host, but the failover does not occur. As a result, the virtual machines remain powered off after the host failure.
ESXi 8.0U3
From the FDM logs, when the host goes down:
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] Heartbeat still pending for slave @ host-<hostNumber>
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] Heartbeat still pending for slave @ host-<hostNumber>
YYYY-MM-DDTHH:MM:SS.sssZ In(166) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] hostId=host-<hostNumber> state=Master master=None isolated=false host-list-version=1140 config-version=3171 vm-metadata-version=13138 slv-mst-tdiff-sec=0
YYYY-MM-DDTHH:MM:SS.sssZ Er(163) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] Timeout for slave @ host-<hostNumber>
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] Marking slave host-<hostNumber> as unreachable
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] Beginning ICMP pings every 1000000 microseconds to host-<hostNumber>
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100537] [Originator@6876 sub=Cluster opID=clusterManager.cpp:<UUID>] Start polling for ipV4 socket; sock: 7, vmknic: no_bind
HA is refusing to restart the VMs:
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Removed 0 of 17 vms
YYYY-MM-DDTHH:MM:SS.sssZ In(166) Fdm[2100542] [Originator@6876 sub=Policy opID=WorkQueue-<UUID>] Sending a list of 17 VMs to the placement manager for placement
YYYY-MM-DDTHH:MM:SS.sssZ In(166) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] 17 Vms added, 17 VmRecord created
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Invoking the RPE + SPE Placement Engine
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Reevaluate all to-be-placed Vms.
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Failover operation in progress on 18 Vms: 0 VMs being restarted, 18 VMs waiting for a retry, 0 VMs waiting for resources, 0 inaccessible vSAN VMs.
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Vm /vmfs/volumes/<DatastoreUUID>/<VMFolder>/<VMName>.vmx will not be restarted because Host monitoring is disabled:
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Vm /vmfs/volumes/<DatastoreUUID>/<VMFolder>/<VMName>.vmx will not be restarted because Host monitoring is disabled:
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Vm /vmfs/volumes/<DatastoreUUID>/<VMFolder>/<VMName>.vmx will not be restarted because Host monitoring is disabled:
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Vm /vmfs/volumes/<DatastoreUUID>/<VMFolder>/<VMName>.vmx will not be restarted because Host monitoring is disabled:
YYYY-MM-DDTHH:MM:SS.sssZ Db(167) Fdm[2100542] [Originator@6876 sub=Placement opID=WorkQueue-<UUID>] Vm /vmfs/volumes/<DatastoreUUID>/<VMFolder>/<VMName>.vmx will not be restarted because Host monitoring is disabled:
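When triaging a log bundle, lines carrying this signature identify exactly which VMs FDM declined to place. A minimal sketch for pulling the affected .vmx paths out of fdm.log; the signature string and path pattern are taken from the log lines above and may need adjusting for other builds:

```python
import re

# Message emitted by the FDM placement engine when Host Monitoring is off
# (copied from the log excerpt above).
SIGNATURE = "will not be restarted because Host monitoring is disabled"

# A .vmx path as it appears in the placement messages.
VMX_RE = re.compile(r"Vm (/vmfs/volumes/\S+\.vmx)")

def affected_vms(log_lines):
    """Return the set of .vmx paths that FDM declined to restart."""
    hits = set()
    for line in log_lines:
        if SIGNATURE in line and (m := VMX_RE.search(line)):
            hits.add(m.group(1))
    return hits
```

Running this over the fdm.log of the master host gives a de-duplicated list of the VMs affected by the disabled Host Monitoring setting, which is useful for confirming the scope of the outage.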
This is expected behavior when the HA Host Monitoring setting is set to "Disabled".
Check the HA settings for the cluster (select the cluster, then Configure > vSphere Availability > Edit):
If Host Monitoring is set to "Disabled", set it to "Enabled", then disable and re-enable HA on the cluster.
Note: Host Monitoring is set to "Enabled" by default. It would have to be manually disabled for this scenario to occur.
If the UI shows "Enabled" but the logs indicate that Host Monitoring is disabled, there is a mismatch between the UI and the internal HA configuration on the cluster. Disable and re-enable HA to resolve the mismatch.
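The cluster-side value can also be read programmatically, which is handy for comparing it against what the UI shows. A minimal pyVmomi sketch, assuming pyVmomi is installed and `si` is an already-established `ServiceInstance` connection (connection setup and the cluster name are placeholders); the `dasConfig.hostMonitoring` property holds the string "enabled" or "disabled":

```python
def needs_reenable(host_monitoring: str) -> bool:
    """True when the cluster-side Host Monitoring value is anything but 'enabled'."""
    return host_monitoring.strip().lower() != "enabled"

def get_host_monitoring(si, cluster_name: str) -> str:
    """Read dasConfig.hostMonitoring for the named cluster.

    Import is deferred so the helper above works without pyVmomi installed.
    """
    from pyVmomi import vim  # assumption: pyVmomi is available

    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    try:
        for cluster in view.view:
            if cluster.name == cluster_name:
                return cluster.configurationEx.dasConfig.hostMonitoring
        raise LookupError(f"cluster {cluster_name!r} not found")
    finally:
        view.Destroy()
```

If `get_host_monitoring` returns "disabled" while the UI shows "Enabled", that confirms the mismatch described above; remediate through the vSphere Client by disabling and re-enabling HA rather than scripting the reconfigure.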