Determining if your VMware vSphere HA cluster has experienced a host failure

search cancel

Determining if your VMware vSphere HA cluster has experienced a host failure

book

Article ID: 324992

calendar_today

Updated On:

Products

VMware vCenter Server VMware vSphere ESXi

Issue/Introduction

This article provides steps to determine if your VMware vSphere High Availability (HA) cluster has experienced a host failure and helps to identify what to look for in vCenter Server and host log files.

Environment

VMware vSphere 6.x

VMware vSphere 7.x

VMware vSphere 8.x

Resolution

To determine if your vSphere HA cluster has experienced a host failure, perform these steps:

Review vCenter Server events
Review vCenter Server logs
Review vmksummary log files on the hosts
Review FDM log files on the master and slave hosts

Review vCenter Server events

To review vCenter Server events:

In vSphere Client, click the Tasks & Events tab.
Click Events.
Search for events with vSphere HA in the description.
In the event of a host failure, you see messages similar to:

vSphere HA detected a host failure

vSphere HA detected a possible host failure of this host

Review vCenter Server logs

To review vCenter Server logs:

On the vCenter Server, navigate to the vpxd-*.log file. For more information, see Location of vCenter Server log files (1021804).

Search for the string, FDM state in the vpxd-*.log file.

Note: In this example, three hosts, host-1208, host-1214 and host-409 failed at the same time (14:28). The master of the cluster is host-406.

You see output similar to:

T14:28:01.491+02:00 [145400 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)
T14:28:05.126+02:00 [143416 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)
T14:28:05.640+02:00 [143416 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)
T14:28:10.320+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)
T14:28:10.898+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)
T14:28:10.913+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)

Note: hostIds differ from hostnames you specify in your environment. For more information on mapping between the hostname and the hostId see How to determine the mapping between hostname and hostId in a VMware HA cluster (2037000).

When the outage is resolved, the hosts start up. You see output similar to:

T14:34:31.039+02:00 [141936 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (Dead -> FDMUnreachable), src of state (host-406 -> host-406)
T14:35:12.332+02:00 [144148 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (FDMUnreachable -> Live), src of state (host-406 -> host-406)
T14:34:40.056+02:00 [143832 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (Dead -> FDMUnreachable), src of state (host-406 -> host-406)
T14:35:20.772+02:00 [140480 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (FDMUnreachable -> Live), src of state (host-406 -> host-406)
T14:35:20.351+02:00 [143476 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (Dead -> FDMUnreachable), src of state (host-406 -> host-406)
T14:36:16.417+02:00 [135096 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (FDMUnreachable -> Uninitialized), src of state (host-406 -> host-409)
T14:36:19.038+02:00 [135096 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (Uninitialized -> Live), src of state (host-409 -> host-406)

Review vmksummary log files on the hosts

Check the vmksummary.log files for information on the type of outage, for example, a power outage or a purple diagnostic screen failure. The vmksummary.log file contains bootstop messages indicating start up and shutdown of the ESXi host.

To review vmksummary logs:

Log in to the three ESXi hosts as the root user.
Navigate to the vmksummary.log file (located in /var/log/) on the three hosts.

Note: When a host shuts down unexpectedly, you do not see log information to indicate the shutdown. The logs indicate a start up only. In this example, all hosts start at the same time (12:35).

You see output similar to:

Host 1

T12:35:00Z bootstop: Host has booted

Host 2

T12:35:03Z bootstop: Host has booted

Host 3

T12:35:52Z bootstop: Host has booted

Note: In the case of a purple diagnostic screen failure, you see the message:

bootstop: Host has booted
bootstop: Core dump found

Note: Host log file time stamps may not be identical to vCenter Server log time stamps due to time zone or syncing settings.

Review FDM log files on the master and slave hosts

When vCenter Server logs are not available, you can review FDM files on the affected master and slave hosts.

To review fdm.log files on the affected master and slave hosts:

Note: You can check the vSphere HA host configuration in vSphere Client. Click to select the vSphere HA cluster, then click the Host tab. The vSphere HA State column indicates if the host is a master or a slave.

Log in to the master and slave hosts as the root user.
Navigate to the fdm.log files located at /var/log/).

First, review the fdm.log file on the master host.

The master host monitors the state of the slave hosts in the cluster. This communication is done through the exchange of network heartbeats every second. As soon as the slave has failed, these heartbeats will no longer be received by the master. You see output similar to:

T12:27:56.848Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:27:56.849Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1214
T12:27:56.849Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-409
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208

When the master host stops receiving network heartbeats from a slave host, it checks for the datastore heartbeat of the host, before declaring that the host has failed. The master host performs the heartbeat check to determine whether or not the slave host is exchanging heartbeats with datastores. You see output similar to:

T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] path /vmfs/volumes/Datastore_UUID slave host-1208
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] slave host-1208 uuid.mac xx:xx:xx:xx:xx:xx
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] Forcing heartbeat check on datastore /vmfs/volumes/Datastore_UUID for slave host-1208
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] path /vmfs/volumes/Datastore_UUID slave host-1208
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] slave host-1208 uuid.mac xx:xx:xx:xx:xx:xx
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] Forcing heartbeat check on datastore /vmfs/volumes/Datastore_UUID for slave host-1208
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::StartCheckingDatastoreHeartbeats] Starting datastore heartbeat checking for slave host-1208
T12:27:58.856Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:27:59.857Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:00.859Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:01.860Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:02.863Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:03.866Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:04.868Z [48F4CB90 error 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Timeout for slave @ host-1208

The master host also checks whether the host responds to ICMP pings sent to its management IP addresses. You see output similar to:

T12:28:04.869Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] Marking slave host-1208 as unreachable
T12:28:04.869Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::UnreachableCheck] Beginning ICMP pings every 1000000 microseconds to host-1208
T12:28:04.870Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] Reporting Slave host-1208 as FDMUnreachable
T12:28:04.871Z [48D85B90 info 'Invt' opID=SWI-d42459e8] [InventoryManagerImpl::ProcessHostChanges] Slave state of host-1208 changed to FDMUnreachable
T12:28:04.871Z [48D85B90 info 'Invt' opID=SWI-d42459e8] [HostStateChange::SaveToInventory] host host-1208 changed state: FDMUnreachable
T12:28:04.871Z [48D85B90 verbose 'PropertyProvider' opID=SWI-d42459e8] RecordOp ASSIGN: slave["host-1208"], fdmService
T12:28:09.883Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::UnreachableCheck] Waited 5 seconds for icmp ping reply for host host-1208

If a master host is unable to communicate directly with the agent on a slave host, the slave host does not respond to ICMP pings, and the agent is not issuing heartbeats it is considered to have failed. You see output similar to:

T12:28:13.893Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::PartitionCheck] Waited 15 seconds for disk heartbeat for host host-1208 - declaring dead
T12:28:13.894Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] Reporting Slave host-1208 as Dead
T12:28:13.894Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [InventoryManagerImpl::ProcessHostChanges] Slave state of host-1208 changed to Dead
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [VmStateChange::SavePowerChange] vm /vmfs/volumes/Datastore_UUID/vm001/vm001.vmx curPwrState=powered on curPowerOnCount=1 newPwrState=unknown clnPwrOff=false hostReporting=host-1208
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [InventoryManagerImpl::RemoveVmLocked] vm /vmfs/volumes/Datastore_UUID/vm001/vm001.vmx (protected) removed from host host-1208; on 0 hosts
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [VmStateChange::SavePowerChange] vm /vmfs/volumes/Datastore_UUID/vm002/vm002.vmx curPwrState=powered on curPowerOnCount=1 newPwrState=unknown clnPwrOff=false hostReporting=host-1208
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [InventoryManagerImpl::RemoveVmLocked] vm /vmfs/volumes/Datastore_UUID/vm002/vm002.vmx (protected) removed from host host-1208; on 0 hosts
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [HostStateChange::SaveToInventory] host host-1208 changed state: Dead

Review the fdm.log file on the slave host.
- The host network isolation occurs when a host is still running, but it can no longer contact the master on the management network. If the host is isolated, it will elect itself master, and then it will ping the isolation addresses as well as check for election traffic. If the pings fail and the master has no slaves, then the host will declare itself isolated. You see output similar to:
  
  Note: In this example, the HA service runs until 12:24:19 without any error message, election process, or ping process. The service starts at 12:35:02.

Additional Information

Location of vCenter Server log files

Feedback

thumb_up Yes

thumb_down No