To determine if your vSphere HA cluster has experienced a host failure, perform these steps:
Review vCenter Server events
To review vCenter Server events:
- In vSphere Client, click the Tasks & Events tab.
- Click Events.
- Search for events with vSphere HA in the description.
- In the event of a host failure, you see messages similar to:
vSphere HA detected a host failure
vSphere HA detected a possible host failure of this host
Review vCenter Server logs
To review vCenter Server logs:
- On the vCenter Server, navigate to the vpxd-*.log file. For more information, see Location of vCenter Server log files (1021804).
- Search the vpxd-*.log file for the string FDM state.
Note: In this example, three hosts, host-1208, host-1214 and host-409 failed at the same time (14:28). The master of the cluster is host-406.
You see output similar to:
T14:28:01.491+02:00 [145400 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)
T14:28:05.126+02:00 [143416 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)
T14:28:05.640+02:00 [143416 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)
T14:28:10.320+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)
T14:28:10.898+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)
T14:28:10.913+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)
Note: hostIds differ from hostnames you specify in your environment. For more information on mapping between the hostname and the hostId see How to determine the mapping between hostname and hostId in a VMware HA cluster (2037000).
- When the outage is resolved, the hosts start up. You see output similar to:
T14:34:31.039+02:00 [141936 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (Dead -> FDMUnreachable), src of state (host-406 -> host-406)
T14:35:12.332+02:00 [144148 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1214 (initialized -> initialized), FDM state (FDMUnreachable -> Live), src of state (host-406 -> host-406)
T14:34:40.056+02:00 [143832 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (Dead -> FDMUnreachable), src of state (host-406 -> host-406)
T14:35:20.772+02:00 [140480 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-1208 (initialized -> initialized), FDM state (FDMUnreachable -> Live), src of state (host-406 -> host-406)
T14:35:20.351+02:00 [143476 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (Dead -> FDMUnreachable), src of state (host-406 -> host-406)
T14:36:16.417+02:00 [135096 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (FDMUnreachable -> Uninitialized), src of state (host-406 -> host-409)
T14:36:19.038+02:00 [135096 info 'Default'] [VpxdMoHost::UpdateDasState] VC state for host host-409 (initialized -> initialized), FDM state (Uninitialized -> Live), src of state (host-409 -> host-406)
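The search for FDM state transitions can also be scripted. The following is a minimal sketch, not a VMware tool: the regular expression is inferred from the sample lines in this article and is an assumption about the log format, and the function names are hypothetical. It extracts each transition and lists the hosts that the master reported as Dead.

```python
import re

# Pattern inferred from the UpdateDasState sample lines in this article;
# it is an assumption, not an official vpxd log grammar.
FDM_RE = re.compile(
    r"T(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\S*\s.*?"
    r"VC state for host (?P<host>\S+) .*?"
    r"FDM state \((?P<old>\w+) -> (?P<new>\w+)\)"
)


def fdm_transitions(lines):
    """Yield (time, host, old_state, new_state) for each FDM state change."""
    for line in lines:
        m = FDM_RE.search(line)
        if m:
            yield m["time"], m["host"], m["old"], m["new"]


def hosts_reported_dead(lines):
    """Return the hosts reported as Dead, in order of first report."""
    dead = []
    for _, host, _, new_state in fdm_transitions(lines):
        if new_state == "Dead" and host not in dead:
            dead.append(host)
    return dead


# Two lines copied from the sample output in this article.
sample_vpxd = [
    "T14:28:01.491+02:00 [145400 info 'Default'] [VpxdMoHost::UpdateDasState] "
    "VC state for host host-1208 (initialized -> initialized), "
    "FDM state (Live -> FDMUnreachable), src of state (host-406 -> host-406)",
    "T14:28:10.320+02:00 [143356 info 'Default'] [VpxdMoHost::UpdateDasState] "
    "VC state for host host-1208 (initialized -> initialized), "
    "FDM state (FDMUnreachable -> Dead), src of state (host-406 -> host-406)",
]

for event in fdm_transitions(sample_vpxd):
    print(event)
print(hosts_reported_dead(sample_vpxd))
```

To run it against a real log, pass an open file object for the vpxd-*.log file instead of the sample list.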
Review vmksummary log files on the hosts
Check the vmksummary.log files for information on the type of outage, for example, a power outage or a purple diagnostic screen failure. The vmksummary.log file contains bootstop messages indicating start up and shutdown of the ESXi host. For more information, see Format of the ESXi 5.0 vmksummary log file (2004566).
To review vmksummary logs:
- Log in to the three ESXi hosts as the root user.
- Navigate to the vmksummary.log file (located in /var/log/) on the three hosts.
Note: When a host shuts down unexpectedly, you do not see log information to indicate the shutdown. The logs indicate a start up only. In this example, all hosts start at the same time (12:35).
You see output similar to:
Host 1
T12:35:00Z bootstop: Host has booted
Host 2
T12:35:03Z bootstop: Host has booted
Host 3
T12:35:52Z bootstop: Host has booted
Note: In the case of a purple diagnostic screen failure, you see the message:
bootstop: Host has booted
bootstop: Core dump found
Note: Host log file time stamps may not be identical to vCenter Server log time stamps due to time zone or syncing settings.
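The bootstop check can be sketched in a few lines of Python. This is an illustration only: it assumes, as in the examples in this article, that a Core dump found message directly follows the Host has booted message it belongs to, and the function name is hypothetical.

```python
def classify_boots(lines):
    """Return one record per 'Host has booted' entry.

    Assumption: a 'Core dump found' line belongs to the most recent
    boot entry that precedes it, as in the examples in this article.
    """
    boots = []
    for line in lines:
        if "bootstop: Host has booted" in line:
            boots.append({"entry": line.strip(), "core_dump": False})
        elif "bootstop: Core dump found" in line and boots:
            boots[-1]["core_dump"] = True
    return boots


# Lines copied from the examples in this article.
sample_vmksummary = [
    "T12:35:00Z bootstop: Host has booted",
    "bootstop: Host has booted",
    "bootstop: Core dump found",
]

for boot in classify_boots(sample_vmksummary):
    kind = "purple diagnostic screen" if boot["core_dump"] else "no core dump found"
    print(boot["entry"], "->", kind)
```

A boot entry without a core dump does not by itself prove a power outage; it only shows that the host started without finding a core dump.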
Review FDM log files on the master and slave hosts
When vCenter Server logs are not available, you can review the fdm.log files on the affected master and slave hosts.
To review fdm.log files on the affected master and slave hosts:
Note: You can check the vSphere HA host configuration in vSphere Client. Click to select the vSphere HA cluster, then click the Host tab. The vSphere HA State column indicates if the host is a master or a slave.
- Log in to the master and slave hosts as the root user.
- Navigate to the fdm.log files (located at /var/log/).
- First, review the fdm.log file on the master host.
- The master host monitors the state of the slave hosts in the cluster. This communication is done through the exchange of network heartbeats every second. As soon as the slave has failed, these heartbeats will no longer be received by the master. You see output similar to:
T12:27:56.848Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:27:56.849Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1214
T12:27:56.849Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-409
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
- When the master host stops receiving network heartbeats from a slave host, it checks for the datastore heartbeat of the host, before declaring that the host has failed. The master host performs the heartbeat check to determine whether or not the slave host is exchanging heartbeats with datastores. You see output similar to:
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] path /vmfs/volumes/5028de7e-36c10eaa-7037-0017a4770000 slave host-1208
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] slave host-1208 uuid.mac xx:xx:xx:xx:xx:xx
T12:27:57.851Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] Forcing heartbeat check on datastore /vmfs/volumes/5028de7e-36c10eaa-7037-0017a4770000 for slave host-1208
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] path /vmfs/volumes/4fbf41e5-da3ba7c6-993c-0017a4770460 slave host-1208
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] slave host-1208 uuid.mac xx:xx:xx:xx:xx:xx
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterDatastore::StartHBDatastoreChecking] Forcing heartbeat check on datastore /vmfs/volumes/4fbf41e5-da3ba7c6-993c-0017a4770460 for slave host-1208
T12:27:57.852Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::StartCheckingDatastoreHeartbeats] Starting datastore heartbeat checking for slave host-1208
T12:27:58.856Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:27:59.857Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:00.859Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:01.860Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:02.863Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:03.866Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Heartbeat still pending for slave @ host-1208
T12:28:04.868Z [48F4CB90 error 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::LiveCheck] Timeout for slave @ host-1208
- The master host also checks whether the host responds to ICMP pings sent to its management IP addresses. You see output similar to:
T12:28:04.869Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] Marking slave host-1208 as unreachable
T12:28:04.869Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::UnreachableCheck] Beginning ICMP pings every 1000000 microseconds to host-1208
T12:28:04.870Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] Reporting Slave host-1208 as FDMUnreachable
T12:28:04.871Z [48D85B90 info 'Invt' opID=SWI-d42459e8] [InventoryManagerImpl::ProcessHostChanges] Slave state of host-1208 changed to FDMUnreachable
T12:28:04.871Z [48D85B90 info 'Invt' opID=SWI-d42459e8] [HostStateChange::SaveToInventory] host host-1208 changed state: FDMUnreachable
T12:28:04.871Z [48D85B90 verbose 'PropertyProvider' opID=SWI-d42459e8] RecordOp ASSIGN: slave["host-1208"], fdmService
T12:28:09.883Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::UnreachableCheck] Waited 5 seconds for icmp ping reply for host host-1208
- If the master host cannot communicate directly with the agent on a slave host, the slave host does not respond to ICMP pings, and the agent is not issuing heartbeats, the slave host is considered to have failed. You see output similar to:
T12:28:13.893Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] [ClusterSlave::PartitionCheck] Waited 15 seconds for disk heartbeat for host host-1208 - declaring dead
T12:28:13.894Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] Reporting Slave host-1208 as Dead
T12:28:13.894Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [InventoryManagerImpl::ProcessHostChanges] Slave state of host-1208 changed to Dead
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [VmStateChange::SavePowerChange] vm /vmfs/volumes/4fb0deb5-6496b476-6a16-0017a4770440/vm001/vm001.vmx curPwrState=powered on curPowerOnCount=1 newPwrState=unknown clnPwrOff=false hostReporting=host-1208
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [InventoryManagerImpl::RemoveVmLocked] vm /vmfs/volumes/4fb0deb5-6496b476-6a16-0017a4770440/vm001/vm001.vmx (protected) removed from host host-1208; on 0 hosts
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [VmStateChange::SavePowerChange] vm /vmfs/volumes/5028de7e-36c10eaa-7037-0017a4770000/vm002/vm002.vmx curPwrState=powered on curPowerOnCount=1 newPwrState=unknown clnPwrOff=false hostReporting=host-1208
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [InventoryManagerImpl::RemoveVmLocked] vm /vmfs/volumes/5028de7e-36c10eaa-7037-0017a4770000/vm002/vm002.vmx (protected) removed from host host-1208; on 0 hosts
T12:28:13.895Z [48CC2B90 info 'Invt' opID=SWI-2c257561] [HostStateChange::SaveToInventory] host host-1208 changed state: Dead
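The sequence on the master (heartbeat timeout, FDMUnreachable, Dead) can be condensed into a per-host timeline. The sketch below is an assumption-laden illustration: the two message patterns are taken from the sample lines in this article, the function names are hypothetical, and times are handled as time of day only, so it assumes all lines fall on the same day.

```python
import re

TIME_RE = re.compile(r"T(\d{2}):(\d{2}):(\d{2})\.(\d{3})")
# Message patterns inferred from the sample lines in this article.
TIMEOUT_RE = re.compile(r"Timeout for slave @ (\S+)")
REPORT_RE = re.compile(r"Reporting Slave (\S+) as (\w+)")


def ms_of_day(line):
    """Return the line's time stamp as milliseconds since midnight, or None."""
    m = TIME_RE.search(line)
    if m is None:
        return None
    h, mi, s, ms = map(int, m.groups())
    return ((h * 60 + mi) * 60 + s) * 1000 + ms


def failure_timeline(lines, host):
    """Return [(event, ms_of_day), ...] for one slave host, in log order."""
    timeline = []
    for line in lines:
        stamp = ms_of_day(line)
        if stamp is None:
            continue
        m = TIMEOUT_RE.search(line)
        if m and m[1] == host:
            timeline.append(("Timeout", stamp))
            continue
        m = REPORT_RE.search(line)
        if m and m[1] == host:
            timeline.append((m[2], stamp))
    return timeline


# Lines copied from the sample output in this article.
sample_master = [
    "T12:28:04.868Z [48F4CB90 error 'Cluster' opID=SWI-2bc801f9] "
    "[ClusterSlave::LiveCheck] Timeout for slave @ host-1208",
    "T12:28:04.870Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] "
    "Reporting Slave host-1208 as FDMUnreachable",
    "T12:28:13.894Z [48F4CB90 verbose 'Cluster' opID=SWI-2bc801f9] "
    "Reporting Slave host-1208 as Dead",
]

timeline = failure_timeline(sample_master, "host-1208")
elapsed_ms = timeline[-1][1] - timeline[0][1]
print(timeline)
print("Timeout to Dead:", elapsed_ms, "ms")
```

For the sample lines, the time between the heartbeat timeout and the Dead declaration is about nine seconds.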
- Review the fdm.log file on the slave host.
- Host network isolation occurs when a host is still running but can no longer contact the master on the management network. A host that loses contact with the master elects itself master, then pings the isolation addresses and checks for election traffic. If the pings fail and the host, as master, has no slaves, it declares itself isolated. You see output similar to:
Note: In this example, the HA service runs until 12:24:19 without any error message, election process, or ping process. The service starts at 12:35:02.
T12:24:18.640Z [FFAA8B90 verbose 'Cluster'] [ClusterManagerImpl::NewHostCompatList] version 8960
T12:24:18.640Z [FFAA8B90 verbose 'Cluster'] [ClusterManagerImpl::Uncompress] Uncompressed from size 827 to size 4676
T12:24:18.641Z [FFAA8B90 verbose 'Cluster'] [ClusterManagerImpl::UpdatePersistentObject] name compatlist version (8960 ?> 8959) force false
T12:24:18.641Z [FFB6BB90 verbose 'Invt' opID=SWI-5681c315] [InventoryManagerImpl::ProcessHostCompatList] processing compat list version:8960
T12:24:18.641Z [FFA26B90 info 'Cluster' opID=SWI-5252d79c] [ClusterManagerImpl::StoreDone] Wrote host-compatabilty-list version 8960
T12:24:18.641Z [FFB6BB90 info 'Placement' opID=SWI-5681c315] [PlacementManagerImpl::CompatListListener::Handle] Get a new compat list
T12:24:19.639Z [FFCF1B90 verbose 'Election' opID=SWI-568b3a88] CheckVersion: Version[2] Other host GT : 8960 > 8959
T12:24:19.639Z [FFCF1B90 verbose 'Election' opID=SWI-568b3a88] CheckVersion: Pending version change 8960 >= 8960
T12:35:02.299Z [FFD18400 info 'Default'] Initialized channel manager
T12:35:02.323Z [FFD18400 info 'Default'] Current working directory: /vmfs/volumes/4f3282b8-f709df55-64d6-0017a4770442/log
T12:35:02.323Z [FFD18400 verbose 'ThreadPool'] Thread info: Min Io, Max Io, Min Task, Max Task, Max Thread, Keepalive, exit idle, idle secs, max fds: 2, 9, 2, 4, 13, 4, false, 600
T12:35:02.324Z [FFD18400 info 'Default'] Log path: /var/log/vmware/fdm
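A silent gap such as the one between 12:24:19 and 12:35:02 in this example can be located by comparing consecutive time stamps. The following sketch is one possible approach, not a VMware utility: the function name and the 60-second threshold are hypothetical, and because it compares time of day only, it assumes the log lines do not cross midnight.

```python
import re

TIME_RE = re.compile(r"T(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def find_log_gaps(lines, threshold_ms=60_000):
    """Return (previous_time, next_time, gap_ms) for each silent period.

    threshold_ms is a hypothetical cut-off; choose a value that suits
    how verbose the log normally is.
    """
    gaps = []
    previous = None  # (time string, milliseconds since midnight)
    for line in lines:
        m = TIME_RE.search(line)
        if m is None:
            continue
        h, mi, s, ms = map(int, m.groups())
        stamp = ((h * 60 + mi) * 60 + s) * 1000 + ms
        text = m.group(0)[1:]  # drop the leading "T"
        if previous is not None and stamp - previous[1] >= threshold_ms:
            gaps.append((previous[0], text, stamp - previous[1]))
        previous = (text, stamp)
    return gaps


# Lines copied from the sample output in this article.
sample_slave = [
    "T12:24:19.639Z [FFCF1B90 verbose 'Election' opID=SWI-568b3a88] "
    "CheckVersion: Pending version change 8960 >= 8960",
    "T12:35:02.299Z [FFD18400 info 'Default'] Initialized channel manager",
    "T12:35:02.324Z [FFD18400 info 'Default'] Log path: /var/log/vmware/fdm",
]

for start, end, gap_ms in find_log_gaps(sample_slave):
    print(f"No log entries between {start} and {end} ({gap_ms / 1000:.3f} s)")
```

A gap that ends with service start-up messages and has no election or ping messages before it is consistent with an abrupt host failure rather than network isolation.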