Troubleshooting VMware High Availability (HA) issues in VMware vCenter Server

Article ID: 318936

Products

VMware vCenter Server VMware vSphere ESXi

Issue/Introduction

This article discusses troubleshooting VMware High Availability (HA) issues in VMware vCenter Server.

  • HA agent state failure
  • Enabling VMware HA fails
  • After upgrading the VMware vCenter Server, VMware High Availability (HA) is no longer working.

Symptoms:

  • An error similar to the following is displayed in the vSphere Client:
    Operation Timed out

Environment

  • vCenter Server 8.0.x
  • vCenter Server 7.0.x
  • vCenter Server 6.x

Cause

vSphere HA (known internally as Fault Domain Manager, or FDM) provides high availability for virtual machines by pooling the virtual machines and the hosts they reside on into a cluster. Hosts in the cluster are monitored, and in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts.

When a vSphere HA cluster is created, a single host is automatically elected as the primary host. The primary host communicates with vCenter Server and monitors the state of all protected virtual machines and of the secondary hosts. Different types of host failures are possible, and the primary host must detect and appropriately deal with the failure. The primary host must distinguish between a failed host and one that is in a network partition or that has become network isolated. The primary host uses network and datastore heartbeating to determine the type of failure.
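
The distinction described above can be illustrated with a small Python sketch. It is a deliberately simplified model, not FDM's actual logic: the `HostState` enum, the `classify_host` function, and its two boolean inputs are hypothetical names introduced here, and the real agent considers more signals than these two.

```python
from enum import Enum


class HostState(Enum):
    """Hypothetical labels for how a primary host might view a secondary host."""
    LIVE = "live"
    PARTITIONED_OR_ISOLATED = "partitioned or isolated"
    FAILED = "failed"


def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> HostState:
    """Simplified model (assumption): combine the two heartbeat types.

    - Network heartbeats arriving: the host is reachable and live.
    - No network heartbeats, but datastore heartbeats continue: the host is
      still running and is treated as partitioned or network isolated.
    - Neither heartbeat type: the host is treated as failed.
    """
    if network_heartbeat:
        return HostState.LIVE
    if datastore_heartbeat:
        return HostState.PARTITIONED_OR_ISOLATED
    return HostState.FAILED
```

In this model, a host that stops sending network heartbeats is declared failed only if its datastore heartbeats have also stopped. This is why datastore heartbeat problems matter when troubleshooting HA.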

Resolution

To troubleshoot HA, review the following items:

  • The FDM log on the host:
    • less /var/run/log/fdm.log
  • Datastore heartbeat misses, which can indicate network or storage connectivity issues.
  • HA requires TCP and UDP traffic on port 8182 to be open between all the hosts. The primary host uses this port to check the liveness of the other hosts. See vSphere HA Security.
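
As a rough aid for the port requirement above, the following Python sketch tests whether a TCP connection to port 8182 can be opened from the machine where it runs. The function name `tcp_port_open` and its parameters are hypothetical. It covers TCP only, because a plain connect attempt cannot confirm UDP reachability, and a successful connection shows only that something is listening on the port, not that the FDM agent is healthy.

```python
import socket


def tcp_port_open(host: str, port: int = 8182, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, calling `tcp_port_open("esxi-02.example.com")` (a placeholder host name) from another host in the cluster returns `False` if a firewall drops the traffic or nothing is listening on the port.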

For more information about HA in vCenter Server 7.x, see How vSphere HA Works.

Known Issues

Common Misconfiguration Issues

  • FDM configuration can fail if ESXi hosts are connected to network switches with automatic anti-DoS (denial-of-service) features.

  • FDM supports jumbo frames, but the MTU setting must be consistent end to end on every device in the network path.

  • Some firewall devices block ICMP pings that have an ID of zero. In such cases, FDM could report that some or all secondary hosts cannot ping each other, and/or that the isolation addresses cannot be reached.
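
The MTU consistency requirement above can be checked with a short Python sketch once the MTU of each device on the path has been collected by hand. The function name `mtu_mismatches` and the device names in the usage note are hypothetical. The sketch assumes that the most common MTU value is the intended one, which may not hold in every environment.

```python
from collections import Counter
from typing import Dict


def mtu_mismatches(path_mtus: Dict[str, int]) -> Dict[str, int]:
    """Return the devices whose MTU differs from the most common MTU on the path.

    Assumption: the most frequently configured value is the intended MTU.
    """
    if not path_mtus:
        return {}
    expected, _count = Counter(path_mtus.values()).most_common(1)[0]
    return {name: mtu for name, mtu in path_mtus.items() if mtu != expected}
```

For instance, `mtu_mismatches({"vmk0": 9000, "vSwitch0": 9000, "switch-port": 1500})` returns `{"switch-port": 1500}`, pointing at the device that breaks the end-to-end jumbo frame setting.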


Troubleshooting issues with FDM:

  1. Ensure HA is configured correctly. For more information, see How vSphere HA Works.
  2. Verify that network connectivity exists from the vCenter Server to the ESXi host. For more information, see Testing network connectivity with the ping command.
  3. Verify that the hosts are able to communicate using TCP and UDP over port 8182. To troubleshoot port 8182 connectivity between the hosts, see Troubleshooting network and TCP/UDP port connectivity issues on Hosts.
  4. Verify that the ESXi host is properly connected to vCenter Server. For more information, see Changing an ESXi or ESX host's connection status in vCenter Server.
  5. Verify that the datastores used for HA heartbeats are accessible by all hosts. For more information, see configure heartbeat datastores and HA error: "The number of heartbeat datastores for host is 1, which is less than required: 2".
  6. Verify that all the configuration files of the FDM agent were pushed successfully from the vCenter Server to the ESXi host:
     
    • Location on ESXi Host: /etc/opt/vmware/fdm
    • File Names: clusterconfig (cluster configuration), compatlist (host compatibility list for virtual machines), hostlist (host membership list), and fdm.cfg.
       
  7. Increase the verbosity of the FDM logs to get more information about the cause of the issue.
    SSH to the ESXi host and change the following entry in /etc/opt/vmware/fdm/fdm.cfg:
    <log>
    ...
    <level>verbose</level>
    ...
    </log>

    To:

    <log>
    ...
    <level>trivia</level>
    ...
    </log>
  8. Search the log files on the ESXi host for error messages:

    • /var/run/log/fdm.log (log file for FDM operations)
    • /var/run/log/fdm-installer.log (FDM agent installation log, for any FDM installation failures)
  9. Try deleting the FDM VIB and installing it manually. See vSphere HA agent cannot be installed or configured and Resolve third-party VIB preventing vSphere HA agent updates.
  10. Consult FDM's Managed Object Browser (MOB) at https://hostname/mobfdm for more information. The MOB can be used to dump debug information about FDM to the /var/log/vmware/fdm/fdmDump.log file. It can also provide key information about the status of FDM from the perspective of the local ESXi host, such as a list of protected virtual machines, secondary hosts, and events. For more information, see the Managed Object Browser section in the vSphere Web Services SDK Programming Guide.
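
Steps 6 and 8 above can be partly automated. The following Python sketch checks that the FDM configuration files named in step 6 are present and filters log lines for likely problems. The function names and the keyword list are assumptions made for illustration. The keywords are not an official list of FDM error strings and will probably need tuning against real log output.

```python
from pathlib import Path
from typing import Iterable, List

# File names and default directory taken from step 6 above.
FDM_CONFIG_FILES = ("clusterconfig", "compatlist", "hostlist", "fdm.cfg")

# Assumed keywords for a first-pass filter; not an official list.
ERROR_KEYWORDS = ("error", "fail", "timed out")


def missing_fdm_files(config_dir: str = "/etc/opt/vmware/fdm") -> List[str]:
    """Return the expected FDM configuration files that are absent from config_dir."""
    base = Path(config_dir)
    return [name for name in FDM_CONFIG_FILES if not (base / name).is_file()]


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain any assumed keyword (case-insensitive)."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(keyword in line.lower() for keyword in ERROR_KEYWORDS)
    ]
```

A possible use is to call `missing_fdm_files()` on the host, then pass an open handle to /var/run/log/fdm.log or /var/run/log/fdm-installer.log to `suspicious_lines` and review the matches by hand.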

If the issue persists, file a support request with VMware Support and quote this Knowledge Base article ID (318936) in the problem description.