Troubleshooting VMware High Availability (HA) issues in VMware vCenter Server

Article ID: 318936

Products

VMware vCenter Server VMware vSphere ESXi

Issue/Introduction

This article discusses troubleshooting VMware High Availability (HA) issues in VMware vCenter Server. 

HA agent state failure

  • Enabling VMware HA fails.
  • After upgrading vCenter Server, VMware High Availability (HA) no longer works.

Symptoms:

  • You see the error:
    Operation Timed out

Environment

  • vCenter Server 8.0.x
  • vCenter Server 7.0.x
  • vCenter Server 6.x
  • vCenter Server 5.0.x

Cause

vSphere HA provides high availability for virtual machines by pooling the virtual machines and the hosts they reside on into a cluster. Hosts in the cluster are monitored, and in the event of a failure, the virtual machines on a failed host are restarted on alternate hosts.

When you create a vSphere HA cluster, a single host is automatically elected as the primary host. The primary host communicates with vCenter Server and monitors the state of all protected virtual machines and of the secondary hosts. Different types of host failures are possible, and the primary host must detect and appropriately deal with the failure. The primary host must distinguish between a failed host and one that is in a network partition or that has become network isolated. The primary host uses network and datastore heartbeating to determine the type of failure.
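The distinction described above can be sketched as a small decision function. This is a simplified illustration, not FDM's actual logic: the `HostState` names, the `classify_host` function, and its three boolean inputs are hypothetical, and the sketch assumes that a secondary host with no network heartbeat and no datastore heartbeat is treated as failed, while one that still writes datastore heartbeats is either partitioned or isolated.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    FAILED = "failed"
    PARTITIONED = "partitioned"
    ISOLATED = "isolated"


def classify_host(network_heartbeat: bool,
                  datastore_heartbeat: bool,
                  reports_isolated: bool = False) -> HostState:
    """Simplified model of how a primary host might classify a secondary host.

    reports_isolated is a hypothetical flag meaning the secondary host has
    declared itself network isolated.
    """
    if network_heartbeat:
        # Network heartbeats are arriving, so the host is reachable.
        return HostState.LIVE
    if not datastore_heartbeat:
        # Silent on both channels: treat the host as failed.
        return HostState.FAILED
    # Still alive on storage, but unreachable over the management network.
    return HostState.ISOLATED if reports_isolated else HostState.PARTITIONED


if __name__ == "__main__":
    print(classify_host(False, False).value)
    print(classify_host(False, True).value)
```

In this model, only the failed state would lead to virtual machines being restarted on other hosts; the other states call for checking the management network first.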

Resolution

To troubleshoot HA, check the following:

  • Review the FDM log on the affected ESXi host:
    • SSH into the ESXi host
    • less /var/log/fdm.log
  • Check datastore heartbeats, which can indicate network or storage connectivity issues
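When the log is large, a short filter can narrow it down before reading it in full. The following is a minimal sketch, assuming a copy of fdm.log is available as plain text; the keyword list and the function names `find_suspect_lines` and `scan_file` are illustrative choices, not part of any VMware tool, and the keywords may need adjusting for your environment.

```python
import re
from typing import Iterable, List

# Assumption: lines worth a closer look contain one of these words.
# "timed out" is included because of the "Operation Timed out" symptom.
SUSPECT = re.compile(r"error|warning|fail|timed out|timeout", re.IGNORECASE)


def find_suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that match any suspect keyword, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]


def scan_file(path: str) -> List[str]:
    """Scan a log file on disk, replacing undecodable bytes instead of failing."""
    with open(path, "r", errors="replace") as handle:
        return find_suspect_lines(handle)
```

For example, `scan_file("fdm.log")` returns the matching lines from a local copy of the log, in their original order.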

For more information about HA in vCenter Server 7.x, see How vSphere HA Works.

Known Issues

Common Misconfiguration Issues

  • FDM configuration can fail if ESXi hosts are connected to switches with automatic anti-DoS features.

  • FDM supports jumbo frames, but the MTU setting must be consistent end to end on every device in the path.

  • Some firewall devices block ICMP pings that have an ID of zero. In such cases, FDM could report that some or all secondary hosts cannot ping each other, and/or that the isolation addresses cannot be reached.
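For the jumbo frame item above, one common way to test end-to-end MTU consistency is a ping with the don't-fragment bit set and a payload sized to fill the MTU exactly. The sketch below computes that payload size (MTU minus the 20-byte IPv4 header and the 8-byte ICMP header) and builds a `vmkping` command line to run on an ESXi host. The function names, the default interface `vmk0`, and the target address are assumptions for illustration only.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    overhead = IP_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU must be larger than the IP and ICMP headers")
    return mtu - overhead


def vmkping_command(target: str, mtu: int = 9000, interface: str = "vmk0") -> str:
    """Build a vmkping command line that tests the given MTU end to end."""
    size = icmp_payload_for_mtu(mtu)
    # -I selects the VMkernel interface, -d sets the don't-fragment bit,
    # -s sets the ICMP payload size in bytes.
    return f"vmkping -I {interface} -d -s {size} {target}"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; substitute a real peer host.
    print(vmkping_command("192.0.2.10"))
```

If the resulting ping fails while a smaller payload succeeds, some device in the path is likely using a smaller MTU.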


Troubleshooting issues with FDM:

  1. Ensure that you have properly configured HA. For more information, see How vSphere HA Works.
  2. Verify that network connectivity exists from the vCenter Server to the ESXi host. For more information, see Testing network connectivity with the ping command (315423).
  3. Verify that the ESXi Host is properly connected to vCenter Server. For more information, see Changing an ESXi or ESX host's connection status in vCenter Server (303652).
  4. Verify that the datastore used for HA heartbeats is accessible by all hosts.
  5. Verify that all the configuration files of the FDM agent were pushed successfully from the vCenter Server to your ESXi host:
     
    • Location: /etc/opt/vmware/fdm
    • File Names: clusterconfig (cluster configuration), compatlist (host compatibility list for virtual machines), hostlist (host membership list), and fdm.cfg.
       
  6. Increase the verbosity of the FDM logs to get more information about the cause of the issue. 
    Change the following entry in /etc/opt/vmware/fdm/fdm.cfg from:
    <log>
    ...
    <level>verbose</level>
    ...
    </log>

    To:

    <log>
    ...
    <level>trivia</level>
    ...
    </log>
  7. Search the log files for any error message:
     
    • /var/log/fdm.log or /var/run/log/fdm* (one log file for FDM operations)
    • /var/log/fdm-installer.log (FDM agent installation log)
       
  8. Consult FDM's Managed Object Browser (MOB) at https://hostname/mobfdm for more information. The MOB can dump debug information about FDM to the /var/log/vmware/fdm/fdmDump.log file. It can also provide key information about the status of FDM from the perspective of the local ESXi host, such as the list of protected virtual machines, secondary hosts, and events. For more information, see the Managed Object Browser section in the vSphere Web Services SDK Programming Guide.
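Steps 5 and 6 above can be partly automated. The following is a rough sketch, assuming the FDM configuration directory has been copied somewhere it can be inspected and that fdm.cfg parses as well-formed XML with a `<level>` element directly inside `<log>`, as the fragment in step 6 suggests. The function names are hypothetical and not part of any VMware tooling.

```python
import os
import xml.etree.ElementTree as ET
from typing import List, Optional

# File names listed in step 5.
EXPECTED_FILES = ("clusterconfig", "compatlist", "hostlist", "fdm.cfg")


def missing_config_files(directory: str) -> List[str]:
    """Return the expected FDM configuration files that are absent from directory."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(directory, name))]


def read_log_level(cfg_text: str) -> Optional[str]:
    """Return the text of <level> inside <log>, or None if it is not present."""
    root = ET.fromstring(cfg_text)
    # <log> may be the root element or nested somewhere below it.
    log = root if root.tag == "log" else root.find(".//log")
    if log is None:
        return None
    level = log.find("level")
    if level is None or not level.text:
        return None
    return level.text.strip()
```

For instance, `missing_config_files("/etc/opt/vmware/fdm")` returning an empty list suggests that all four files were pushed, and `read_log_level` can confirm whether the level is still `verbose` or already `trivia`.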
     

If the issue persists, file a support request with VMware Support and quote this Knowledge Base article ID (318936) in the problem description.