vSphere HA Configuration fails with "HA Agent Unreachable" or "Operation Timed out" in vCenter Server

Article ID: 334269

Products

VMware vCenter Server, VMware vSphere ESXi

Issue/Introduction

  • Configuring vSphere HA fails with the error "Operation Timed out" or "HA Agent Unreachable".
  • The configuration completes on the HA primary ESXi host but fails on all HA secondary ESXi hosts.
  • The /var/log/fdm.log file on the ESXi host contains entries similar to:

    Example 1:
    YYYY-MM-DDTHH:MM:SS.341Z Db(167) Fdm[2676185]: [Originator@6876 sub=Cluster opID=WorkQueue-308315c0] IP xx.xx.xx.xx marked bad for reason Unreachable IP
    YYYY-MM-DDTHH:MM:SS.341Z In(166) Fdm[2676185]: [Originator@6876 sub=Message opID=WorkQueue-308315c0] Destroying connection
    YYYY-MM-DDTHH:MM:SS.341Z Wa(164) Fdm[2676185]: [Originator@6876 sub=VpxProfiler opID=WorkQueue-308315c0] WorkQueue [TotalTime] took 40014 ms
    YYYY-MM-DDTHH:MM:SS.437Z Er(163) Fdm[2676185]: [Originator@6876 sub=Default] SSL Async Handshake Timeout : Read timeout after approximately 25000ms. Closing stream SSL(<io_obj p:0x000000d48c8eb8e0, h:29, <TCP 'xx.xx.xx.xx : 51525'>, <TCP 'xx.xx.xx.xx : 8182'>>)
    YYYY-MM-DDTHH:MM:SS.438Z Wa(164) Fdm[2676192]: [Originator@6876 sub=IO.Connection opID=WorkQueue-308315c0] Failed to SSL handshake; SSL(<io_obj p:0x000000d48c8eb8e0, h:-1, <TCP 'xx.xx.xx.xx : 51525'>, <TCP 'xx.xx.xx.xx : 8182'>>), e: 125(Operation canceled), duration: 24102msec
    YYYY-MM-DDTHH:MM:SS.879Z Db(167) Fdm[2676323]: [Originator@6876 sub=HTTP.HTTPService] HTTP Response: Auto-completing at 118/118 bytes; <<io_obj p:xxxxxxxxxxxxxxxxxxx, h:26, <TCP '127.0.0.1 : 9089'>, <TCP '127.0.0.1 : 42794'>>, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
    YYYY-MM-DDTHH:MM:SS.879Z Db(167) Fdm[2676323]: [Originator@6876 sub=SOAP] Responded to service state request; <<io_obj p:xxxxxxxxxxxxxxxxxxx, h:26, <TCP '127.0.0.1 : 9089'>, <TCP '127.0.0.1 : 42794'>>, /fdm/service>
    YYYY-MM-DDTHH:MM:SS.882Z Er(163) Fdm[2676185]: [Originator@6876 sub=Vmomi opID=m0crhe7k-38536-auto-tqh-h5:70007578-a-DasRetryMgrPeriodic-74700c55-342fe3c9-cb] Caught exception while sending activation result; <<xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, <TCP '127.0.0.1 : 9089'>, <TCP '127.0.0.1 : 41157'>>, fdmService, csi.FdmService.GetDebugManager, <csi.version.version1, official, 1.0>, <<io_obj p:xxxxxxxxxxxxxxxxxxx, h:26, <TCP '127.0.0.1 : 9089'>, <TCP '127.0.0.1 : 42794'>>, xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>>, N5Vmomi5Fault11SystemError9ExceptionE(Fault cause: vmodl.fault.SystemError


    Example 2:
    [40340B70 error 'Message' opID=SWI-28480d93] [MsgConnectionImpl::FinishSSLConnect] Error N7Vmacore3Ssl18SSLVerifyExceptionE(SSL Exception: Verification parameters:
    --> PeerThumbprint: xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
    --> ExpectedThumbprint: xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx
    --> ExpectedPeerName: host-xxxxx
    --> The remote host certificate has these problems:
    -->
    --> * Host name does not match the subject name(s) in certificate.
    -->
    --> * unable to get local issuer certificate) on handshake
    [40340B70 warning 'Election' opID=SWI-28480d93] [MasterVerificationInfo::ConnectComplete] Failed to connect to master host-xxxxx
    [40340B70 verbose 'Election' opID=SWI-28480d93] [ClusterElection::AddInvalidMaster] Added invalid master host-xxxxx
    [40340B70 warning 'Election' opID=SWI-28480d93] [ClusterElection::UpdateInvalidMasterCountMap] Host host-xxxxx has been declared invalid 9 times
    [40340B70 info 'Message' opID=SWI-28480d93] Destroying connection
    [FFF45B70 verbose 'Cluster' opID=SWI-6058ed8] [ClusterManagerImpl::IsBadIP] xx.xx.xx.xx is bad ip
    [FFF45B70 verbose 'Cluster' opID=SWI-6058ed8] [ClusterManagerImpl::IsBadIP] xx.xx.xx.xx is bad ip
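
To check quickly whether a host's log contains entries like the ones above, the quoted messages can be searched for directly. This is a minimal sketch, assuming the log is at /var/log/fdm.log as stated above. The helper name fdm_ha_symptoms is illustrative and not a VMware tool, and the pattern list covers only strings that appear in the two examples.

```shell
# Count fdm.log lines that match messages quoted in the examples above.
# fdm_ha_symptoms is an illustrative helper name, not a VMware tool.
fdm_ha_symptoms() {
    log="${1:-/var/log/fdm.log}"
    grep -c -E 'marked bad for reason Unreachable IP|SSL Async Handshake Timeout|Failed to SSL handshake|SSLVerifyException|is bad ip' "$log"
}

# Usage on the ESXi host (SSH or ESXi Shell):
#   fdm_ha_symptoms                    # uses the default log path
#   fdm_ha_symptoms /var/log/fdm.log   # explicit path
```

A count of zero does not rule the issue out, because log wording can differ between ESXi versions, as the two example formats show.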

Cause

This issue can occur in any of the following situations:

  • Jumbo frames are enabled on the ESXi host management network (the VMkernel port used for ESXi host management), but the physical network switch is misconfigured, which prevents the ESXi hosts from communicating using jumbo frames.
  • Network throughput between the ESXi hosts and vCenter Server is insufficient, so vCenter Server cannot push the FDM agent to the ESXi host over the network in a timely manner.
  • MTU settings differ among the ESXi hosts.
  • The subnet mask is incorrect on the ESXi hosts.
  • The FQDN of an ESXi host does not match its DNS record.
  • The certificate of an ESXi host has changed.
  • The root password of an ESXi host has been changed.

Resolution

Follow the steps for the scenario that applies.

Scenario 1: If the MTU does not match across the hosts, follow the steps below:

Note: First record the MTU settings configured on the physical network devices.

  1. In the vSphere Client, select the host in the cluster and then click the Configuration tab.
  2. Click Networking.
  3. Click Properties for Management Network on vSwitch0.
  4. Under the Ports tab, click vSwitch and then click Edit.
    • Check the current MTU value (9000 or 1500).
    • Whichever value is configured, confirm that the same value is configured on the physical network devices.
  5. Ensure that the MTU value matches the value configured on the network devices (9000 if jumbo frames are used end to end, otherwise 1500) and then click OK.
  6. Repeat this process for the management network: under the Ports tab, click Management Network and then click Edit.
  7. Ensure that the MTU value matches the value set on the vSwitch and then click OK.
  8. Click OK.
  9. Right-click the ESXi host and click Reconfigure for vSphere HA.
  10. Repeat the above steps on all the ESXi hosts in the cluster.
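
The MTU along the path between hosts can also be checked from the ESXi Shell. The sketch below assumes the management VMkernel interface is vmk0 and takes the peer address as an argument. The helper names mtu_ping_payload and check_mtu_path are illustrative. The payload size is the MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header), and vmkping -d sets the don't-fragment bit, so a hop with a smaller MTU makes the ping fail instead of fragmenting the packet.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
mtu_ping_payload() {
    echo $(( $1 - 28 ))
}

# Ping a peer host's management IP with the don't-fragment bit set.
# Arguments: MTU, peer IP, VMkernel interface (vmk0 is assumed if omitted).
check_mtu_path() {
    vmkping -I "${3:-vmk0}" -d -s "$(mtu_ping_payload "$1")" "$2"
}

# Usage on the ESXi host:
#   esxcli network vswitch standard list   # shows the MTU of each standard vSwitch
#   esxcli network ip interface list       # shows the MTU of each VMkernel interface
#   check_mtu_path 9000 <peer-management-ip>
```

If the ping succeeds with a 1500-byte MTU payload but fails with the 9000-byte one, a device on the path is likely not passing jumbo frames.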

 

Scenario 2: If the subnet mask configured on the host differs from the actual subnet mask, correct it by following the steps below:

  1. Log in to the DCUI console of the host using the root credentials.
  2. Go to Configure Management Network.
  3. Select IPv4 Configuration and enter the correct subnet mask.
  4. Reboot the host.
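
Before editing the mask in the DCUI, the current value can be read from the ESXi Shell with esxcli network ip interface ipv4 get, which lists the IPv4 address and netmask of each VMkernel interface. The sketch below adds an illustrative helper, same_subnet, that checks whether two addresses fall in the same network under a given mask. The helper names and the sample addresses are assumptions for illustration only.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeeds if both addresses are in the same network under the given mask.
# Arguments: first IP, second IP, subnet mask (all dotted-quad).
same_subnet() {
    a=$(ip_to_int "$1"); b=$(ip_to_int "$2"); m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Usage on the ESXi host:
#   esxcli network ip interface ipv4 get   # current address and netmask per vmk interface
#   same_subnet 192.168.10.5 192.168.10.200 255.255.255.0 && echo "same subnet"
```

Two management addresses that are expected to share a subnet but fail this check under the configured mask point to the wrong mask on one of the hosts.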

 

Scenario 3: If the FQDN of the host does not match the actual DNS records, correct it by following the steps below:

  1. Log in to the DCUI console of the host using the root credentials.
  2. Go to Configure Management Network.
  3. Select DNS Configuration and enter the correct FQDN in the Hostname field.
  4. Reboot the host.
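
The configured name can be compared with DNS from the ESXi Shell. This is a hedged sketch: esxcli system hostname get prints the host name and FQDN configured on the host, and nslookup shows what DNS returns. The helpers normalize_fqdn and fqdn_equal are illustrative names. They compare two names while ignoring case and a trailing dot, which DNS tools sometimes append.

```shell
# Normalize a DNS name: lower-case it and drop one trailing dot.
normalize_fqdn() {
    printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/\.$//'
}

# Succeeds if the two names are the same after normalization.
fqdn_equal() {
    [ "$(normalize_fqdn "$1")" = "$(normalize_fqdn "$2")" ]
}

# Usage on the ESXi host:
#   esxcli system hostname get    # host name and FQDN configured on the host
#   nslookup <host-fqdn>          # forward record: should return the management IP
#   nslookup <management-ip>      # reverse record: should return the same FQDN
#   fqdn_equal "<configured-fqdn>" "<fqdn-from-dns>" || echo "FQDN mismatch"
```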

Scenario 4:
a) If the error "Host name does not match the subject name(s) in certificate" is reported and the ESXi host uses self-signed certificates, regenerate the self-signed certificates by following the steps below:

  1. Place the ESXi host in maintenance mode.
  2. Log in to the ESXi host via SSH.
  3. Rename rui.crt and rui.key, located in /etc/vmware/ssl, by running the commands below:
    mv /etc/vmware/ssl/rui.crt /etc/vmware/ssl/rui.crt.old
    mv /etc/vmware/ssl/rui.key /etc/vmware/ssl/rui.key.old
  4. Run this command to regenerate the certificates:
    /sbin/generate-certificates
  5. Restart the management agents by running the command below:
    services.sh restart
  6. Exit the ESXi host from maintenance mode.
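
After regeneration, the new certificate can be inspected to confirm that its subject carries the host's name. The sketch below assumes the openssl binary available in the ESXi Shell and the certificate path /etc/vmware/ssl/rui.crt used above. The helper names are illustrative. The 20-byte thumbprints shown in Example 2 have the length of a SHA-1 fingerprint, so the sketch prints that form for comparison with the log.

```shell
# Extract the colon-separated thumbprint from an
# "openssl x509 -fingerprint" output line such as "SHA1 Fingerprint=AA:BB:...",
# upper-casing the hex digits so two values can be compared directly.
thumbprint_from_line() {
    printf '%s\n' "$1" | sed 's/^.*=//' | tr 'a-f' 'A-F'
}

# Print the subject, validity dates, and SHA-1 thumbprint of the host certificate.
# Argument: certificate path (defaults to the ESXi host certificate).
show_host_cert() {
    openssl x509 -in "${1:-/etc/vmware/ssl/rui.crt}" -noout -subject -dates -fingerprint -sha1
}

# Usage on the ESXi host:
#   show_host_cert
#   thumbprint_from_line "<fingerprint line printed by show_host_cert>"
```

If the subject still does not contain the host's FQDN, the host name configuration is worth rechecking before the certificate is regenerated again.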

b) If the error "Host name does not match the subject name(s) in certificate" is reported and the ESXi host uses custom certificates, add the custom certificate to the ESXi host by following the KB article "Adding Custom Certificate on ESXi hosts through CLI".



Scenario 5: If the root password of the ESXi host has been changed:

  • Disconnect and reconnect the ESXi host in the cluster.