HA fails to enable with "vSphere HA agent for host <Hostname> has an error in <Cluster-name> in <Datacenter> : vSphere HA agent cannot be installed or configured" following a vCenter- or host update or upgrade

Article ID: 418377


Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

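  • Following an upgrade from vSphere 7.0.x to vSphere 8.0.x, vSphere HA fails to enable, and the vSphere Client shows the following error message for the ESXi hosts:

vSphere HA agent for host <Hostname> has an error in <Cluster-name> in <Datacenter> : vSphere HA agent cannot be installed or configured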

  • In addition, the following error is shown:

vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server

  • In the HA agent log on an affected host (/var/run/log/fdm.log), you see multiple SSL-related error entries, including 'Failed to SSL handshake' and 'no suitable key share (SSL routines)', similar to the example below:

# grep -i "Failed to SSL handshake" /var/run/log/fdm.log
YY-MM-DDThh:mm:ss.xxxZ Wa(164) Fdm[2100732]: [Originator@6876 sub=IO.Connection opID=WorkQueue-###] Failed to SSL handshake; SSL(<io_obj p:0x0000001c90649e80, h:10, <TCP '<other_host_IP> : 8182'>, <TCP '<host_IP> : 44528'>>), e: 478150758(cipher operation failed (Provider routines)), duration: 182msec
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100733]: [Originator@6876 sub=Message opID=WorkQueue-###] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:1C800066:Provider routines::cipher operation failed)
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100702]: --> [context]zKq7AVECAQAAAI48ewEKZmRtAIA8lIEBZmRtAIAJMWcBgFu5agGA5LtqAYCavWoBgN4fbAGAwFBsAYCL7YwBAVJ4AGxpYnB0aHJlYWQuc28uMAACP1IPbGliYy5zby42AA==[/context] creating ssl stream or doing handshake
YY-MM-DDThh:mm:ss.xxxZ Wa(164) Fdm[2100719]: [Originator@6876 sub=IO.Connection opID=WorkQueue-###] Failed to SSL handshake; SSL(<io_obj p:0x0000001c9064f530, h:10, <TCP '<other_host_IP> : 8182'>, <TCP '<host_IP> : 44544'>>), e: 167772441(decryption failed or bad record mac (SSL routines)), duration: 182msec
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100939]: [Originator@6876 sub=Message opID=WorkQueue-###] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:0A000119:SSL routines::decryption failed or bad record mac)
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100702]: --> [context]zKq7AVECAQAAAI48ewEKZmRtAIA8lIEBZmRtAIAJMWcBgFu5agGA5LtqAYCavWoBgN4fbAGAwFBsAYCL7YwBAVJ4AGxpYnB0aHJlYWQuc28uMAACP1IPbGliYy5zby42AA==[/context] creating ssl stream or doing handshake
YY-MM-DDThh:mm:ss.xxxZ Wa(164) Fdm[2100731]: [Originator@6876 sub=IO.Connection opID=WorkQueue-###] Failed to SSL handshake; SSL(<io_obj p:0x0000001c9057dc40, h:10, <TCP '<other_host_IP> : 8182'>, <TCP '<host_IP> : 44570'>>), e: 167772309(digest check failed (SSL routines)), duration: 183msec
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100945]: [Originator@6876 sub=Message opID=WorkQueue-###] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:0A000095:SSL routines::digest check failed)
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100702]: --> [context]zKq7AVECAQAAAI48ewEKZmRtAIA8lIEBZmRtAIAJMWcBgFu5agGA5LtqAYCavWoBgN4fbAGAwFBsAYCL7YwBAVJ4AGxpYnB0aHJlYWQuc28uMAACP1IPbGliYy5zby42AA==[/context] creating ssl stream or doing handshake
YY-MM-DDThh:mm:ss.xxxZ Wa(164) Fdm[2100718]: [Originator@6876 sub=IO.Connection opID=WorkQueue-###] Failed to SSL handshake; SSL(<io_obj p:0x0000001c90612780, h:10, <TCP '<other_host_IP> : 8182'>, <TCP '<host_IP> : 45286'>>), e: 167772261(no suitable key share (SSL routines)), duration: 0msec
YY-MM-DDThh:mm:ss.xxxZ Er(163) Fdm[2100946]: [Originator@6876 sub=Message opID=WorkQueue-###] Error N7Vmacore3Ssl12SSLExceptionE(SSL Exception: error:0A000065:SSL routines::no suitable key share)

  • The certificate management mode, defined in the vCenter Server advanced setting vpxd.certmgmt.mode, is set to "vmca".
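To get a quick overview of which handshake failures dominate in fdm.log, the entries can be grouped by their failure reason. The following is a minimal sketch, assuming the log lines have the same layout as the example above; the function name summarize_handshake_errors is hypothetical and not part of ESXi.

```shell
# Hypothetical helper: count "Failed to SSL handshake" entries per failure reason.
# Assumes the line layout shown in the example above:
#   ... e: <number>(<reason>), duration: <n>msec
summarize_handshake_errors() {
    log_file="${1:-/var/run/log/fdm.log}"
    grep -i "Failed to SSL handshake" "$log_file" |
        # Keep only the text between "e: <number>(" and "), duration"
        sed -n 's/.*e: [0-9]*(\(.*\)), duration.*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Run on an affected host, `summarize_handshake_errors` prints one line per failure reason with its count, most frequent first.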

Environment

VMware vSphere 8.0.x

Cause

This issue can occur when there is a mismatch between the certificate issuer key in the current ESXi host certificates and the VMCA root certificate in the vCenter Server.

If the vCenter advanced option vpxd.certmgmt.mode is set to "vmca", vCenter Server will deploy a new host certificate issued by its embedded certificate authority (VMCA) to any host as soon as it is registered in vCenter. However, if the VMCA root certificate gets replaced or renewed, the host certificate is not updated automatically at the same time.

As a result, the Authority Key Identifier in the host certificate no longer matches the Subject Key Identifier of the VMCA root certificate, because this identifier is specific to each certificate (it is typically derived from the certificate's public key).

 

To verify the issue, identify the certificate currently used as VMCA root certificate and compare its Subject Key Identifier to the Authority Key Identifier of the ESXi host certificate.

Start by listing the Subject Key Identifiers of the entries present in the VECS TRUSTED_ROOTS certificate store in the vCenter Server Appliance (VCSA).

  1. Open an SSH connection to the VCSA and log in as root
  2. Query the TRUSTED_ROOTS store entries with their Subject Key Identifiers:

    # /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | egrep -i -A1 "Alias|X509v3 Subject Key Identifier:"

  3. The output should look like this (AA and BB are placeholders here; real output shows hexadecimal values):

Alias : <alias_ID_1>
Entry type :    Trusted Cert
--
            X509v3 Subject Key Identifier:
                AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA

Alias : <alias_ID_2>
Entry type :    Trusted Cert
--
            X509v3 Subject Key Identifier:
                BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB

  4. Next, review the Authority Key Identifiers of the leaf certificates in vCenter (for example, the machine SSL certificate and one or more of the solution user certificates, such as vpxd or wcp):

# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text | grep -i -A1 "X509v3 Authority Key Identifier:"
            X509v3 Authority Key Identifier:
                BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB
# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store wcp --text | grep -i -A1 "X509v3 Authority Key Identifier:"
            X509v3 Authority Key Identifier:
                BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB
# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store vpxd --text | grep -i -A1 "X509v3 Authority Key Identifier:"
            X509v3 Authority Key Identifier:
                BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB

In this example, BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB is the Authority Key Identifier of the vCenter leaf certificates. Therefore, the entry in TRUSTED_ROOTS that has this value as its Subject Key Identifier is the current VMCA root certificate.
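When TRUSTED_ROOTS holds many entries, matching the identifiers by eye is error-prone. The following is a minimal sketch of how the lookup could be scripted, assuming the text output has the layout shown above (an "Alias" line, followed later by the "X509v3 Subject Key Identifier:" header with the value on the next line). The function name find_root_alias is hypothetical.

```shell
# Hypothetical helper: print the alias of every TRUSTED_ROOTS entry whose
# Subject Key Identifier equals the given Authority Key Identifier.
#   $1    : Authority Key Identifier taken from a vCenter leaf certificate
#   stdin : output of "vecs-cli entry list --store TRUSTED_ROOTS --text"
find_root_alias() {
    awk -v want="$1" '
        BEGIN {
            # Normalize the wanted value: no whitespace, no "keyid:" prefix, upper case
            gsub(/[ \t]/, "", want); sub(/^keyid:/, "", want); want = toupper(want)
        }
        /^Alias/ { sub(/^Alias[ \t]*:[ \t]*/, ""); alias = $0 }
        /X509v3 Subject Key Identifier:/ { grab = 1; next }
        grab {
            # The line after the header carries the identifier itself
            gsub(/[ \t]/, "")
            if (toupper($0) == want) print alias
            grab = 0
        }
    '
}
```

A possible invocation on the VCSA, using the placeholder value from the example:

    # /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | find_root_alias "BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB"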

  5. Now compare those identifiers to the Authority Key Identifier in the ESXi host certificate by using the following openssl command:

# openssl s_client -connect <host_IP_or_FQDN>:443 | openssl x509 -noout -text | grep -i -A1 "X509v3 Authority Key Identifier:"
depth=1 CN = CA, DC = vsphere, DC = local, C = US, ST = California, O = <vcenter_PNID>, OU = VMware Engineering
verify return:1
depth=0 CN = <host_FQDN>, C = US
verify return:1
            X509v3 Authority Key Identifier:
                AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA:AA


In the example, the Authority Key Identifier in the host certificate does not match the Subject Key Identifier of the VMCA root certificate, which is why the issue occurs.
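The comparison can also be scripted so that several hosts are checked the same way. The following is a minimal sketch, assuming openssl is available where it runs and the hosts are reachable on port 443. The function names normalize_keyid, extract_aki, and check_host are hypothetical, and the root Subject Key Identifier must be supplied from the TRUSTED_ROOTS output gathered earlier.

```shell
# Hypothetical helpers for comparing a host certificate against the VMCA root.

# Strip whitespace and an optional "keyid:" prefix, then upper-case the hex digits.
normalize_keyid() {
    tr -d ' \t' | sed 's/^keyid://' | tr 'a-f' 'A-F'
}

# Read "openssl x509 -text" output on stdin and print the Authority Key Identifier.
extract_aki() {
    grep -i -A1 "X509v3 Authority Key Identifier:" | tail -n 1 | normalize_keyid
}

# $1: host IP or FQDN, $2: Subject Key Identifier of the current VMCA root certificate
check_host() {
    root_ski=$(printf '%s' "$2" | normalize_keyid)
    # </dev/null makes s_client exit after the handshake instead of waiting for input
    aki=$(openssl s_client -connect "$1:443" </dev/null 2>/dev/null |
          openssl x509 -noout -text 2>/dev/null | extract_aki)
    if [ -n "$aki" ] && [ "$aki" = "$root_ski" ]; then
        echo "$1: OK"
    else
        echo "$1: MISMATCH (host Authority Key Identifier: ${aki:-not found})"
        return 1
    fi
}
```

For example, `check_host <host_IP_or_FQDN> "BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB:BB"` reports a mismatch for a host that still carries a certificate issued by the previous VMCA root.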

 

In mixed certificate setups, where the machine SSL certificate (MACHINE_SSL_CERT, used for the GUI) is signed by a third-party authority, the result below is expected and does not by itself indicate a problem:

# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store MACHINE_SSL_CERT --text | grep -i -A 1 "X509v3 Authority Key Identifier:"
            X509v3 Authority Key Identifier:
                CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC:CC

Resolution

 

To fix this issue, follow the steps outlined in https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/vsphere-security-8-0/securing-esxi-hosts/certificate-management-for-esxi-hosts/renew-esxi-certificates.html

  1. In vSphere Client, select one of the affected hosts, then go to Configure > Certificate
  2. Click "Manage with VMCA" and select "Refresh CA Certificates" to push the current trusted CA certificates, including the VMCA certificate, into the trusted root certificate store on the ESXi host
  3. Click "Manage with VMCA" again and select "Renew" to install a new VMCA-signed certificate on the ESXi host
  4. Restart hostd and vpxa on the host as described in Restarting Management Agents in ESXi
  5. Repeat these steps for all other affected hosts
  6. Re-enable vSphere HA for the cluster