vSphere HA Behavior During ESXi Certificate Renewal

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

Impact of Certificate Renewal on vSphere HA
VM failover can be initiated by vSphere HA when the ESXi certificate is renewed via vCenter.

Environment

VMware vCenter Server 7.0.x and later
VMware vSphere ESXi 7.0.x and later

Cause

Renewing an ESXi host certificate through vCenter restarts three services on the host: the management agents (vpxa and hostd) and the rhttpproxy service. These restarts can briefly disconnect the host from vCenter. This behavior is expected.

ESXi log events during certificate renewal:

vpxa.log:
info vpxa[303466] [Originator@6876 sub=Default] Received signal to reload SSL certificate
info vpxa[303466] [Originator@6876 sub=Default] Creating SSL Contexts
info vpxa[303466] [Originator@6876 sub=Default] Reloading SSL context for NFC Server
info vpxa[303466] [Originator@6876 sub=Default] Restarting Hostd Client Adapter
error vpxa[303380] [Originator@6876 sub=vpxaInvtHostCnx opID=WFU-207576ba] Can't connect to hostd. Shutting down...
info vpxa[303380] [Originator@6876 sub=Default opID=WFU-207576ba] [Vpxa] Shutting down now
hostd.log:
info hostd[133498] [Originator@6876 sub=Solo] Received signal to reload SSL certificate
info hostd[133498] [Originator@6876 sub=Solo] Updated SSL certificate to 1 server socket(s)
warning hostd[133171] [Originator@6876 sub=IO.Connection opID=66878852] Failed to read buffer from stream; <io_obj p:0x0000007f7430ac78, h:54, <TCP '127.0.0.1 : 8307'>, <TCP '0.0.0.0 : 0'>> e: 104(Connection reset by peer), async: true, duration: 0msec
info hostd[133171] [Originator@6876 sub=SoapAdapter.HTTPService.HttpConnection opID=66878852] Failed to read header; <io_obj p:0x0000007f7430ac78, h:54, <TCP '127.0.0.1 : 8307'>, <TCP '0.0.0.0 : 0'>>: N7Vmacore15SystemExceptionE(Connection reset by peer: The connection is terminated by the remote end with a reset packet. Usually, this is a sign of a network problem, timeout, or service overload.)
--> [context]zKq7AVICAgAAAGsRawEJaG9zdGQAANJCF2xpYnZtYWNvcmUuc28AABygLQCcIjMAT1YuAMzHLQBUAy4AAhE/ATt9AGxpYnB0aHJlYWQuc28uMAACbdEObGliYy5zby42AA==[/context]
info hostd[133171] [Originator@6876 sub=IO.Connection opID=66878852] Failed to shutdown socket; <io_obj p:0x0000007f7430ac78, h:54, <TCP '127.0.0.1 : 8307'>, <TCP '0.0.0.0 : 0'>>, e: 104(shutdown: Connection reset by peer)
info hostd[133169] [Originator@6876 sub=Vimsvc.TaskManager opID=6687887e user=vpxuser] Task Created : haTask-ha-host-vim.host.ServiceSystem.refresh-1756748509
rhttpproxy.log:
info rhttpproxy[132419] [Originator@6876 sub=RhttpProxy] Received signal to reload endpoints.
info rhttpproxy[132419] [Originator@6876 sub=RhttpProxy] Setup new endpoints
info rhttpproxy[132419] [Originator@6876 sub=Default] Processing file: /etc/vmware/rhttpproxy/endpoints.conf.d//ccp-endpoints.conf
info rhttpproxy[132419] [Originator@6876 sub=Default] Processing defaultMappingFile: /etc/vmware/rhttpproxy/endpoints.conf
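
When reviewing a support bundle, the reload markers shown in the excerpts above can be searched for programmatically to find the renewal window. The following is a minimal Python sketch, not a VMware tool. The function name `find_cert_reload_events` and the marker table are illustrative assumptions built only from the sample lines in this article; other builds may word these messages differently.

```python
import re

# Marker substrings copied from the sample log lines in this article.
# Assumption: these messages are stable enough to search for on your build.
RELOAD_MARKERS = {
    "vpxa": "Received signal to reload SSL certificate",
    "hostd": "Received signal to reload SSL certificate",
    "rhttpproxy": "Received signal to reload endpoints",
}

# Matches "<level> <service>[<pid>] [<context>] <message>".
# search() is used so that a leading timestamp, if present, is ignored.
LINE_RE = re.compile(
    r"(?P<level>\w+) (?P<service>\w+)\[(?P<pid>\d+)\] \[[^\]]*\] (?P<message>.*)$"
)


def find_cert_reload_events(lines):
    """Return (service, pid, message) for each line that signals a reload."""
    events = []
    for line in lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue
        marker = RELOAD_MARKERS.get(match["service"])
        if marker and marker in match["message"]:
            events.append((match["service"], int(match["pid"]), match["message"]))
    return events
```

Feeding the function the lines of vpxa.log, hostd.log, and rhttpproxy.log from the same host gives one entry per service that received the reload signal, which helps correlate the disconnect seen in vCenter with the renewal task.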

vCenter log events during certificate renewal:

vpxd.log:
verbose vpxd[07132] [Originator@6876 sub=InvtHostCnx opID=HeartbeatStartHandler-352ba698] Need inventory sync for: [vim.HostSystem:host-10,esxi05.xxxxxx.xxxxx]
verbose vpxd[07132] [Originator@6876 sub=InvtHostCnx opID=HeartbeatStartHandler-352ba698] Queuing host sync; [vim.HostSystem:host-10,esxi05.xxxxxx.xxxxx]
verbose vpxd[07286] [Originator@6876 sub=InvtHostCnx opID=HB-host-10@1746-7a89a6bb] Synchronizing host; [vim.HostSystem:host-10,esxi05.xxxxxx.xxxxx]
verbose vpxd[07286] [Originator@6876 sub=InvtHostCnx opID=HB-host-10@1746-7a89a6bb] Processing vpxa changes: [vim.HostSystem:host-10,esxi06.xxxxxx.xxxxx], gen.no: from 1745 to 1746
verbose vpxd[07286] [Originator@6876 sub=Vmomi opID=HB-host-10@1746-7a89a6bb] [ClientAdapterBase::InvokeOnSoap] Invoke done (esxi05.xxxxxx.xxxxx, vpxapi.VpxaService.getChanges)
info vpxd[07286] [Originator@6876 sub=MoCluster opID=HB-host-10@1746-7a89a6bb] Host [vim.HostSystem:host-10,esxi06.xxxxxx.xxxxxl] has 1 HDCS resources
verbose vpxd[07286] [Originator@6876 sub=MoHost opID=HB-host-10@1746-7a89a6bb] Reserving 1 HDCS resources on host [vim.HostSystem:host-10,esxi06.xxxxxx.xxxxx]
verbose vpxd[07286] [Originator@6876 sub=MoHost opID=HB-host-10@1746-7a89a6bb] [vim.HostSystem:host-10,esxi06.xxxxxx.xxxxx]: cpuCapacity=2902 memCapacity=3934
verbose vpxd[07286] [Originator@6876 sub=ResMgr opID=HB-host-10@1746-7a89a6bb] Reloading host [vim.HostSystem:host-10,esxi05.xxxxxx.xxxxx]
verbose vpxd[07286] [Originator@6876 sub=InvtHostCnx opID=HB-host-10@1746-7a89a6bb] Done synchronizing host; [vim.HostSystem:host-10,esxi05.xxxxxx.xxxxx]
info vpxd[08620] [Originator@6876 sub=certmgrLogger opID=m7jjkoa3-40046-auto-uwg-h5:70008454-e7] Refreshing certs on 1 hosts using 5 threads

Impact on High Availability (HA):

Because vSphere HA detects the host disconnect caused by the certificate renewal, it may initiate a VM failover.
HA handles the failover automatically if sufficient resources are available to accommodate the VMs.
Renewing the certificate on any host in the HA cluster initiates an HA reconfiguration.
Renewing the certificate on the primary host initiates a re-election of the primary host.
All of these behaviors are expected in this scenario.

FDM log events after certificate renewal:

fdm.log:
- time the service was last started YYYY-MM-DDThh:mm:ss.487Z, Section for VMware Fault Domain Manager, pid=357900, version=7.0.3, build=24321951, option=Release
info fdm[357901] [Originator@6876 sub=Default] Initializing SSL
info fdm[357900] [Originator@6876 sub=Libs] lib/ssl: OpenSSL using FIPS_drbg for RAND
info fdm[357900] [Originator@6876 sub=Libs] lib/ssl: protocol list tls1.2
info fdm[357900] [Originator@6876 sub=Libs] lib/ssl: protocol list tls1.2 (openssl flags 0x17000000)
info fdm[357900] [Originator@6876 sub=Libs] lib/ssl: cipher list ECDHE+AESGCM:RSA+AESGCM:ECDHE+AES:RSA+AES
info fdm[357900] [Originator@6876 sub=Libs] lib/ssl: curves list prime256v1:secp384r1:secp521r1
info fdm[357900] [Originator@6876 sub=Default] Vmacore::InitSSL: handshakeTimeoutUs = 20000000
info fdm[357900] [Originator@6876 sub=Default] Service is running in FIPS mode.
info fdm[357900] [Originator@6876 sub=Default] Creating SSL Contexts
info fdm[357901] [Originator@6876 sub=Libs] lib/ssl: OpenSSL using FIPS_drbg for RAND
info fdm[357901] [Originator@6876 sub=Libs] lib/ssl: protocol list tls1.2
info fdm[357901] [Originator@6876 sub=Libs] lib/ssl: protocol list tls1.2 (openssl flags 0x17000000)
info fdm[357901] [Originator@6876 sub=Libs] lib/ssl: cipher list ECDHE+AESGCM:RSA+AESGCM:ECDHE+AES:RSA+AES
info fdm[357914] [Originator@6876 sub=ThreadPool] Entering worker thread loop
info fdm[357901] [Originator@6876 sub=Libs] lib/ssl: curves list prime256v1:secp384r1:secp521r1
info fdm[357900] [Originator@6876 sub=Default] [Fdm_Main] Starting VMware Fault Domain Manager 7.0.3 build-24321951
info fdm[357901] [Originator@6876 sub=Default] Vmacore::InitSSL: handshakeTimeoutUs = 20000000
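
The FDM restart that accompanies the HA reconfiguration can be confirmed by looking for the `[Fdm_Main] Starting VMware Fault Domain Manager` line shown above. The following Python sketch extracts the process ID, version, and build from such lines. The function name `find_fdm_starts` is hypothetical, and the pattern is derived only from the sample line in this article, so it may need adjusting for other releases.

```python
import re

# Pattern based on the sample fdm.log start line in this article.
# Assumption: the start message keeps the "<version> build-<number>" form.
FDM_START_RE = re.compile(
    r"fdm\[(?P<pid>\d+)\].*\[Fdm_Main\] Starting VMware Fault Domain Manager "
    r"(?P<version>[\d.]+) build-(?P<build>\d+)"
)


def find_fdm_starts(lines):
    """Return one dict per FDM start message found in the given log lines."""
    starts = []
    for line in lines:
        match = FDM_START_RE.search(line)
        if match:
            starts.append(
                {
                    "pid": int(match["pid"]),
                    "version": match["version"],
                    "build": int(match["build"]),
                }
            )
    return starts
```

More than one start entry in a short period suggests that the HA agent was restarted repeatedly, which is worth correlating with the certificate renewal tasks in vpxd.log.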

Lab Testing:

Test setup: a three-host ESXi cluster with HA enabled.
Primary host: esxi05

Lab Test 1: Certificate renewed on the primary host in the HA cluster

Certificate renewal was initiated for ESXi host esxi05, which triggered HA reconfiguration.
HA agent became unreachable.
HA state changed to Uninitialized.
HA state changed to Election for all hosts.

Lab Test 2: Certificate renewed on a non-primary host in the HA cluster

Certificate renewal was initiated for ESXi host esxi06, which triggered HA reconfiguration.
Host state changed to Not Responding and HA state changed to Agent Unreachable.

Observation:

In some instances, the disconnect is so brief that it goes unnoticed in the vCenter UI. For example, in Lab Test 1, the disconnect on esxi05 was minimal and did not impact HA behavior.
However, in Lab Test 2, esxi06 changed to a Not Responding state, indicating a complete disconnect from vCenter. In this situation, HA may initiate a VM failover.

Resolution

Example: CPU reservations on the VMs can impact the failover process by preventing HA from placing the VMs on other hosts.

Recommendations:
To mitigate HA failover issues caused by resource constraints during certificate renewal, consider the following best practices:

Temporarily remove resource reservations on VMs before renewing the certificate, so that a failover can still place the VMs if cluster resources are limited.
Ensure sufficient cluster resources are available to handle failover scenarios if an unexpected host disconnect occurs.
Temporarily disable vSphere HA during the certificate renewal process to prevent unnecessary failover events.

Note: If HA fails to place a VM on a compatible host due to resource limitations, the cause is the availability of cluster resources, not the certificate renewal.

By implementing these best practices, administrators can reduce the impact of certificate renewal on HA behavior and avoid unnecessary VM disruptions.
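
The resource check recommended above can be approximated before the renewal with a rough calculation. The following Python sketch is a simplified first-fit heuristic and does not reproduce the placement logic that vSphere HA actually uses. The `Host` class, the function `can_restart_all`, and all numbers are hypothetical illustrations. Because first-fit is a heuristic, a `False` result means that this simple strategy found no placement, not that none exists.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """A surviving host, described only by CPU capacity and reservations (MHz)."""
    name: str
    cpu_capacity_mhz: int
    cpu_reserved_mhz: int  # sum of CPU reservations of VMs already on the host

    @property
    def cpu_unreserved_mhz(self):
        return self.cpu_capacity_mhz - self.cpu_reserved_mhz


def can_restart_all(vm_reservations_mhz, surviving_hosts):
    """First-fit-decreasing check: place the largest reservations first."""
    free = {host.name: host.cpu_unreserved_mhz for host in surviving_hosts}
    for need in sorted(vm_reservations_mhz, reverse=True):
        # Pick the first host with enough unreserved CPU for this VM.
        target = next((name for name, mhz in free.items() if mhz >= need), None)
        if target is None:
            return False
        free[target] -= need
    return True


# Hypothetical example: two surviving hosts, two VMs to restart.
hosts = [Host("hostA", 2902, 2000), Host("hostB", 2902, 1500)]
print(can_restart_all([1000, 800], hosts))
```

If the check fails for the VMs on the host whose certificate is being renewed, lowering reservations or freeing capacity first is consistent with the recommendations above.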

Additional Information

Renew or Refresh ESXi Certificates: https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/7-0/vsphere-security-7-0/securing-esxi-hosts/certificate-management-for-esxi-hosts/renew-esxi-certificates.html