Thousands of occurrences of the following error in the etcd logs:
etcd[#####]: failed to find member ############ in cluster ############
The system may report that the etcd directory is unstable:
vmkwarning: cpu10:#######)WARNING: UserFile: ####: etcd: Directory changing too often to perform readdir operation (11 retries), returning busy
Frequent restarts of the etcdmain process managed by the watchdog:
watchdog[XXXXXXX]: Started etcdmain with PID=#######
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=#######
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=#######
watchdog[XXXXXXX]: Restarting etcdmain
clusterAgent warnings indicating authentication handshake failures when connecting to peers over port 2379, as in the logs below:
clusterAgent[#######]: WARN grpc: addrConn.createTransport failed to connect to {[ESXi-fqdn:2379] <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...
clusterAgent[#######]: WARN grpc: addrConn.createTransport failed to connect to {[ESXi-fqdn:2379] <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...
The cluster status output shows the primary member state as unknown and a missing api_address when running the following command:
/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
Sample Output:
{ "state": "hosted",
"cluster_id": "c######-####-4##d-85##-####023###2:domain-c####",
"is_in_alarm": true,
"alarm_cause": "Timeout",
"is_in_cluster": true,
"members": {
"available": true
},
"namespaces": [
{
"name": "root",
"up_to_date": true,
"members": [
{
"peer_address": "<ESXi-fqdn:2380>",
"api_address": "",
"reachable": true,
"primary": "unknown",
"learner": false
},
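To confirm that these log entries are present on a suspected host, the messages above can be searched for from the ESXi shell, for example (log file locations can vary by ESXi version, so adjust the path if needed):
grep -i "failed to find member" /var/run/log/*.log
grep -i "authentication handshake failed" /var/run/log/*.log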
VMware vSphere ESXi
The issue is caused by a communication breakdown between the clusterAgent and the etcd cluster. Specifically, an authentication handshake failure prevents the host from establishing a secure gRPC connection to its peers, causing the etcd service to cycle due to a flood of membership lookup errors.
Make sure that the hosts can communicate with each other over ports 2379 and 2380. If there is any communication failure over these ports, resolve the network connectivity issue first. If network connectivity looks good, proceed with the workaround below.
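To check reachability of these ports, nc can be run from the ESXi shell of each host against its peers. This is a minimal sketch; <peer-ESXi-fqdn> is a placeholder for the FQDN or IP address of another host in the cluster:
nc -z <peer-ESXi-fqdn> 2379
nc -z <peer-ESXi-fqdn> 2380
A "succeeded" result (exit status 0) indicates the port is reachable; a timeout or connection refusal points to a network or firewall problem that must be resolved before applying the workaround.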
Workaround:
To resolve this issue and restore cluster communication, perform a graceful reboot of the affected ESXi host to ensure services start in the correct sequence.
Log in to the vSphere Client.
Identify the affected ESXi host.
Right-click the host and select Maintenance Mode > Enter Maintenance Mode. Ensure all Virtual Machines are evacuated or powered off.
Once in Maintenance Mode, right-click the host and select Power > Reboot.
After the host reboots and reconnects to vCenter, exit Maintenance Mode.
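If the host cannot be managed through the vSphere Client, an equivalent sequence can be run from the ESXi shell over SSH. This is a minimal sketch; entering Maintenance Mode from the CLI does not evacuate running Virtual Machines, so migrate or power them off first, and the reboot reason text is only illustrative:
esxcli system maintenanceMode set --enable true
esxcli system shutdown reboot --reason "Graceful reboot to restore etcd/clusterAgent communication"
esxcli system maintenanceMode set --enable false
Run the last command only after the host has finished rebooting and is reachable again.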
Verify the cluster status by logging into the ESXi shell via SSH and running: /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
Confirm that the api_address is no longer empty and the primary status is correctly identified (e.g., true or false, rather than "unknown").
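For a quick check, the two fields can be filtered from the output, for example:
/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status | grep -E '"api_address"|"primary"'
Each member listed should show a populated api_address and a primary value of true or false.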
While it is possible to attempt a restart of the etcd service via the CLI, a full host reboot is the recommended workaround to ensure all dependent service handshakes are correctly re-initialized.
If the issue persists after the host reboot, please open a case with the Broadcom Support team for further investigation.