etcd service on ESXi host fails after the host is renamed

Products

VMware vSphere ESXi

Issue/Introduction

After an ESXi host is renamed, the etcd service fails because it is not automatically reconfigured to use the new hostname.
You may see the following log entries:
- /var/log/etcd.log
  - health check for peer <ETCD_MEMBER> could not connect: dial tcp: lookup <ESXi_HOST> on <ESXi_HOST_IP>:53: no such host
- /var/log/vmkernel.log
  - cpu84:1001394011)VmkAccess: SocketInetConnect:149: etcd: running in etcdDom(49): ipAddr = <IPV6_ADDRESS>::, port = 9: Access denied by vmkernel access control policy
- /var/log/clusterAgent.log
  - No(5) clusterAgent[525412]: WARN grpc: addrConn.createTransport failed to connect to {<ESXi_HOST>:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup <ESXi_HOST>: no such host". Reconnecting...
  - 2025-01-01T13:37:35Z No(5) clusterAgent[525412]: ERROR Failed to prepare supervisor state
The ESXi host's etcd configuration file (etcd.yml) will reference the host's previous name:
- initial-cluster: <ETCD_CLUSTER_ID>=https://<OLD_HOSTNAME>:2380
Running /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status will show that the etcd cluster is in an alarm state and can't reach its members:
- {
  "state": "hosted",
  "cluster_id": "<CLUSTER_ID>",
  "is_in_alarm": true,
  "alarm_cause": "Timeout",
  "is_in_cluster": true,
  "members": {
  "available": false
  }
  }
This issue can occur even if WCP/TKG isn't in use.
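
The alarm state shown above can also be checked from a script. The following is a minimal sketch, assuming the status output contains the "is_in_alarm" field exactly as in the example above. The function name etcd_in_alarm is hypothetical and is not part of clusterAdmin.

```shell
# Hypothetical helper (not part of clusterAdmin): reads the output of
# "clusterAdmin cluster status" on stdin and exits 0 when that output
# contains "is_in_alarm": true, otherwise exits non-zero.
etcd_in_alarm() {
    grep -Eq '"is_in_alarm"[[:space:]]*:[[:space:]]*true'
}

# Example use on an ESXi host (shown as a comment, not executed here):
#   /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status | etcd_in_alarm \
#       && echo "etcd cluster is in an alarm state"
```

The check is a plain text match rather than a JSON parse, so it only confirms that the field is present with the value true.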

Environment

vSphere 8

Cause

When an ESXi host is renamed, its etcd configuration isn't updated automatically. As a result, the ESXi hosts in the cluster that are etcd members attempt to contact each other using their old hostnames. If there is no DNS record for those hostnames, communication fails and error messages are written to /var/log/etcd.log.
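
To see which hostnames etcd is failing to resolve, the "no such host" entries can be summarized. This is a minimal sketch, assuming the log lines follow the etcd.log format shown in the Issue section ("lookup <host> on <dns_ip>:53: no such host"). The function name unresolved_hosts is hypothetical.

```shell
# Hypothetical helper: prints each distinct hostname that appears in
# "lookup <host> on <dns_ip>:53: no such host" lines of the given log file.
# Lines that do not match this pattern are ignored.
unresolved_hosts() {
    sed -n 's/.*lookup \([^ ]*\) on .*no such host.*/\1/p' "$1" | sort -u
}

# Example use on an ESXi host (shown as a comment, not executed here):
#   unresolved_hosts /var/log/etcd.log
```

If the names printed are the old hostnames, that is consistent with the cause described above.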

Resolution

1. Each vCenter cluster contains three ESXi hosts that are etcd cluster members. The remaining ESXi hosts in the cluster are not etcd members, so they may not have the etcd.yml configuration file, and the commands below do not need to be run on them.
2. SSH to each ESXi host and find the etcd.yml configuration file:
  - find /vmfs/volumes -name etcd.yml -print | head -n 1
    - /vmfs/volumes/<VMFS_VOLUME_ID>/cache/datafiles/esx#cluster_agent_data/etcd.yml
3. Verify that the configuration file references the old hostname:
  - initial-cluster: <ETCD_CLUSTER_ID>=https://<OLD_HOSTNAME>:2380
  - There will be other lines with the old hostname as well. Do not manually update these entries.
4. Run the following command to put the etcd node into standalone mode:
  - /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster forceStandalone (no output means it's a success)
5. Once you've set all three ESXi hosts that are etcd members to standalone mode, restart the vpxd service on vCenter:
  - service-control --restart vpxd
6. Once vpxd has finished restarting, verify that the etcd.yml file referenced above contains the current hostname.
7. Verify that /var/log/etcd.log no longer contains messages about being unable to connect to the old hostname.
8. Run the following command to verify the status of the cluster:
  - /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
  - You should see a section named members that references the three ESXi hosts in the cluster that make up the etcd cluster:
    - "members": [
      {
      "peer_address": "<ESX_HOST1>:2380",
      "api_address": "<ESX_HOST1>:2379",
      "reachable": true,
      "primary": "no",
      "learner": false
      },
      {
      "peer_address": "<ESX_HOST2>:2380",
      "api_address": "<ESX_HOST2>:2379",
      "reachable": true,
      "primary": "yes",
      "learner": false
      },
      {
      "peer_address": "<ESX_HOST3>:2380",
      "api_address": "<ESX_HOST3>:2379",
      "reachable": true,
      "primary": "no",
      "learner": false
      }
      ]
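
The verification in steps 3, 6, and 8 can be sketched as two small shell helpers. This is an illustration only, assuming the etcd.yml and status output formats shown above. The function names etcd_yml_mentions_host and all_members_reachable are hypothetical, and neither helper changes any configuration.

```shell
# Hypothetical helper: exits 0 if FILE ($1) contains HOST ($2) as a literal
# substring. Because this is a substring match, a short name can also match
# inside a longer hostname, so pass the full hostname.
etcd_yml_mentions_host() {
    grep -Fq -- "$2" "$1"
}

# Hypothetical helper: reads "clusterAdmin cluster status" output on stdin.
# Exits 0 only if at least one member reports "reachable": true and no
# member reports "reachable": false.
all_members_reachable() {
    status=$(cat)
    printf '%s\n' "$status" | grep -Eq '"reachable"[[:space:]]*:[[:space:]]*true' || return 1
    if printf '%s\n' "$status" | grep -Eq '"reachable"[[:space:]]*:[[:space:]]*false'; then
        return 1
    fi
    return 0
}

# Example use on an ESXi host (shown as comments, not executed here):
#   etcd_yml_mentions_host "<path to etcd.yml>" "<OLD_HOSTNAME>" && echo "old hostname still present"
#   /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status | all_members_reachable \
#       && echo "all etcd members reachable"
```

Before the fix, the first helper should succeed for the old hostname. After vpxd has restarted, it should succeed for the current hostname, and the second helper should succeed.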

Additional Information

In a vSphere 8 environment, even if no Kubernetes-related services or applications are in use, the cluster selects three hosts as etcd nodes and enables the etcdClientComm and etcdPeerComm services.

To check the etcd cluster status on an ESXi host, run the following command:

/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status