The etcd service on the ESXi host fails to start after the host is renamed

Article ID: 385110

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • After an ESXi host is renamed, the etcd service fails to start because it is not automatically reconfigured to use the new hostname.

  • The following log entries may appear:
    • /var/run/log/etcd.log
      • health check for peer <ETCD_MEMBER> could not connect: dial tcp: lookup <ESXi_HOST> on <ESXi_HOST_IP>:53: no such host
    • /var/run/log/vmkernel.log
      • cpu84:(1001394011)VmkAccess: SocketInetConnect:149: etcd: running in etcdDom(49): ipAddr = <IPV6_ADDRESS>::, port = 9: Access denied by vmkernel access control policy
    • /var/run/log/clusterAgent.log
      • No(5) clusterAgent[525412]: WARN   grpc: addrConn.createTransport failed to connect to <ESXi_HOST>:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup <ESXi_HOST>: no such host". Reconnecting...
      • YYYY-MM-DDTHH:MM:SSZ No(5) clusterAgent[525412]: ERROR  Failed to prepare supervisor state

         

  • The ESXi host's etcd configuration file (/vmfs/volumes/#######-####-######/cache/datafiles/esx#cluster_agent_data/etcd.yml) will reference the host's previous name:
    • initial-cluster: <ETCD_CLUSTER_ID>=https://<OLD_HOSTNAME>:2380

  • Running /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status will show that the etcd cluster is in an alarm state and can't reach its members:

    • {
         "state": "hosted",
         "cluster_id": "<CLUSTER_ID>",
         "is_in_alarm": true,
         "alarm_cause": "Timeout",
         "is_in_cluster": true,
         "members": {
            "available": false
         }
      }

       

  • This issue can occur even if WCP/TKG is not in use.
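
The log symptom listed above can be checked quickly from the ESXi shell. The following is a minimal sketch, not a VMware-provided tool: the function name count_lookup_failures is hypothetical, and it relies only on the "no such host" text shown in the log excerpts above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi): count the lines in a log file that
# contain the DNS failure text "no such host" shown in the excerpts above.
# Usage: count_lookup_failures <log_file>
count_lookup_failures() {
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'no such host' "$1"
}
```

For example, count_lookup_failures /var/run/log/etcd.log prints the number of matching lines. A non-zero count suggests the host is still trying to resolve a name that DNS does not know.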

Environment

VMware vSphere ESXi 8.x

Cause

  • When an ESXi host is renamed, its etcd configuration isn't updated automatically. This causes the ESXi hosts in the cluster that are etcd members to attempt to contact each other using their old host names. If there isn't a DNS record for these host names, communication fails and generates messages in /var/run/log/etcd.log.
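
To see which names the etcd members will try to resolve, the host names can be pulled out of the initial-cluster line of etcd.yml. This is a sketch under stated assumptions: the function name initial_cluster_hosts is hypothetical, and it assumes each entry has the form <ETCD_CLUSTER_ID>=https://<HOSTNAME>:2380, separated by commas when there is more than one (standard etcd syntax; this article shows only a single entry).

```shell
#!/bin/sh
# Hypothetical helper: print each hostname found in the initial-cluster line
# of an etcd.yml file, one per line.
# Assumes entries look like <ID>=https://<HOST>:2380, comma-separated when
# there is more than one. IPv6 literals are not handled.
# Usage: initial_cluster_hosts <path_to_etcd.yml>
initial_cluster_hosts() {
    # 1. Keep only the value of the initial-cluster key.
    # 2. Put each comma-separated entry on its own line.
    # 3. Strip everything except the hostname between https:// and the port.
    sed -n 's/^ *initial-cluster: *//p' "$1" |
        tr ',' '\n' |
        sed -n 's|.*https://\([^:]*\):.*|\1|p'
}
```

Each printed name can then be checked against DNS with whatever lookup tool is available on the host. A name that does not resolve corresponds to the "no such host" messages described above.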

Resolution

  • Each vCenter cluster contains three ESXi hosts that are etcd cluster members. The other ESXi hosts in the cluster are not etcd cluster members, so they may not have the etcd.yml configuration file, and the commands below do not need to be run on them.

  • Connect to each ESXi host via SSH and find the etcd.yml configuration file:
    • find /vmfs/volumes -name etcd.yml -print | head -n 1
    • Example output: /vmfs/volumes/<VMFS_VOLUME_ID>/cache/datafiles/esx#cluster_agent_data/etcd.yml

  • Verify that the configuration file references the old hostname:
    • initial-cluster: <ETCD_CLUSTER_ID>=https://<OLD_HOSTNAME>:2380
    • There will be other lines with the old hostname as well. Do not manually update these entries.

  • Run the following command to make the etcd node run in standalone mode:
    • /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster forceStandalone
    • A successful run produces no output.

  • Once you've set all three ESXi hosts that are etcd members to standalone mode, restart the vpxd service on vCenter:
    • service-control --restart vpxd

  • Once vpxd has finished restarting, verify that the etcd.yml file referenced above contains the current hostname.

  • Verify that /var/run/log/etcd.log no longer has messages about being unable to connect to the old hostname.

  • Run the following command to verify the status of the cluster:
    • /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
    • You should see a section named members that references the three ESXi hosts in the cluster that make up the etcd cluster:
      • "members": [
                    {
                       "peer_address": "<ESX_HOST1>:2380",
                       "api_address": "<ESX_HOST1>:2379",
                       "reachable": true,
                       "primary": "no",
                       "learner": false
                    },
                    {
                       "peer_address": "<ESX_HOST2>:2380",
                       "api_address": "<ESX_HOST2>:2379",
                       "reachable": true,
                       "primary": "yes",
                       "learner": false
                    },
                    {
                       "peer_address": "<ESX_HOST3>:2380",
                       "api_address": "<ESX_HOST3>:2379",
                       "reachable": true,
                       "primary": "no",
                       "learner": false
                    }
                 ]
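
The verification steps above can be scripted roughly as follows. This is a minimal sketch, not a supported tool: the function names file_mentions_host and count_reachable_members are hypothetical, and the second one assumes the status output places one key per line, as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helpers for the verification steps; neither ships with ESXi.

# Succeeds (exit 0) if the file contains the given hostname as a fixed string.
# Note: this is a substring match, so "esx1" also matches "esx10".
# Usage: file_mentions_host <file> <hostname>
file_mentions_host() {
    grep -qF -- "$2" "$1"
}

# Reads cluster status output on stdin and prints how many members
# report "reachable": true. Assumes one key per line.
# Usage: <status command> | count_reachable_members
count_reachable_members() {
    grep -c '"reachable": *true'
}
```

After the fix, file_mentions_host run against etcd.yml with <OLD_HOSTNAME> is expected to fail (non-zero exit status), and piping the output of /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status into count_reachable_members is expected to print 3 when all three etcd members are reachable.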

Additional Information

In a vSphere 8 environment, even if no K8S-related services or applications are in use, the cluster will still designate three hosts as etcd nodes and activate the etcdClientComm and etcdPeerComm services on them.

The command to check the etcd cluster status on an ESXi host is: /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status