ESXi Hosts Constantly Sending DNS Queries to the DNS Server

Article ID: 385346


Products

VMware vSphere ESX 8.x

Issue/Introduction

  • ESXi hosts constantly send DNS queries to the DNS server (see the packet capture sketch after this symptom list)
  • The DNS server becomes overloaded, which slows its responses to other DNS queries
  • Additional symptoms include events reporting "The file table of the ramdisk 'tmp' is full. As a result, the file /tmp/Go.[file_name] could not be created by the application 'etcd'"
  • /var/run/log/vmkernel.log on the ESXi host shows logging similar to the following:
        ####-##-##T##:##:##.882Z In(182) vmkernel: cpu##:9#####8)Admission failure in path: host/vim/vmvisor/etcd:etcd.9#####7:uw.9#####7
  • /var/run/log/clusterAgent.log on the ESXi host may show any of the log entries below:
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
  • /var/run/log/clusterAgent.log may also contain entries such as:
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: operation was canceled". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: operation was canceled". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: operation was canceled". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: 2025-06-04T05:05:43.483Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: 2025-06-04T05:05:43.490Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = InvalidArgument desc = etcdserver: authentication failed, invalid user ID or password"}
####-##-##T##:##:## No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
####-##-##T##:##:## No(5) clusterAgent[#####]: 2025-06-04T05:05:43.492Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = InvalidArgument desc = etcdserver: authentication failed, invalid user ID or password"}
  • /var/run/log/etcd.log:
Er(3) etcd[######]: failed to find member ############## in cluster ##############
Er(3) etcd[######]: failed to find member ############## in cluster ##############
etcd[######]: peer e6eaf0202e2e2ba4 became inactive (message send to peer failed)
etcd[######]: failed to dial ############## on stream MsgApp v2 (peer ############## failed to find local node ##############)
etcd[######]: failed to dial ############## on stream Message (peer ############## failed to find local node ##############)

Environment

vSphere ESXi 8.x

Cause

When a DKVS (Distributed Key-Value Store) cluster is in an error state, it generates a large volume of DNS traffic because the replica hosts constantly retry their connections to each other.

Resolution

 This issue has been resolved in ESXi 8.0 Update 3g, Build 24859861.

Confirmation

  • Verify whether the DKVS (Distributed Key-Value Store) service is running by executing the command below on each ESXi host (a quick one-line check is also sketched after the examples):

    /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status

    Example of DKVS Running on the ESXi host

    [root@ESXi:/] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status

    {
       "state": "hosted",

       "cluster_id": "############",          >>>>>>>>>>>>>>>>  DKVS is running on the host
       "is_in_alarm": false,
       "alarm_cause": "",
       "is_in_cluster": true,
       "members": {
          "available": true
       },
       "namespaces": [
          {
             "name": "root",
             "up_to_date": true,
             "members": [
                {
                   "peer_address": "##.##.##.##:##",
                   "api_address": "##.##.##.##:##",
                   "reachable": true,
                   "primary": "no",
                   "learner": false
                },
                {
                   "peer_address": "##.##.##.##:##",
                   "api_address": "##.##.##.##:##",
                   "reachable": true,
                   "primary": "yes",
                   "learner": false
                },
                {
                   "peer_address": "##.##.##.##:##",
                   "api_address": "##.##.##.##:##",
                   "reachable": true,
                   "primary": "no",
                   "learner": false
                }
             ]
          }
       ]
    }

    Example of DKVS Not Running on the ESXi host

    [root@ESXi:/] /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
    {
       "state": "standalone",
       "cluster_id": "",          >>>>>>>>>>>>>>>>  DKVS is not running on the host as there is no cluster id     
      "is_in_alarm": false,
       "alarm_cause": "",
       "is_in_cluster": false,
       "members": {
          "available": false
       }
    }

If DKVS is enabled and running, the workaround options below can be used to resolve this issue.

Workaround Option 1:

  1. Disable DKVS in vCenter.
    • SSH to the vCenter Server Appliance as root
    • Disable DKVS

      /usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg set vpxd/clusterStore/globalDisable true


    • Restart the vpxd service

      vmon-cli -r vpxd

    • After disabling DKVS on vCenter, it may be necessary to clear the DKVS settings on the ESXi hosts
      • SSH to each of the ESXi hosts within the cluster as root
      • Stop the clusterAgent service

        /etc/init.d/clusterAgent stop
         
      • Remove the clusterAgent data file

        configstorecli files datafile delete -c esx -k cluster_agent_data

      • Remove the clusterAgent data directory

        configstorecli files datadir delete -c esx -k cluster_agent_data

      • Restart the vpxd service

        vmon-cli -r vpxd

  2. Add the ESXi hosts to vCenter using their IP addresses instead of their FQDNs

  3. Add mappings from the ESXi hosts' FQDNs to their IP addresses in /etc/hosts on the ESXi hosts (this must be done on each ESXi host in the cluster; see the example entry after the note below)
Note: The DKVS service is used during a restore of vCenter from backup. It becomes the source of truth if the vCenter backup differs from the host inventory configuration (host membership in a cluster, credentials, DVS state). If this service is disabled, recovery of the vCenter Server may cause host disconnects, and the hosts would need to be reconnected to re-sync host data and configuration.
 

Workaround Option 2:

Disable DKVS, using the attached Python script, on the vCenter to which the affected ESXi hosts are connected.

  • Take a snapshot of the vCenter before running the script, then observe whether the issue is mitigated
  • Disable and wipe DKVS on all current vCenters that manage hosts running versions between 8.0.0 and 8.0.3 P05

The Python script is attached to this article. Run the command below on the vCenter to execute it:

 python3 dkvs-cleanup.py -d disable -w all-soft -s restart

 

  • Impact: While this script is running, the vpxd service will restart; running vCenter tasks will be stopped, and vCenter will be unavailable for approximately 2 minutes
  • You can verify whether DKVS is disabled with the command below (an illustrative run follows the command). If the output is "true", DKVS is disabled. If the output is "false" or "Key not found", DKVS is still enabled. If disabling is needed again, the above script can be re-run

 

/usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg get vpxd/clusterStore/globalDisable

 

Recommendations:

  • The issue is resolved in ESXi 8.0 Update 3g, Build 24859861 (see the Resolution section above).
  • Once the hosts are running a release in which DKVS is fully fixed, DKVS can be safely re-enabled using the following command:
python3 dkvs-cleanup.py -d enable -w actions-soft -s restart

Attachments

dkvs-cleanup.py