ESXi hosts constantly sending DNS queries to the DNS server

Article ID: 385346

Products

VMware vSphere ESX 8.x

Issue/Introduction

  • ESXi hosts constantly send DNS queries to the DNS server
  • The DNS server becomes overloaded and responds slowly to other DNS queries
  • /var/run/log/vmkernel.log on the ESXi host shows entries similar to the following:
YYYY-MM-DDTHH:MM:SS.882Z In(182) vmkernel: cpu##:9#####8)Admission failure in path: host/vim/vmvisor/etcd:etcd.9#####7:uw.9#####7
  • /var/run/log/clusterAgent.log on the ESXi host might show any of the following log entries:
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: operation was canceled". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: 2025-06-04T05:05:43.483Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: 2025-06-04T05:05:43.490Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = InvalidArgument desc = etcdserver: authentication failed, invalid user ID or password"}
  • /var/run/log/etcd.log on the ESXi host might show entries similar to the following:
Er(3) etcd[######]: failed to find member ############## in cluster ##############
Er(3) etcd[######]: failed to find member ############## in cluster ##############
etcd[######]: peer e6eaf0202e2e2ba4 became inactive (message send to peer failed)
etcd[######]: failed to dial ############## on stream MsgApp v2 (peer ############## failed to find local node ##############)
etcd[######]: failed to dial ############## on stream Message (peer ############## failed to find local node ##############)

Additional symptoms reported

  • Events report "The file table of the ramdisk 'tmp' is full. As a result, the file /tmp/Go.[file_name] could not be created by the application 'etcd'"
  • ESXi hosts become "top talkers" on DNS servers by a large margin
  • High DNS query volume observed from specific hosts
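
The log signatures listed in this section can be checked in bulk rather than by eye. The following is a minimal sketch, not a supported tool. The patterns are taken from the excerpts above, while the label names and the count_signatures function are illustrative assumptions.

```python
import re
from collections import Counter

# Patterns are drawn from the log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
SIGNATURES = {
    "etcd_admission_failure": re.compile(
        r"Admission failure in path: host/vim/vmvisor/etcd"),
    "clusteragent_connect_failed": re.compile(
        r"addrConn\.createTransport failed to connect"),
    "etcd_auth_error": re.compile(
        r"etcdserver: (invalid auth token|authentication failed)"),
    "etcd_member_missing": re.compile(
        r"failed to find member \S+ in cluster"),
    "etcd_peer_dial_failed": re.compile(
        r"failed to dial \S+ on stream"),
}


def count_signatures(lines):
    """Count how many lines match each known symptom signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

To use it, pass the lines of a copied log file, for example the contents of clusterAgent.log or etcd.log read with open(path).readlines(). A high count across several labels is consistent with the symptoms described here, but it does not by itself prove that DKVS is the cause.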

Environment

vSphere ESXi 8.x

Cause

When a DKVS (distributed key-value store) cluster is in an error state, the replica hosts continuously retry their connections to each other, which generates a high volume of DNS traffic.

Resolution

Upgrade to ESXi 8.0 Update 3g or later. This release adds a rate limiter that reduces retry frequency and network overhead, which significantly reduces the excessive DNS queries caused by DKVS error states. However, the issue is not fully fixed.

  1. Download ESXi 8.0 Update 3g or later from the Broadcom Support Portal.
  2. Follow your standard ESXi upgrade or patching procedure.
  3. After upgrade, monitor DNS query volume to confirm improvement.
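
One way to confirm improvement is to compare how much of the DNS query load comes from a given ESXi host before and after the upgrade. The sketch below assumes you have already extracted the source IP address of each query from your DNS server's query log. That extraction depends on the DNS server and is not shown. The function names and the sample addresses are hypothetical.

```python
from collections import Counter


def top_talkers(client_ips, n=5):
    """Return the n most frequent query sources as (ip, count) pairs."""
    return Counter(client_ips).most_common(n)


def query_share(client_ips, host_ip):
    """Return the fraction of observed queries sent by host_ip."""
    ips = list(client_ips)
    if not ips:
        return 0.0  # no queries observed, so no share to report
    return ips.count(host_ip) / len(ips)
```

Run both functions over samples of equal duration taken before and after the upgrade. If the ESXi host still dominates the top_talkers list afterwards, consider the workaround options.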

If upgrading is not immediately possible, or if the issue persists after upgrade, use one of the workaround options below.

Confirmation

Before applying a workaround, verify whether DKVS is running on each affected ESXi host.

  1. SSH to the ESXi host as root.
  2. Run the following command:
    /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
    
  3. Review the output to determine the DKVS state.

Example output when DKVS is running:

{
   "state": "hosted",
   "cluster_id": "############",
   "is_in_alarm": false,
   "alarm_cause": "",
   "is_in_cluster": true,
   "members": {
      "available": true
   },
   "namespaces": [
      {
         "name": "root",
         "up_to_date": true,
         "members": [
            {
               "peer_address": "##.##.##.##:##",
               "api_address": "##.##.##.##:##",
               "reachable": true,
               "primary": "no",
               "learner": false
            },
            {
               "peer_address": "##.##.##.##:##",
               "api_address": "##.##.##.##:##",
               "reachable": true,
               "primary": "yes",
               "learner": false
            },
            {
               "peer_address": "##.##.##.##:##",
               "api_address": "##.##.##.##:##",
               "reachable": true,
               "primary": "no",
               "learner": false
            }
         ]
      }
   ]
}

A non-empty cluster_id value indicates that DKVS is running on the host.

Example output when DKVS is not running:

{
   "state": "standalone",
   "cluster_id": "",
   "is_in_alarm": false,
   "alarm_cause": "",
   "is_in_cluster": false,
   "members": {
      "available": false
   }
}

An empty cluster_id indicates DKVS is not running on the host.
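
The cluster_id check described above can be applied programmatically when many hosts need to be reviewed. This is a minimal sketch that assumes the command output has been captured as a JSON string. The function names are illustrative, and the field names are those shown in the example outputs.

```python
import json


def dkvs_is_running(status_json):
    """Return True if the status output reports a non-empty cluster_id."""
    status = json.loads(status_json)
    return bool(status.get("cluster_id"))


def unreachable_members(status_json):
    """List the peer_address of each member not reported as reachable."""
    status = json.loads(status_json)
    return [
        member.get("peer_address")
        for namespace in status.get("namespaces", [])
        for member in namespace.get("members", [])
        if not member.get("reachable", False)
    ]
```

The second function is an optional extra. It reads only the reachable field from the example output and makes no claim about how the cluster agent itself evaluates member health.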

If DKVS is running and upgrade is not possible or did not sufficiently reduce DNS traffic, proceed with one of the workaround options below.


Workaround option 1: Disable DKVS manually

Note: The DKVS service is used during vCenter restore from backup to provide more up-to-date recovery. If you take regular vCenter backups, disabling DKVS is generally safe. If DKVS is disabled and the vCenter backup differs from the current host inventory configuration (host membership, credentials, DVS state), vCenter recovery may cause host disconnects that require manual reconnection.

Part A: Disable DKVS on vCenter

  1. SSH to the vCenter Server as root.
  2. Disable DKVS by running:
    /usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg set vpxd/clusterStore/globalDisable true
    
  3. Restart the vpxd service:
    vmon-cli -r vpxd
    

Part B: Clear DKVS settings on each ESXi host

After disabling DKVS on vCenter, it may be necessary to clear the DKVS settings on each ESXi host in the cluster.

  1. SSH to the ESXi host as root.
  2. Stop the clusterAgent service:
    /etc/init.d/clusterAgent stop
    
  3. Remove the clusterAgent data file:
    configstorecli files datafile delete -c esx -k cluster_agent_data
    
  4. Remove the clusterAgent data directory:
    configstorecli files datadir delete -c esx -k cluster_agent_data
    
  5. Repeat steps 1-4 on each ESXi host in the cluster.
  6. Return to the vCenter Server and restart the vpxd service:
    vmon-cli -r vpxd
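
When a cluster contains many hosts, it can help to write out the full sequence of Part B commands before running anything. The sketch below only builds a dry-run plan and does not connect to any system. The commands are the ones listed in the steps above, while the build_plan function and the host names in the usage note are hypothetical.

```python
# Commands copied from Part B, steps 2 to 4, in the order given.
ESXI_CLEANUP_COMMANDS = [
    "/etc/init.d/clusterAgent stop",
    "configstorecli files datafile delete -c esx -k cluster_agent_data",
    "configstorecli files datadir delete -c esx -k cluster_agent_data",
]

# Command from Part B, step 6, run once on the vCenter Server at the end.
VCENTER_RESTART_COMMAND = "vmon-cli -r vpxd"


def build_plan(esxi_hosts, vcenter):
    """Return an ordered list of (target, command) pairs for review."""
    plan = [
        (host, command)
        for host in esxi_hosts
        for command in ESXI_CLEANUP_COMMANDS
    ]
    plan.append((vcenter, VCENTER_RESTART_COMMAND))
    return plan
```

For example, build_plan(["esx01.example.com", "esx02.example.com"], "vcenter.example.com") yields three commands per ESXi host followed by a single vpxd restart. Printing the plan gives a checklist to follow over SSH.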
    

Part C: Additional options after disabling DKVS

After disabling DKVS, you can also reduce DNS query volume by using one of the following methods:

  • Add the ESXi hosts to vCenter using the ESXi host IP address instead of the FQDN.
  • Add mappings from the ESXi host FQDNs to their IP addresses in /etc/hosts on each ESXi host in the cluster.
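
For the /etc/hosts method, the same set of entries is needed on every host in the cluster, so generating them from a single mapping helps avoid typing errors. This is a minimal sketch. The hosts_entries function is illustrative, and the FQDNs and addresses in the usage note are placeholders.

```python
import ipaddress


def hosts_entries(mapping):
    """Format a {fqdn: ip} mapping as /etc/hosts lines, sorted by FQDN."""
    lines = []
    for fqdn, ip in sorted(mapping.items()):
        # Raises ValueError if the address is malformed.
        ipaddress.ip_address(ip)
        # /etc/hosts format: address first, then the host name.
        lines.append(f"{ip}\t{fqdn}")
    return lines
```

For example, hosts_entries({"esx01.example.com": "192.0.2.11"}) returns one line with the address, a tab, and the FQDN. Review the generated lines before appending them to /etc/hosts on each ESXi host.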

Workaround option 2: Disable DKVS using a script

Use the attached Python script to disable DKVS on the vCenter Server to which the affected ESXi hosts are connected.

Note: The DKVS service is used during vCenter restore from backup to provide more up-to-date recovery. If you take regular vCenter backups, disabling DKVS is generally safe. If DKVS is disabled and the vCenter backup differs from the current host inventory configuration (host membership, credentials, DVS state), vCenter recovery may cause host disconnects that require manual reconnection.

  1. Take a snapshot of the vCenter Server.

  2. Download the attached dkvs-cleanup.py script.

  3. SSH to the vCenter Server as root.

  4. Upload the script to the vCenter Server.

  5. Run the script:

    python3 dkvs-cleanup.py -d disable -w all-soft -s restart
    

    Note: The vpxd service will restart during script execution. Running tasks will stop and vCenter will be unavailable for approximately 2 minutes.

  6. Verify DKVS is disabled by running:

    /usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg get vpxd/clusterStore/globalDisable
    
    • Output of true indicates DKVS is disabled.
    • Output of false or Key not found indicates DKVS is still enabled. Re-run the script if needed.
  7. Monitor DNS query volume to confirm the issue is resolved.
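
The interpretation of the verification output in step 6 can be captured in a small helper if the check is scripted. This is a sketch that assumes the command output has been captured as a string. The dkvs_disabled function name is hypothetical.

```python
def dkvs_disabled(get_output):
    """Return True only when the captured 'get' output is 'true'.

    Any other output, including 'false' and 'Key not found',
    is treated as DKVS still being enabled, as described in step 6.
    Surrounding whitespace and letter case are ignored.
    """
    return get_output.strip().lower() == "true"
```

If the function returns False after the script has run, re-run the script as noted in step 6.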


Re-enabling DKVS after a future fix

Once a future release fully addresses this issue, DKVS can be re-enabled using:

python3 dkvs-cleanup.py -d enable -w actions-soft -s restart

Attachments

dkvs-cleanup.py