ESXi hosts constantly sending DNS queries to the DNS server

Products

VMware vSphere ESX 8.x

Issue/Introduction

ESXi hosts constantly sending DNS queries to DNS server
DNS server is getting engaged and makes it slow responding to other DNS queries
/var/run/log/vmkernel.log on ESXi host shows similar logging as below

YYYY-MM-DDTHH:MM:SS.882Z In(182) vmkernel: cpu##:9#####8)Admission failure in path: host/vim/vmvisor/etcd:etcd.9#####7:uw.9#####7

/var/run/log/clusterAgent.log on ESXi might shows any of the log entries below:

YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: E
rror while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: operation was canceled". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: 2025-06-04T05:05:43.483Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN  grpc: addrConn.createTransport failed to connect to {hostfqdn:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: 2025-06-04T05:05:43.490Z      WARN    clientv3/retry_interceptor.go:62        retrying of unary invoker failed        {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = InvalidArgument desc = etcdserver: authentication failed, invalid user ID or password"}

/var/run/log/etcd.log:

Er(3) etcd[######]: failed to find member ############## in cluster ##############
Er(3) etcd[######]: failed to find member ############## in cluster ##############
etcd[######]: peer e6eaf0202e2e2ba4 became inactive (message send to peer failed)
etcd[######]: failed to dial ############## on stream MsgApp v2 (peer ############## failed to find local node ##############)
etcd[######]: failed to dial ############## on stream Message (peer ############## failed to find local node ##############)

Additional symptoms reported

Events report "The file table of the ramdisk 'tmp' is full. As a result, the file /tmp/Go.[file_name] could not be created by the application 'etcd'"
ESXi hosts become "top talkers" on DNS servers by a large margin
High DNS query volume observed from specific hosts

Environment

vSphere ESXi 8.x

Cause

When a DKVS (distributed key-value store) cluster is in an error state, it is known to cause a lot of DNS traffic, as the replica hosts are constantly retrying their connections to each other.

Resolution

Recommended resolution

Upgrade to ESXi 8.0 Update 3g or later. The excessive DNS query behavior caused by DKVS error states has been significantly reduced in this release through a rate-limiter that reduces retry frequency and network overhead. However, the issue is not fully fixed. If DNS query volume remains unacceptable after upgrade, use one of the workaround options below.

Download ESXi 8.0 Update 3g or later from the Broadcom Support Portal.
Follow your standard ESXi upgrade or patching procedure.
After upgrade, monitor DNS query volume to confirm improvement.

If upgrading is not immediately possible, or if the issue persists after upgrade, use one of the workaround options below.

Confirmation

Before applying a workaround, verify whether DKVS is running on each affected ESXi host.

SSH to the ESXi host as root.

Run the following command:

/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status

Review the output to determine the DKVS state.

Example output when DKVS is running:

{
   "state": "hosted",
   "cluster_id": "############",
   "is_in_alarm": false,
   "alarm_cause": "",
   "is_in_cluster": true,
   "members": {
      "available": true
   },
   "namespaces": [
      {
         "name": "root",
         "up_to_date": true,
         "members": [
            {
               "peer_address": "##.##.##.##:##",
               "api_address": "##.##.##.##:##",
               "reachable": true,
               "primary": "no",
               "learner": false
            },
            {
               "peer_address": "##.##.##.##:##",
               "api_address": "##.##.##.##:##",
               "reachable": true,
               "primary": "yes",
               "learner": false
            },
            {
               "peer_address": "##.##.##.##:##",
               "api_address": "##.##.##.##:##",
               "reachable": true,
               "primary": "no",
               "learner": false
            }
         ]
      }
   ]
}

A cluster_id value indicates DKVS is running on the host.

Example output when DKVS is not running:

{
   "state": "standalone",
   "cluster_id": "",
   "is_in_alarm": false,
   "alarm_cause": "",
   "is_in_cluster": false,
   "members": {
      "available": false
   }
}

An empty cluster_id indicates DKVS is not running on the host.

If DKVS is running and upgrade is not possible or did not sufficiently reduce DNS traffic, proceed with one of the workaround options below.

Workaround option 1: Disable DKVS manually

Note: The DKVS service is used during vCenter restore from backup to provide more up-to-date recovery. If you take regular vCenter backups, disabling DKVS is generally safe. If DKVS is disabled and the vCenter backup differs from the current host inventory configuration (host membership, credentials, DVS state), vCenter recovery may cause host disconnects that require manual reconnection.

Part A: Disable DKVS on vCenter

SSH to the vCenter Server as root.

Disable DKVS by running:

/usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg set vpxd/clusterStore/globalDisable true

Restart the vpxd service:
```
vmon-cli -r vpxd
```

Part B: Clear DKVS settings on each ESXi host

After disabling DKVS on vCenter, it may be necessary to clear the DKVS settings on each ESXi host in the cluster.

SSH to the ESXi host as root.
Stop the clusterAgent service:
```
/etc/init.d/clusterAgent stop
```

Remove the clusterAgent data file:

configstorecli files datafile delete -c esx -k cluster_agent_data

Remove the clusterAgent data directory:

configstorecli files datadir delete -c esx -k cluster_agent_data

Repeat steps 1-4 on each ESXi host in the cluster.
Return to the vCenter Server and restart the vpxd service:
```
vmon-cli -r vpxd
```

Part C: Additional options after disabling DKVS

After disabling DKVS, you can also reduce DNS query volume by using one of the following methods:

Add the ESXi hosts to vCenter using the ESXi host IP address instead of the FQDN.
Add mappings from the ESXi host FQDNs to their IP addresses in /etc/hosts on each ESXi host in the cluster.

Workaround option 2: Disable DKVS using script

Use the attached Python script to disable DKVS on the vCenter where affected ESXi hosts are connected.

Note: The DKVS service is used during vCenter restore from backup to provide more up-to-date recovery. If you take regular vCenter backups, disabling DKVS is generally safe. If DKVS is disabled and the vCenter backup differs from the current host inventory configuration (host membership, credentials, DVS state), vCenter recovery may cause host disconnects that require manual reconnection.

Take a snapshot of the vCenter Server.
Download the attached dkvs-cleanup.py script.
SSH to the vCenter Server as root.
Upload the script to the vCenter Server.
Run the script:
```
python3 dkvs-cleanup.py -d disable -w all-soft -s restart
```
Note: The vpxd service will restart during script execution. Running tasks will stop and vCenter will be unavailable for approximately 2 minutes.
Verify DKVS is disabled by running:
```
/usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg get vpxd/clusterStore/globalDisable
```
- Output of true indicates DKVS is disabled.
- Output of false or Key not found indicates DKVS is still enabled. Re-run the script if needed.
Monitor DNS query volume to confirm the issue is resolved.

Re-enabling DKVS after a future fix

Once a future release fully addresses this issue, DKVS can be re-enabled using:

python3 dkvs-cleanup.py -d enable -w actions-soft -s restart

Attachments

dkvs-cleanup.py get_app