YYYY-MM-DDTHH:MM:SS.882Z In(182) vmkernel: cpu##:9#####8)Admission failure in path: host/vim/vmvisor/etcd:etcd.9#####7:uw.9#####7
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN grpc: addrConn.createTransport failed to connect to {hostfqdn:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN grpc: addrConn.createTransport failed to connect to {hostfqdn:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: operation was canceled". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN grpc: addrConn.createTransport failed to connect to {hostfqdn:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: 2025-06-04T05:05:43.483Z WARN clientv3/retry_interceptor.go:62 retrying of unary invoker failed {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = Unauthenticated desc = etcdserver: invalid auth token"}
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: WARN grpc: addrConn.createTransport failed to connect to {hostfqdn:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp hostip:2379: connect: connection refused". Reconnecting...
YYYY-MM-DDTHH:MM:SS No(5) clusterAgent[#####]: 2025-06-04T05:05:43.490Z WARN clientv3/retry_interceptor.go:62 retrying of unary invoker failed {"target": "endpoint://client-#####-####-###-####-########/hostfqdn:2379", "attempt": 0, "error": "rpc error: code = InvalidArgument desc = etcdserver: authentication failed, invalid user ID or password"}
Er(3) etcd[######]: failed to find member ############## in cluster ##############
Er(3) etcd[######]: failed to find member ############## in cluster ##############
etcd[######]: peer e6eaf0202e2e2ba4 became inactive (message send to peer failed)
etcd[######]: failed to dial ############## on stream MsgApp v2 (peer ############## failed to find local node ##############)
etcd[######]: failed to dial ############## on stream Message (peer ############## failed to find local node ##############)
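The symptom messages above can be tallied quickly when triaging a host. The snippet below is a minimal sketch: it counts occurrences of each pattern, using a few sample lines as a stand-in for the live clusterAgent log (the real log location on the ESXi host may vary; point grep at it instead of sample.log).

```shell
# Count each symptom pattern. sample.log stands in for the live clusterAgent
# log on an ESXi host; point grep at the real file instead when triaging.
cat > sample.log <<'EOF'
WARN grpc: addrConn.createTransport failed ... connection refused
WARN clientv3/retry_interceptor.go:62 ... invalid auth token
Er(3) etcd[######]: failed to find member ###### in cluster ######
EOF
for pattern in 'connection refused' 'invalid auth token' 'failed to find member'; do
  printf '%s: %d\n' "$pattern" "$(grep -c "$pattern" sample.log)"
done
```

A non-zero count for any of these patterns suggests the host is affected by the DKVS error state described below.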
Additional symptoms reported
vSphere ESXi 8.x
When a DKVS (distributed key-value store) cluster is in an error state, it is known to generate a high volume of DNS traffic, because the replica hosts constantly retry their connections to each other.
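One way to quantify the extra DNS load is to capture port-53 traffic on the host's management interface and bucket queries by second. The pipeline below is a sketch: the two input lines are hypothetical samples so it can run anywhere, but on a live host the input would come from tcpdump-uw (which ships with ESXi), e.g. `tcpdump-uw -i vmk0 -nn port 53`.

```shell
# Bucket DNS queries by whole second. dns-sample.txt holds two hypothetical
# capture lines; on a live host, feed the pipeline from:
#   tcpdump-uw -i vmk0 -nn port 53
printf '%s\n' \
  '05:05:43.483271 IP esx01.50000 > dns.53: A? esx02.example.com.' \
  '05:05:43.490115 IP esx01.50001 > dns.53: A? esx02.example.com.' \
  > dns-sample.txt
cut -d. -f1 dns-sample.txt | sort | uniq -c    # queries per second
```

On an affected host, sustained counts of many queries per second for the same peer names are consistent with this issue.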
Upgrade to ESXi 8.0 Update 3g or later. This release adds a rate limiter that reduces retry frequency, significantly cutting the DNS query volume and network overhead caused by DKVS error states. The issue is not fully fixed, however. If upgrading is not immediately possible, or if DNS query volume remains unacceptable after the upgrade, use one of the workaround options below.
Before applying a workaround, verify whether DKVS is running on each affected ESXi host by running:
/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
Example output when DKVS is running:
{
  "state": "hosted",
  "cluster_id": "############",
  "is_in_alarm": false,
  "alarm_cause": "",
  "is_in_cluster": true,
  "members": {
    "available": true
  },
  "namespaces": [
    {
      "name": "root",
      "up_to_date": true,
      "members": [
        {
          "peer_address": "##.##.##.##:##",
          "api_address": "##.##.##.##:##",
          "reachable": true,
          "primary": "no",
          "learner": false
        },
        {
          "peer_address": "##.##.##.##:##",
          "api_address": "##.##.##.##:##",
          "reachable": true,
          "primary": "yes",
          "learner": false
        },
        {
          "peer_address": "##.##.##.##:##",
          "api_address": "##.##.##.##:##",
          "reachable": true,
          "primary": "no",
          "learner": false
        }
      ]
    }
  ]
}
A non-empty cluster_id value indicates that DKVS is running on the host.
Example output when DKVS is not running:
{
  "state": "standalone",
  "cluster_id": "",
  "is_in_alarm": false,
  "alarm_cause": "",
  "is_in_cluster": false,
  "members": {
    "available": false
  }
}
An empty cluster_id indicates DKVS is not running on the host.
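Since the determining field is cluster_id, the check can be scripted. A minimal sketch, assuming python3 is available where the output is inspected: the echoed JSON mimics the "standalone" example above, and on a live host you would pipe in the output of the clusterAdmin command instead.

```shell
# Classify a host from its clusterAdmin JSON: non-empty cluster_id => running.
# The echoed document mimics the "standalone" example; on a live host, pipe in
# the output of: /usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
echo '{"state": "standalone", "cluster_id": "", "is_in_cluster": false}' |
python3 -c '
import json, sys
status = json.load(sys.stdin)
print("DKVS running" if status.get("cluster_id") else "DKVS not running")
'
```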
If DKVS is running and upgrade is not possible or did not sufficiently reduce DNS traffic, proceed with one of the workaround options below.
Note: The DKVS service is used during vCenter restore from backup to provide more up-to-date recovery. If you take regular vCenter backups, disabling DKVS is generally safe. If DKVS is disabled and the vCenter backup differs from the current host inventory configuration (host membership, credentials, DVS state), vCenter recovery may cause host disconnects that require manual reconnection.
Part A: Disable DKVS on vCenter
SSH to the vCenter Server as root, then set the disable flag and restart the vpxd service:
/usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg set vpxd/clusterStore/globalDisable true
vmon-cli -r vpxd
Part B: Clear DKVS settings on each ESXi host
After disabling DKVS on vCenter, it may also be necessary to clear the DKVS settings on each ESXi host in the cluster. On each host, run:
/etc/init.d/clusterAgent stop
configstorecli files datafile delete -c esx -k cluster_agent_data
configstorecli files datadir delete -c esx -k cluster_agent_data
/etc/init.d/clusterAgent start
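For clusters with many hosts, the Part B commands can be wrapped in a loop. The sketch below is hypothetical: hosts.txt (one ESXi FQDN per line) and root SSH access to each host are assumptions, and RUN=echo keeps it a dry run that only prints the commands; unset RUN to actually execute them.

```shell
# Dry-run wrapper for the Part B cleanup across many hosts. hosts.txt and root
# SSH access are assumptions; RUN=echo prints the commands instead of running them.
RUN="${RUN:-echo}"
printf 'esx01.example.com\nesx02.example.com\n' > hosts.txt   # sample inventory
while read -r host; do
  $RUN ssh "root@${host}" \
    '/etc/init.d/clusterAgent stop &&
     configstorecli files datafile delete -c esx -k cluster_agent_data &&
     configstorecli files datadir delete -c esx -k cluster_agent_data'
done < hosts.txt | tee partb-dryrun.txt
```

Review partb-dryrun.txt before re-running without RUN=echo.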
Part C: Scripted alternative
As an alternative to performing Parts A and B manually, you can reduce the DNS query volume by using the attached Python script to disable DKVS on the vCenter where the affected ESXi hosts are connected.
Take a snapshot of the vCenter Server.
Download the attached dkvs-cleanup.py script.
SSH to the vCenter Server as root.
Upload the script to the vCenter Server.
Run the script:
python3 dkvs-cleanup.py -d disable -w all-soft -s restart
Note: The vpxd service will restart during script execution. Running tasks will stop and vCenter will be unavailable for approximately 2 minutes.
Verify DKVS is disabled by running:
/usr/lib/vmware-vpx/py/xmlcfg.py -f /etc/vmware-vpx/vpxd.cfg get vpxd/clusterStore/globalDisable
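What the get command reads is simply an XML element in vpxd.cfg. The snippet below illustrates this with a minimal stand-in file (the real file is /etc/vmware-vpx/vpxd.cfg on the vCenter Server and contains much more); python3's standard ElementTree module does the parsing.

```shell
# Illustrate what the xmlcfg.py "get" returns: the vpxd/clusterStore/globalDisable
# element. sample-vpxd.cfg is a minimal stand-in for /etc/vmware-vpx/vpxd.cfg.
cat > sample-vpxd.cfg <<'EOF'
<config>
  <vpxd>
    <clusterStore>
      <globalDisable>true</globalDisable>
    </clusterStore>
  </vpxd>
</config>
EOF
python3 -c '
import xml.etree.ElementTree as ET
root = ET.parse("sample-vpxd.cfg").getroot()
print(root.findtext("./vpxd/clusterStore/globalDisable"))
'
# prints: true
```

A value of true confirms the disable flag is set.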
Monitor DNS query volume to confirm the issue is resolved.
Once a future release fully addresses this issue, DKVS can be re-enabled using:
python3 dkvs-cleanup.py -d enable -w actions-soft -s restart