"Admission failure in path: host/vim/vmvisor/etcd:etcd", frequent Etcd crash on the ESXi host causing too many DNS queries

Article ID: 387913

Products

VMware vSphere ESXi

Issue/Introduction

  • A massive influx of DNS queries from one or more ESXi hosts is overloading the DNS server.
  • On the problematic host, the Etcd service keeps crashing as soon as it starts.
  • In the clusterAgent logs of the problematic host, we see the error "connection reset by peer":

    /var/run/log/clusterAgent.log

     No(5) clusterAgent[3931896]: INFO  Etcd client started watch       {"opID": "kvwatch-tlspeertrust", "cli": "0xc0001e81a0", "key": "root/tlspeertrust"}
     No(5) clusterAgent[3931896]: INFO  Etcd client started watch       {"opID": "kvwatch-votingmembersupdated", "cli": "0xc0001e81a0", "key": "root/votingmembersupdated"}
     No(5) clusterAgent[3931896]: WARN  grpc: addrConn.createTransport failed to connect to {ESXi-FQDN:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp ESXi-FQDN-IP:28383->ESXi-FQDN-IP:2379: read: connection reset by peer". Reconnecting...
     No(5) clusterAgent[3931896]: WARN  grpc: addrConn.createTransport failed to connect to {ESXi-FQDN:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp ESXi-FQDN-IP:36502->ESXi-FQDN-IP:2379: read: connection reset by peer". Reconnecting...

  • In watchdog.log, we see that the service keeps restarting.

    /var/run/log/watchdog.log

watchdog[XXXXXXX]: Started etcdmain with PID=3931911
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931929
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931943
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931969
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931985
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931999
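
To gauge how often the service is being restarted, the watchdog entries above can be tallied with a short script. This is a minimal sketch that assumes the log lines look exactly like the excerpt; `summarize_watchdog` is a hypothetical helper name, not a VMware tool.

```python
import re

# Patterns taken from the watchdog.log excerpt above.
START_RE = re.compile(r"Started etcdmain with PID=(\d+)")
RESTART_RE = re.compile(r"Restarting etcdmain")


def summarize_watchdog(lines):
    """Count etcdmain start and restart events in watchdog log lines."""
    pids = []
    restarts = 0
    for line in lines:
        match = START_RE.search(line)
        if match:
            pids.append(int(match.group(1)))
        elif RESTART_RE.search(line):
            restarts += 1
    return {
        "starts": len(pids),
        "restarts": restarts,
        "distinct_pids": len(set(pids)),
    }


sample = [
    "watchdog[XXXXXXX]: Started etcdmain with PID=3931911",
    "watchdog[XXXXXXX]: Restarting etcdmain",
    "watchdog[XXXXXXX]: Started etcdmain with PID=3931929",
]
print(summarize_watchdog(sample))
```

A steadily growing restart count with a new PID on every start matches the crash loop described in this article.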

  • We see the entries below in etcd.log:

    /var/run/log/etcd.log

In(6) etcd[XXXXXXX]: added member 9e8cfbf3dbf0e555 [https://ESXi-FQDN:2380] to cluster 2fbf9a482d65ed67
In(6) etcd[XXXXXXX]: starting peer 9e8cfbf3dbf0e555...
In(6) etcd[XXXXXXX]: started HTTP pipelining with peer 9e8cfbf3dbf0e555

In(6) etcd[XXXXXXX]: started streaming with peer 9e8cfbf3dbf0e555 (writer)
In(6) etcd[XXXXXXX]: removed member 9e8cfbf3dbf0e555 from cluster 2fbf9a482d65ed67
In(6) etcd[XXXXXXX]: stopping peer 9e8cfbf3dbf0e555...
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (writer)
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (writer)
In(6) etcd[XXXXXXX]: started streaming with peer 9e8cfbf3dbf0e555 (stream MsgApp v2 reader)
In(6) etcd[XXXXXXX]: stopped HTTP pipelining with peer 9e8cfbf3dbf0e555
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (stream MsgApp v2 reader)
In(6) etcd[XXXXXXX]: started streaming with peer 9e8cfbf3dbf0e555 (stream Message reader)
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (stream Message reader)
In(6) etcd[XXXXXXX]: stopped peer 9e8cfbf3dbf0e555
In(6) etcd[XXXXXXX]: removed peer 9e8cfbf3dbf0e555

  • At the same time, we see "Admission failure in path: host/vim/vmvisor/etcd:etcd" in vmkernel.log:

    /var/run/log/vmkernel.log

vmkernel: cpu12:3932082)Admission failure in path: host/vim/vmvisor/etcd:etcd.3932076:uw.3932076
vmkernel: cpu12:3932082)UserWorld 'etcd' 3932076 with cmdline '/usr/lib/vmware/etcd/bin/etcd --config-file=/var/cache/datafiles/esx#cluster_agent_data/etcd.yml', parent 2097917
vmkernel: cpu12:3932082)started from 'init' 2097917 with cmdline '/bin/init', parent 0
vmkernel: cpu12:3932082)uw.3932076 (10380427) requires 4096 KB, asked 4096 KB from etcd (6977) which has 193788 KB occupied and 2820 KB available.
vmkernel: cpu84:3932095)Admission failure in path: host/vim/vmvisor/etcd:etcd.3932093:uw.3932093
vmkernel: cpu84:3932095)UserWorld 'etcd' 3932093 with cmdline '/usr/lib/vmware/etcd/bin/etcd --config-file=/var/cache/datafiles/esx#cluster_agent_data/etcd.yml', parent 2097917
vmkernel: cpu84:3932095)started from 'init' 2097917 with cmdline '/bin/init', parent 0
vmkernel: cpu84:3932095)uw.3932093 (10380454) requires 4096 KB, asked 4096 KB from etcd (6977) which has 192872 KB occupied and 3736 KB available
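
The memory figures in these admission-failure lines show why each allocation is refused: the request is larger than what the etcd group has left. The following is a minimal sketch for extracting those figures, assuming the line format shown in the excerpt; `parse_admission` and the field names are hypothetical and chosen for illustration.

```python
import re

# Matches the "requires ... asked ... occupied ... available" line from vmkernel.log.
ADMISSION_RE = re.compile(
    r"requires (\d+) KB, asked (\d+) KB from (\S+) \((\d+)\) "
    r"which has (\d+) KB occupied and (\d+) KB available"
)


def parse_admission(line):
    """Return the memory figures from an admission-failure line, or None."""
    match = ADMISSION_RE.search(line)
    if match is None:
        return None
    required, asked, group, _group_id, occupied, available = match.groups()
    return {
        "group": group,
        "required_kb": int(required),
        "asked_kb": int(asked),
        "occupied_kb": int(occupied),
        "available_kb": int(available),
        # Positive shortfall means the request cannot be admitted.
        "shortfall_kb": int(asked) - int(available),
    }


line = (
    "vmkernel: cpu12:3932082)uw.3932076 (10380427) requires 4096 KB, "
    "asked 4096 KB from etcd (6977) which has 193788 KB occupied and 2820 KB available."
)
info = parse_admission(line)
print(info["shortfall_kb"])  # 4096 - 2820 = 1276
```

In both sample lines, occupied plus available adds up to 196608 KB (192 MiB). That is consistent with a fixed memory limit on the etcd group, although the article does not state the limit explicitly.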

  • The cluster status shows "reachable": false for the problematic ESXi host.

/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status

"state": "hosted"
"cluster_id": "ebbbcf4f-8eae-4fe8-85e8-d197a4ffe1c7: domain-c952432",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": true,
"members": {
"available": true
},
"namespaces": [
{
"name": "root",
"up_to_date": true,
"members": [
"peer_address": "ESXi1:2380",
"api_address": "ESXi1:2379",
"reachable": true,
"primary": "yes",
"learner": false
},
{
"peer_address": "ESXi2:2380",
"api_address": "ESXi2:2379",
"reachable": true,
"primary": "no",
"learner": false
},
{
"peer_address": "ESXi3:2388",
"api_address": "ESXi3:2379",
"reachable": false,
"primary": "unknown",
"learner": false
}
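
When the cluster has many members, the unreachable ones can be picked out of the status text programmatically. This is a rough sketch that assumes the output resembles the abridged excerpt above, which is not valid JSON as shown, so it uses a regular expression rather than a JSON parser. `unreachable_peers` is a hypothetical helper; if the full command output on your host is valid JSON, parsing it with `json.loads` would be the more robust choice.

```python
import re

# Pair each "peer_address" with the first "reachable" value that follows it.
MEMBER_RE = re.compile(
    r'"peer_address":\s*"([^"]+)".*?"reachable":\s*(true|false)',
    flags=re.S,
)


def unreachable_peers(status_text):
    """Return peer addresses whose reachable flag is false."""
    return [
        peer
        for peer, reachable in MEMBER_RE.findall(status_text)
        if reachable == "false"
    ]


status = '''
"peer_address": "ESXi1:2380",
"api_address": "ESXi1:2379",
"reachable": true,
"peer_address": "ESXi3:2388",
"api_address": "ESXi3:2379",
"reachable": false,
'''
print(unreachable_peers(status))
```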

  • This also applies to scenarios where you see multiple out-of-memory alerts on the host:

/var/run/log/vmkwarning.log

YYYY-MM-DDT06:32:00.578Z XXXXXX vmkwarning: cpu15:23788530)WARNING: World: 3234: Could not allocate new world handle for world ID: 23801192: Out of memory

Environment

vSphere ESXi 8.0U3f or earlier

vSphere ESXi 9.0

Cause

The Etcd service runs out of memory in its resource group (host/vim/vmvisor/etcd) and keeps crashing as soon as it starts.

Resolution

This is a known issue affecting ESXi 8.0 Update 3f and earlier, and ESXi 9.0. It is resolved in ESXi 8.0U3g and in ESXi 9.0.1.

Workaround:

To work around this issue, restart the ESXi services using the KB below: