Etcd service keeps crashing as soon as it starts.
In the clusterAgent logs of the problematic host, we see the error "connection reset by peer":
/var/run/log/clusterAgent.log
No(5) clusterAgent[3931896]: INFO Etcd client started watch {"opID": "kvwatch-tlspeertrust", "cli": "0xc0001e81a0", "key": "root/tlspeertrust"}
No(5) clusterAgent[3931896]: INFO Etcd client started watch {"opID": "kvwatch-votingmembersupdated", "cli": "0xc0001e81a0", "key": "root/votingmembersupdated"}
No(5) clusterAgent[3931896]: WARN grpc: addrConn.createTransport failed to connect to {ESXi-FQDN:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp ESXi-FQDN-IP:28383->ESXi-FQDN-IP:2379: read: connection reset by peer". Reconnecting...
No(5) clusterAgent[3931896]: WARN grpc: addrConn.createTransport failed to connect to {ESXi-FQDN:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: read tcp ESXi-FQDN-IP:36502->ESXi-FQDN-IP:2379: read: connection reset by peer". Reconnecting...
In watchdog.log, we see that the service keeps restarting:
/var/run/log/watchdog.log
watchdog[XXXXXXX]: Started etcdmain with PID=3931911
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931929
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931943
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931969
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931985
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931999
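To gauge how severe the restart loop is, the watchdog messages can be counted per service. The following is a minimal Python sketch, assuming only the two message shapes shown in the excerpt above ("Started <service> with PID=<n>" and "Restarting <service>"); the function name summarize_watchdog and the sample variable are illustrative and not part of any VMware tool.

```python
import re

# These patterns mirror only the two message shapes visible in the
# watchdog.log excerpt; other watchdog messages are ignored.
START_RE = re.compile(r"Started (\S+) with PID=(\d+)")
RESTART_RE = re.compile(r"Restarting (\S+)")


def summarize_watchdog(lines, service="etcdmain"):
    """Return (restart_count, list_of_pids) for the given service."""
    restarts = 0
    pids = []
    for line in lines:
        started = START_RE.search(line)
        if started and started.group(1) == service:
            pids.append(int(started.group(2)))
            continue
        restarted = RESTART_RE.search(line)
        if restarted and restarted.group(1) == service:
            restarts += 1
    return restarts, pids


# Sample taken from the excerpt above (first three start/restart pairs).
sample = """\
watchdog[XXXXXXX]: Started etcdmain with PID=3931911
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931929
watchdog[XXXXXXX]: Restarting etcdmain
watchdog[XXXXXXX]: Started etcdmain with PID=3931943
watchdog[XXXXXXX]: Restarting etcdmain
"""

restart_count, seen_pids = summarize_watchdog(sample.splitlines())
print(f"etcdmain restarts: {restart_count}, distinct PIDs: {len(set(seen_pids))}")
```

A new PID on every "Started" line, each followed by a "Restarting" line, is consistent with the process exiting shortly after launch rather than being restarted once.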
In etcd.log, the peer is added to the cluster and then removed again:
/var/run/log/etcd.log
In(6) etcd[XXXXXXX]: added member 9e8cfbf3dbf0e555 [https://ESXi-FQDN:2380] to cluster 2fbf9a482d65ed67
In(6) etcd[XXXXXXX]: starting peer 9e8cfbf3dbf0e555...
In(6) etcd[XXXXXXX]: started HTTP pipelining with peer 9e8cfbf3dbf0e555
In(6) etcd[XXXXXXX]: started streaming with peer 9e8cfbf3dbf0e555 (writer)
In(6) etcd[XXXXXXX]: removed member 9e8cfbf3dbf0e555 from cluster 2fbf9a482d65ed67
In(6) etcd[XXXXXXX]: stopping peer 9e8cfbf3dbf0e555...
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (writer)
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (writer)
In(6) etcd[XXXXXXX]: started streaming with peer 9e8cfbf3dbf0e555 (stream MsgApp v2 reader)
In(6) etcd[XXXXXXX]: stopped HTTP pipelining with peer 9e8cfbf3dbf0e555
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (stream MsgApp v2 reader)
In(6) etcd[XXXXXXX]: started streaming with peer 9e8cfbf3dbf0e555 (stream Message reader)
In(6) etcd[XXXXXXX]: stopped streaming with peer 9e8cfbf3dbf0e555 (stream Message reader)
In(6) etcd[XXXXXXX]: stopped peer 9e8cfbf3dbf0e555
In(6) etcd[XXXXXXX]: removed peer 9e8cfbf3dbf0e555
We also see "Admission failure in path: host/vim/vmvisor/etcd:etcd" in vmkernel.log:
/var/run/log/vmkernel.log
vmkernel: cpu12:3932082)Admission failure in path: host/vim/vmvisor/etcd:etcd.3932076:uw.3932076
vmkernel: cpu12:3932082)UserWorld 'etcd' 3932076 with cmdline '/usr/lib/vmware/etcd/bin/etcd --config-file=/var/cache/datafiles/esx#cluster_agent_data/etcd.yml', parent 2097917
vmkernel: cpu12:3932082)started from 'init' 2097917 with cmdline '/bin/init', parent 0
vmkernel: cpu12:3932082)uw.3932076 (10380427) requires 4096 KB, asked 4096 KB from etcd (6977) which has 193788 KB occupied and 2820 KB available.
vmkernel: cpu84:3932095)Admission failure in path: host/vim/vmvisor/etcd:etcd.3932093:uw.3932093
vmkernel: cpu84:3932095)UserWorld 'etcd' 3932093 with cmdline '/usr/lib/vmware/etcd/bin/etcd --config-file=/var/cache/datafiles/esx#cluster_agent_data/etcd.yml', parent 2097917
vmkernel: cpu84:3932095)started from 'init' 2097917 with cmdline '/bin/init', parent 0
vmkernel: cpu84:3932095)uw.3932093 (10380454) requires 4096 KB, asked 4096 KB from etcd (6977) which has 192872 KB occupied and 3736 KB available
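The numbers in the admission failure messages can be worked through directly: each new etcd world asks for 4096 KB, but the etcd memory group reports less than that as available. The sketch below, in Python, extracts those figures and computes the shortfall. The regular expression is written against the message text shown above only; the function name parse_admission_failures is illustrative, and reading "occupied + available" as the group's total size is an assumption, not something the log states.

```python
import re

# Matches the "requires ... asked ... occupied ... available" message
# exactly as it appears in the vmkernel.log excerpt.
ADMISSION_RE = re.compile(
    r"requires (?P<requires>\d+) KB, asked (?P<asked>\d+) KB from "
    r"(?P<pool>\S+) \(\d+\) which has (?P<occupied>\d+) KB occupied "
    r"and (?P<available>\d+) KB available"
)


def parse_admission_failures(text):
    """Return one dict per admission failure message found in text."""
    results = []
    for match in ADMISSION_RE.finditer(text):
        requires = int(match["requires"])
        occupied = int(match["occupied"])
        available = int(match["available"])
        results.append({
            "pool": match["pool"],
            "requires_kb": requires,
            "occupied_kb": occupied,
            "available_kb": available,
            # How much memory the request is short by.
            "shortfall_kb": requires - available,
            # Assumption: this sum approximates the group's total size.
            "occupied_plus_available_kb": occupied + available,
        })
    return results


sample = (
    "vmkernel: cpu12:3932082)uw.3932076 (10380427) requires 4096 KB, "
    "asked 4096 KB from etcd (6977) which has 193788 KB occupied and "
    "2820 KB available.\n"
    "vmkernel: cpu84:3932095)uw.3932093 (10380454) requires 4096 KB, "
    "asked 4096 KB from etcd (6977) which has 192872 KB occupied and "
    "3736 KB available\n"
)

for failure in parse_admission_failures(sample):
    print(failure)
```

For the two messages above, the shortfalls are 1276 KB and 360 KB, and occupied plus available comes to 196608 KB (192 MB) in both cases, which matches the picture of etcd filling its memory group and every restart attempt being refused.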
The output of the following command shows the affected host as unreachable:
/usr/lib/vmware/clusterAgent/bin/clusterAdmin cluster status
"state": "hosted",
"cluster_id": "ebbbcf4f-8eae-4fe8-85e8-d197a4ffe1c7: domain-c952432",
"is_in_alarm": false,
"alarm_cause": "",
"is_in_cluster": true,
"members": {"available": true},
"namespaces": [
  {
    "name": "root",
    "up_to_date": true,
    "members": [
      {
        "peer_address": "ESXi1:2380",
        "api_address": "ESXi1:2379",
        "reachable": true,
        "primary": "yes",
        "learner": false
      },
      {
        "peer_address": "ESXi2:2380",
        "api_address": "ESXi2:2379",
        "reachable": true,
        "primary": "no",
        "learner": false
      },
      {
        "peer_address": "ESXi3:2388",
        "api_address": "ESXi3:2379",
        "reachable": false,
        "primary": "unknown",
        "learner": false
      }
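On clusters with many hosts, the unreachable member can be picked out programmatically. The sketch below is in Python and assumes the command output is a complete JSON object with the field names visible in the excerpt above (namespaces, members, peer_address, reachable). The excerpt is truncated, so the sample document here is a hypothetical completion for illustration only, and unreachable_members is an illustrative name.

```python
import json


def unreachable_members(status):
    """Return (namespace, peer_address) for members with reachable == false."""
    found = []
    for namespace in status.get("namespaces", []):
        for member in namespace.get("members", []):
            if not member.get("reachable", False):
                found.append((namespace.get("name"), member.get("peer_address")))
    return found


# Hypothetical, completed version of the truncated output shown above.
sample = json.loads("""
{
  "state": "hosted",
  "is_in_cluster": true,
  "members": {"available": true},
  "namespaces": [
    {
      "name": "root",
      "up_to_date": true,
      "members": [
        {"peer_address": "ESXi1:2380", "api_address": "ESXi1:2379",
         "reachable": true, "primary": "yes", "learner": false},
        {"peer_address": "ESXi2:2380", "api_address": "ESXi2:2379",
         "reachable": true, "primary": "no", "learner": false},
        {"peer_address": "ESXi3:2388", "api_address": "ESXi3:2379",
         "reachable": false, "primary": "unknown", "learner": false}
      ]
    }
  ]
}
""")

for namespace_name, peer in unreachable_members(sample):
    print(f"namespace {namespace_name}: {peer} is not reachable")
```

If the real output differs from this assumed shape, the field lookups would need adjusting; the .get() calls are there so that missing keys yield an empty result instead of an exception.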
In /var/run/log/vmkwarning.log, we see an out-of-memory warning:
YYYY-MM-DDT06:32:00.578Z XXXXXX vmkwarning: cpu15:23788530)WARNING: World: 3234: Could not allocate new world handle for world ID: 23801192: Out of memory
vSphere ESXi 8.0U3f or earlier
vSphere ESXi 9.0
Etcd service runs out of memory and keeps crashing.
This is a known issue affecting ESXi 8.0 Update 3f and earlier, and ESXi 9.0. It is resolved in ESXi 8.0U3g and in ESXi 9.0.1.
Workaround:
To work around this issue, restart the ESXi services by following the KB below: