Aria Automation 8.x: systemd services for Kubernetes, "etcd", "kube-apiserver" and "kubelet", will not start, meaning the cluster cannot boot up.

Article ID: 380701

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • etcd.service & kube-apiserver.service will not start, even if restarted with systemctl. Repeated failure messages may be seen in the tty (VM console).

  • As a result, kubelet cannot connect to the Kubernetes cluster (it will also fail if restarted) and deploy.sh cannot run.
  • The docker service is running; however, listing the running containers with docker ps yields no results.
  • The systemd journal may show the following errors from the etcd service:
    • failed to detect default host (could not find default route)
    • the server is already initialized as member before, starting as etcd member...
    • /health error; no leader (status code 503)
    • curl: (22) The requested URL returned error: 503
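
The service and container state can be confirmed from the VM console with standard systemd and Docker commands, for example:

# Check the state of the Kubernetes-related services named above
systemctl status etcd.service kube-apiserver.service kubelet.service

# Confirm docker itself is active, then list running containers (none are expected in this state)
systemctl is-active docker
docker ps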

Environment

VMware Aria Automation 8.x

Cause

This appears to occur in clustered environments where the former etcd leader is offline and a new leader election cannot take place.
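
As a quick check, the etcd health endpoint that produces the 503 responses below can be queried directly. This is a sketch: the client port 2379 and the use of HTTPS with certificate checks disabled (-k) are assumptions and may differ in a given deployment.

# Query the local etcd member's health endpoint
curl -k https://localhost:2379/health
# A member with a leader reports {"health":"true"}; "no leader" yields HTTP 503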

This may be seen in the systemd journal for etcd using the journalctl command:

Jan 01 08:54:32 vranode.example.com etcd[721]: failed to detect default host (could not find default route)
Jan 01 08:54:32 vranode.example.com etcd[721]: the server is already initialized as member before, starting as etcd member...

If the leader is inaccessible, the other nodes can get stuck in this loop, waiting:

Jan 01 08:54:34 vranode.example.com etcd[721]: /health error; no leader (status code 503)
Jan 01 08:54:34 vranode.example.com etcd[721]: curl: (22) The requested URL returned error: 503
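
Excerpts such as the above can be gathered with journalctl, for example:

# Show the most recent etcd log entries (adjust -n as needed)
journalctl -u etcd.service --no-pager -n 200

# Follow the etcd log live, e.g. while restarting the service
journalctl -u etcd.service -f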

The currently-offline node may log the following error in its journal, indicating that the VM has no connected NIC, which the guest OS cannot correct on its own:

unexpected command output Device "eth0" does not exist
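
Whether the guest actually sees the NIC can be verified before assuming a vSphere-side cause (eth0 is the interface name from the error above):

# List interfaces; eth0 should be present and in state UP
ip link show

# The appliance's Photon OS uses systemd-networkd; check its view of the interface
networkctl status eth0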

Resolution

This issue has been seen to occur at the VM hardware level (i.e. the VM's virtual hardware configuration in vSphere).

The following workaround steps may help to resolve the issue:

  1. Check the network connection to and from each of the Automation nodes.
    • If network traffic such as pings, ssh connections or other test connections cannot get through, access must be restored at this level first.
    • Command to test the connection over a given port number (see also the connectivity-check example after this list):  curl -kv telnet://<NODE_FQDN>:<PORT_NUMBER>
  2. If the above tests fail, try any of the following steps to restore the network connection to the VM in vSphere:
    • Power off the node fully, wait at least a few seconds, and then boot it back up.
    • While the node is powered off (as above), toggle the NIC's "Connect At Power On" setting, leave it enabled, and click SAVE.
    • vMotion the VM to a different ESXi host (and optionally move it back, as desired).
    • If the following (0 byte) file exists on the filesystem, try deleting it before the next reboot attempt:
      • rm -f /var/vmware/prelude/docker/last-cleanup
  3. If network connectivity is restored to all nodes but etcd.service & kube-apiserver.service still do not start, try the following alternative workarounds:
    • Ensure the /etc/hosts file on all nodes contains the usual 127.0.0.1 entries for localhost and vra-k8s.local (a sample is shown after this list); copy these from another node if necessary, as in KB 314799.
    • Remove the known_hosts and authorized_keys files in the /home/root/.ssh directory; steps as in KB 326063.
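
For step 1, a loop such as the following can test reachability from one node to its peers. This is a sketch: the <NODE_FQDN> placeholders and the port number must be substituted for your environment.

# Test ICMP reachability and a TCP port from this node to each peer
for node in <NODE1_FQDN> <NODE2_FQDN>; do
    ping -c 2 "$node"
    curl -kv "telnet://$node:<PORT_NUMBER>"
done

For step 3, the expected /etc/hosts entries resemble the following; copy the exact lines from a healthy node rather than retyping them:

127.0.0.1   localhost
127.0.0.1   vra-k8s.local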