Aria Automation 8.x: systemd services for Kubernetes, "etcd", "kube-apiserver" and "kubelet", will not start, meaning the cluster cannot boot up.

Article ID: 380701

Updated On:

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • etcd.service & kube-apiserver.service will not start, even if restarted with systemctl. Repeated failure messages may be seen in the tty (VM console).

  • As a result, kubelet cannot connect to the Kubernetes cluster (it will also fail if restarted) and deploy.sh cannot run.
  • The docker service is running; however, listing the running containers with docker ps yields no results.
  • The systemd journal may show the following errors from the etcd service:
    • failed to detect default host (could not find default route)
    • the server is already initialized as member before, starting as etcd member...
    • /health error; no leader (status code 503)
    • curl: (22) The requested URL returned error: 503
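
The service and container state can be confirmed from the VM console with standard systemd and Docker commands, for example:

# Check the state of the Kubernetes-related services named above
systemctl status etcd.service kube-apiserver.service kubelet.service

# Confirm docker itself is active, then list running containers (none are expected in this state)
systemctl is-active docker
docker ps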

Environment

VMware Aria Automation 8.x

Cause

This appears to occur in clustered environments where the former etcd leader is offline and a new leader election cannot take place.
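
As a quick check, the etcd health endpoint that produces the 503 responses below can be queried directly. This is a sketch: the client port 2379 and the use of HTTPS with certificate checks disabled (-k) are assumptions and may differ in a given deployment.

# Query the local etcd member's health endpoint
curl -k https://localhost:2379/health
# A member with a leader reports {"health":"true"}; "no leader" yields HTTP 503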

This may be seen in the systemd journal for etcd using the journalctl command:

Jan 01 08:54:32 vranode.example.com etcd[721]: failed to detect default host (could not find default route)
Jan 01 08:54:32 vranode.example.com etcd[721]: the server is already initialized as member before, starting as etcd member...

If the leader is inaccessible, the other nodes can get stuck in this loop, waiting:

Jan 01 08:54:34 vranode.example.com etcd[721]: /health error; no leader (status code 503)
Jan 01 08:54:34 vranode.example.com etcd[721]: curl: (22) The requested URL returned error: 503
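
Excerpts such as the above can be gathered with journalctl, for example:

# Show the most recent etcd log entries (adjust -n as needed)
journalctl -u etcd.service --no-pager -n 200

# Follow the etcd log live, e.g. while restarting the service
journalctl -u etcd.service -f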

The currently-offline node may log the following error in its journal, indicating that the VM has no connected NIC, which the guest OS cannot correct on its own:

unexpected command output Device "eth0" does not exist
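
Whether the guest actually sees the NIC can be verified before assuming a vSphere-side cause (eth0 is the interface name from the error above):

# List interfaces; eth0 should be present and in state UP
ip link show

# The appliance's Photon OS uses systemd-networkd; check its view of the interface
networkctl status eth0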

Resolution

This issue has been seen to occur at the VM hardware level (i.e. the VM's virtual hardware configuration in vSphere).

The following workaround steps may help to resolve the issue:

  1. Check the network connection to and from each of the Automation nodes.
    • If network traffic such as pings, ssh connections or other test connections cannot get through, access must be restored at this level first.
    • Command to test the connection over a given port number (see also the connectivity-check example after this list):  curl -kv telnet://<NODE_FQDN>:<PORT_NUMBER>
  2. If the above tests fail, try any of the following steps to restore the network connection to the VM in vSphere:
    • Power off the node fully, wait at least a few seconds, and then boot it back up.
    • While the node is powered off (as above), toggle the NIC's "Connect At Power On" setting, leave it enabled, and click SAVE.
    • vMotion the VM to a different ESXi host (and optionally move it back, as desired).
    • If the following (0 byte) file exists on the filesystem, try deleting it before the next reboot attempt:
      • rm -f /var/vmware/prelude/docker/last-cleanup
  3. If network connectivity is restored to all nodes but etcd.service & kube-apiserver.service still do not start, try the following alternative workarounds:
    • Ensure the /etc/hosts file on all nodes contains the usual 127.0.0.1 entries for localhost and vra-k8s.local (a sample is shown after this list); copy these from another node if necessary, as in KB 314799.
    • Remove the known_hosts and authorized_keys files in the /home/root/.ssh directory; steps as in KB 326063.
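
For step 1, a loop such as the following can test reachability from one node to its peers. This is a sketch: the <NODE_FQDN> placeholders and the port number must be substituted for your environment.

# Test ICMP reachability and a TCP port from this node to each peer
for node in <NODE1_FQDN> <NODE2_FQDN>; do
    ping -c 2 "$node"
    curl -kv "telnet://$node:<PORT_NUMBER>"
done

For step 3, the expected /etc/hosts entries resemble the following; copy the exact lines from a healthy node rather than retyping them:

127.0.0.1   localhost
127.0.0.1   vra-k8s.local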