Aria Automation Appliance fails to boot with error: Failed to start kube-apiserver.service.

Article ID: 416509

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • When running /opt/scripts/deploy.sh to restart the service pods, the process fails with the following error:
    Exit code of install/update of release catalog-service is 1
    + return 1
    + on_exit
    + '[' 123 -ne 0 ']'
    + echo 'Deployment failed. Collecting log bundle ...'
    Deployment failed. Collecting log bundle ...
  • When rebooting the appliance, upon startup the boot screen on the console shows the following error: Failed to start kube-apiserver.service.
  • When running the command kubectl get pods -n prelude, it returns "No resources found in the prelude namespace"
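The empty-namespace symptom above can be checked from an SSH session on the appliance. The helper below is a sketch: check_prelude is a name introduced here (not an appliance command), and it simply reads kubectl output on stdin and looks for the message quoted in this article.

```shell
#!/usr/bin/env bash
# Sketch: detect the "empty prelude namespace" symptom described above.
# check_prelude is a hypothetical helper, not part of the product; it
# reads the output of `kubectl get pods -n prelude` on stdin.

check_prelude() {
  if grep -q 'No resources found in the prelude namespace'; then
    echo "SYMPTOM: prelude namespace is empty -- service pods are not running"
  else
    echo "OK: prelude namespace reports pods"
  fi
}
```

On the appliance, usage would look like: kubectl get pods -n prelude 2>&1 | check_prelude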

Environment

Aria Automation 8.18.x

Cause

  • etcd instability causes the Kubernetes clustering services to crash.
  • This is usually due to disk or networking issues.
  • The logs show frequent etcd restarts along with network error messages such as:
    Oct 01 23:08:28 <vRA FQDN> etcd[###]: health check for peer <UID> could not connect: dial tcp <IP Address>: connect: no route to host
  • When etcd remains not-ready for a while, systemd restarts it. Health-check failures look like:
    Oct 01 23:09:36 <vRA FQDN> etcd[###]: /health error; no leader (status code 503)

  • Additionally, etcd starts timing out when accessing peers, and disk operations become too slow:
    Sep 30 17:12:10 <vRA FQDN> etcd[#######]: rejected connection from "<IP Address>" (error "read tcp <IP Address>-><IP Address>: i/o timeout", ServerName "") 

    Sep 30 17:11:54 <vRA FQDN> etcd[#######]: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "range_response_count:196 size:2124781" took too long (175.29122ms) to execute 

    Sep 30 17:12:11 <vRA FQDN> etcd[#######]: /health error; QGET failed etcdserver: request timed out (status code 503) 
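The etcd failure signatures quoted above can be pulled out of the journal with a simple filter. This is a sketch: etcd_errs is a name introduced here, and the grep patterns are taken directly from the log excerpts in this article.

```shell
#!/usr/bin/env bash
# Sketch: filter log lines for the etcd failure signatures quoted above.
# etcd_errs is a hypothetical helper name; the patterns come from the
# log excerpts in this article. It reads log lines on stdin.

etcd_errs() {
  grep -E 'no route to host|no leader|i/o timeout|took too long|request timed out'
}
```

On the appliance, something like journalctl -u etcd --no-pager | etcd_errs | tail -n 50 would show the most recent matches.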

Resolution

Perform the following steps to restore the kube-system pods so that the appliance fully boots and starts all service pods.

  • Take a snapshot of the appliance to have a valid restore point
  • Shut down the service pods with /opt/scripts/deploy.sh --shutdown
  • Stop the kubelet and docker services on each node with the command: service kubelet stop && service docker stop
  • Start the docker and kubelet services on each node with the command: service docker start && service kubelet start
  • Delete the kube-system pods on each node with the command: kubectl delete pod -n kube-system --all
  • Perform a Power OFF/Power ON of the cluster/nodes from vRSLCM
  • If the deployment of the service pods in the prelude namespace fails through the power-on action from vRSLCM, start the service pods manually with the command: /opt/scripts/deploy.sh
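The per-node part of the sequence above can be sketched as a single script. This is an illustration, not a supported tool: the run() wrapper and the DRY_RUN variable are introduced here for safety, and by default the script only prints the commands so they can be reviewed before running it for real on each node.

```shell
#!/usr/bin/env bash
# Sketch of the recovery sequence above. The run() wrapper and DRY_RUN
# flag are introduced here; with DRY_RUN=1 (the default) commands are
# only printed, not executed. Take a snapshot before a real run.
set -u

DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# 1. Shut down the service pods
run /opt/scripts/deploy.sh --shutdown

# 2. Restart the container runtime and kubelet (repeat on each node)
run service kubelet stop
run service docker stop
run service docker start
run service kubelet start

# 3. Recreate the kube-system pods (repeat on each node)
run kubectl delete pod -n kube-system --all
```

After this, power the cluster off and on from vRSLCM as described above, and run /opt/scripts/deploy.sh manually only if the prelude pods do not come back on their own.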