Symptoms
When running deploy.sh to restart the service pods, the process fails with an error similar to the following:

Exit code of install/update of release catalog-service is 1
+ return 1
+ on_exit
+ '[' 123 -ne 0 ']'
+ echo 'Deployment failed. Collecting log bundle ...'
Deployment failed. Collecting log bundle ...
When rebooting the appliance, the console boot screen shows the following error during startup:

Failed to start kube-apiserver.service.
Running kubectl get pods -n prelude returns "No resources found in the prelude namespace".
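To confirm that the Kubernetes control plane itself (rather than only the Prelude services) is unhealthy, check the kube-system pods and the relevant service journals. This is a minimal sketch; the etcd and kube-apiserver unit names are taken from the boot error and log excerpts in this article and may differ on a given appliance build:

kubectl get pods -n kube-system
systemctl status kube-apiserver
journalctl -u etcd --no-pager | tail -n 50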
Environment
Aria Automation 8.18.x
Cause
etcd instability causes the Kubernetes clustering services to crash.
This is usually due to underlying disk latency or networking issues between the cluster nodes.
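Two quick probes can help narrow down which of the two is at fault. This is a sketch only: the etcd unit name and the "took too long" message string come from the log excerpts below, and <peer FQDN> is a placeholder for the other cluster nodes:

# Basic reachability between nodes
ping -c 3 <peer FQDN>
# Count slow-disk warnings in the etcd journal
journalctl -u etcd --no-pager | grep -c 'took too long'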
The logs show frequent etcd restarts as well as network error messages such as:

Oct 01 23:08:28 <vRA FQDN> etcd[###]: health check for peer <UID> could not connect: dial tcp <IP Address>: connect: no route to host
When etcd remains unhealthy for an extended period, systemd restarts it. Health-check failures look like:

Oct 01 23:09:36 <vRA FQDN> etcd[###]: /health error; no leader (status code 503)
Additionally, etcd starts timing out while contacting its peers, and disk operations become too slow:

Sep 30 17:12:10 <vRA FQDN> etcd[#######]: rejected connection from "<IP Address>" (error "read tcp <IP Address>-><IP Address>: i/o timeout", ServerName "")
Sep 30 17:11:54 <vRA FQDN> etcd[#######]: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" limit:500 " with result "range_response_count:196 size:2124781" took too long (175.29122ms) to execute
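A quick way to scan the journal for all of these signatures at once is shown below. The match strings are taken directly from the log excerpts above; adjust the time window as needed:

journalctl -u etcd --no-pager --since "2 hours ago" | grep -E "no route to host|i/o timeout|took too long|no leader"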
Resolution
Perform the following steps to restore the kube-system pods so that the appliance fully boots and starts all service pods. A verification sketch follows the steps.
1. Take a snapshot of the appliance to have a valid restore point.
2. Shut down the service pods: /opt/scripts/deploy.sh --shutdown
3. Stop the kubelet and docker services on each node: service kubelet stop && service docker stop
4. Start the docker and kubelet services on each node: service docker start && service kubelet start
5. Delete the kube-system pods on each node: kubectl delete pod -n kube-system --all
6. Perform a power off / power on of the cluster nodes from vRSLCM.
7. If the deployment of the service pods in the Prelude namespace fails during the power-on action from vRSLCM, start the service pods manually: /opt/scripts/deploy.sh
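To verify the recovery, watch the pods come back up. The kube-system pods should return to Running first, followed by the service pods in the prelude namespace (the -w flag keeps watching for changes):

kubectl get pods -n kube-system
kubectl get pods -n prelude -w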