A TKGi cluster upgrade can sometimes get stuck in a worker node's drain/pre-stop phase, with the BOSH task output showing:
Task <task-id> | 15:42:13 | L executing pre-stop: worker/<worker-id> (0) (canary)
Checking the node will show that it is cordoned and being drained, with the status "Ready,SchedulingDisabled":
# kubectl get node
NAME          STATUS                     ROLES    AGE   VERSION
<node-name>   Ready,SchedulingDisabled   <none>   45h   v1.29.6+vmware.1
Some errors that can be found in the node's "/var/vcap/sys/log/kubelet/drain.stderr.log" are:
Cannot evict pod as it would violate the pod's disruption budget
or
Kill container failed. context deadline exceeded
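If needed, these errors can be inspected directly on the stuck worker node; a minimal sketch, assuming the BOSH CLI is targeted at the environment that manages the cluster's service-instance deployment:
# bosh -d <service-instance_id> ssh <worker-node_id>
# tail -n 100 /var/vcap/sys/log/kubelet/drain.stderr.log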
This article outlines some troubleshooting steps to take to identify and resolve the issue.
There are many possible reasons why an upgrade can get stuck in a worker node's drain/pre-stop phase. Some of them include:
- A PodDisruptionBudget (PDB) that currently allows no disruptions, so the pods it protects cannot be evicted from the node ("Cannot evict pod as it would violate the pod's disruption budget").
- Containers that fail to terminate in time, causing the drain to time out ("Kill container failed. context deadline exceeded").
The steps below address each of these cases.
First, SSH into the stuck worker node:
# bosh -d <service-instance_id> ssh <worker-node_id>
If the drain log shows the "disruption budget" error, list the PodDisruptionBudgets (PDBs) in the cluster and identify any that allow 0 disruptions:
# kubectl get pdb -A
NAME         MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
<pdb-name>   2               N/A               0                     7s
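To understand why a PDB reports 0 allowed disruptions, compare its minAvailable/maxUnavailable setting with the number of healthy pods it currently selects; for example, if minAvailable equals the number of running replicas, no pod can be evicted. A hedged example, reusing the placeholders above:
# kubectl describe pdb <pdb-name> -n <namespace>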
List the pods still running on the draining node and identify the ones covered by that PDB:
# kubectl get po -A -owide | grep <node-name-from-kubectl-get-node-command>
Force delete the pod that cannot be evicted:
# kubectl delete po <pod-name> -n <namespace> --force
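Depending on the workload, it may be preferable to avoid force deleting the pod and instead make the PDB satisfiable, for example by temporarily increasing the replica count of the owning workload so that an eviction is allowed. This is only an illustrative sketch; the deployment name and replica count below are placeholders, not taken from the original steps:
# kubectl scale deployment <deployment-name> -n <namespace> --replicas=3
Once the extra replica is Running on another node, the drain can evict the pod without violating the PDB.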
If the pod was force deleted, note that this only removes the pod object from the API server; its containers may still be running on the node. SSH into the worker node and confirm with the container runtime that they have actually stopped:
# bosh -d <service-instance_id> ssh <worker-node_id>
# crictl ps -a
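On a node running many containers, it can help to narrow the listing to the pod in question; for example, assuming the pod name found with the earlier kubectl command:
# crictl ps -a | grep <pod-name>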
Once the drain is no longer blocked, resume the upgrade:
# tkgi upgrade-cluster <cluster-name>
If the drain log shows the "Kill container failed" error, or if containers from the deleted pod are still listed, force-stop and remove them:
# crictl stop <container-id> --force
# crictl rm <container-id>
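If a container cannot be stopped or removed, inspecting it can show its current state and any runtime error before escalating to a reboot; for example, using the same container ID placeholder:
# crictl inspect <container-id>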
If the containers still cannot be removed, reboot the worker node, either from inside the VM:
# reboot now
or from vCenter. After the node is back up, confirm that the stuck containers are gone:
# crictl ps -a
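It can also help to confirm that BOSH reports the rebooted VM as running and that the node has rejoined the cluster; a short sketch reusing the placeholders above:
# bosh -d <service-instance_id> instances
# kubectl get node <node-name>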
Finally, resume the upgrade:
# tkgi upgrade-cluster <cluster-name>
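The resumed upgrade can then be monitored from the BOSH task output and the cluster's last operation status; for example, where <task-id> is the ID of the new upgrade task:
# bosh task <task-id>
# tkgi cluster <cluster-name>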