PKS cluster upgrade stuck at draining the node

Article ID: 316788

Products

VMware Cloud PKS

Issue/Introduction

 
 


Symptoms:
  • The PKS cluster upgrade gets stuck while draining a node and then fails.
  • You see messages similar to the following:
I, [2019-02-23T13:43:35.524172 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Applying VM state
I, [2019-02-23T13:45:14.166716 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Running pre-start for worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3)
I, [2019-02-23T13:45:27.234148 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Starting instance worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3)
I, [2019-02-23T13:45:27.707511 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Waiting for 10.0 seconds to check worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) status
I, [2019-02-23T13:45:37.708100 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Checking if worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) has been updated after 10.0 seconds
I, [2019-02-23T13:45:37.723070 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Waiting for 15.0 seconds to check worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) status
I, [2019-02-23T13:45:52.723546 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Checking if worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) has been updated after 15.0 seconds
I, [2019-02-23T13:45:52.740554 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Waiting for 15.0 seconds to check worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) status
I, [2019-02-23T13:46:07.740966 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Checking if worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) has been updated after 15.0 seconds
I, [2019-02-23T13:46:07.760749 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Waiting for 15.0 seconds to check worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) status
I, [2019-02-23T13:46:22.761471 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Checking if worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3) has been updated after 15.0 seconds
I, [2019-02-23T13:46:22.778678 #15670] [instance_update(worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3))]  INFO -- DirectorJobRunner: Running post-start for worker/839d39f5-c434-43c5-ba74-bd7bc7bde149 (3)
I, [2019-02-23T13:46:49.969120 #15670] [instance_update(worker/d30b3ac2-1ac9-47a5-b45d-c3bcc6aed5e1 (15))]  INFO -- DirectorJobRunner: Updating instance worker/d30b3ac2-1ac9-47a5-b45d-c3bcc6aed5e1 (15), changes: "stemcell, packages, configuration, job"
I, [2019-02-23T13:46:49.995086 #15670] [instance_update(worker/d30b3ac2-1ac9-47a5-b45d-c3bcc6aed5e1 (15))]  INFO -- DirectorJobRunner: Running drain for worker/d30b3ac2-1ac9-47a5-b45d-c3bcc6aed5e1 (15)
 
 


Environment

VMware PKS 1.x

Cause

 
 

Resolution

You need to manually clear the pods on the affected node after identifying whether any applications on it are deployed using DaemonSets or ReplicaSets.

Run bosh task <task id> --debug | grep INFO to get the current status of the upgrade task. If it is stuck at the drain step, proceed with the steps below.
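For example, the same debug output can be narrowed to the drain step; the worker instance named in the last matching line is the node that is stuck (the grep pattern is only an illustration):

bosh task <task id> --debug | grep "Running drain" (the last matching line names the worker instance currently being drained)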

kubectl get nodes -o wide (shows the node names/IDs and their IP addresses)
kubectl get pod --all-namespaces -o wide | grep <Node ID> (lists the pods running on that node)
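On recent kubectl versions, a field selector returns the same pod list without grep; <Node ID> below is a placeholder for the node name from the previous command:

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<Node ID>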

SSH to the node that is stuck in the drain state and stop the kubelet and docker services.
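A minimal sketch, assuming a BOSH-managed worker VM where the jobs run under monit; the deployment name, instance ID, and job names are placeholders and can differ between releases, so check monit summary first:

bosh -d <deployment> ssh worker/<instance id>
sudo monit summary (lists the BOSH jobs and their current state)
sudo monit stop kubelet
sudo monit stop docker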

kubectl delete pods <pod name> -n <namespace> --grace-period=0 --force (most of the time, the upgrade proceeds automatically once the failed pods are deleted)
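To spot which pods are actually blocking the drain, filtering the pod list on that node for unhealthy states can help; the state pattern below is only an illustration:

kubectl get pods --all-namespaces -o wide | grep <Node ID> | grep -E "Terminating|Error|CrashLoopBackOff"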

If the issue persists after deleting the failed pods, manually drain the node using the following command:

kubectl drain <Node ID> --force --ignore-daemonsets
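If the drain is still blocked by pods that use emptyDir volumes, adding the flag below may help; note that it deletes the pods' local emptyDir data, so confirm with the application owner that this is safe first (newer kubectl releases rename the flag to --delete-emptydir-data):

kubectl drain <Node ID> --force --ignore-daemonsets --delete-local-data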

Note: Before you delete anything, make sure you have documented which type of deployment is used to deploy the pods.

  • If it is a DaemonSet, you do not need to worry, because the DaemonSet will maintain one pod copy per node.
  • If it is a ReplicaSet, make sure at least one replica of the application is always available, unless your DevOps team confirms it can be shut down for the time being (see the scaling sketch after this list).
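If your DevOps team approves, the replica counts can be checked and temporarily adjusted as follows; the deployment name nginx is only an example:

kubectl get rs --all-namespaces -o wide (shows desired/current/ready replica counts)
kubectl scale deployment nginx --replicas=1 -n <namespace> (adjusts the replica count; revert it after the upgrade)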


kubectl get pods --all-namespaces -o wide (to see how many copies of each pod are running)
kubectl get ds --all-namespaces (to see whether the pods are managed by a DaemonSet)
kubectl get deployment nginx -o yaml > first.yaml (to save the deployment definition for review; nginx is an example name)
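To confirm which controller owns a specific pod, its ownerReferences can also be inspected; the pod and namespace names below are placeholders:

kubectl get pod <pod name> -n <namespace> -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'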

 
 


Workaround:
 
 


Additional Information

 
 


Impact/Risks: