During a TKGI cluster upgrade the worker node hangs indefinitely
Article ID: 298579

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

During the 'Running errand Upgrade all clusters errand for TKGI' execution step of the TKGI upgrade process (all versions), the verbose output in Ops Manager shows the following for a long time (hours) with no progress:

 

===== 2019-03-01 14:41:50 UTC Running "/usr/local/bin/bosh --no-color --non-interactive --tty --environment=10.193.90.11 --deployment=pivotal-container-service-8bc65453a0a0c8a92afe run-errand upgrade-all-service-instances"
Using environment '10.193.90.11' as client 'ops_manager'

Using deployment 'pivotal-container-service-8bc65453a0a0c8a92afe'

Task 12565

Task 12565 | 14:41:51 | Preparing deployment: Preparing deployment (00:00:01)
Task 12565 | 14:41:52 | Running errand: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979

 

The bosh tasks output shows two tasks running. The first is the parent upgrade-all-service-instances errand, and the second is a 'create deployment' task for the service instance deployment of the TKGI cluster.

 

ubuntu@Ops-man-2-3-7:~$ bosh tasks
Using environment '10.193.90.11' as client 'ops_manager'

ID     State       Started At                    Last Activity At              User                                            Deployment                                             Description                                                                                              Result
12566  processing  Fri Mar  1 14:41:54 UTC 2019  Fri Mar  1 14:41:54 UTC 2019  pivotal-container-service-8bc65453a0a0c8a92afe  service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81  create deployment                                                                                        -
12565  processing  Fri Mar  1 14:41:51 UTC 2019  Fri Mar  1 14:41:51 UTC 2019  ops_manager                                     pivotal-container-service-8bc65453a0a0c8a92afe         run errand upgrade-all-service-instances from deployment pivotal-container-service-8bc65453a0a0c8a92afe  -

2 tasks

 

The BOSH task output shows the redeployment of the TKGI cluster's service instance. In the output you will see that a worker node is hung at "Updating instance worker".

 

ubuntu@Ops-man-2-3-7:~$ bosh task 12566
Using environment '10.193.90.11' as client 'ops_manager'

Task 12566

Task 12566 | 14:41:55 | Preparing deployment: Preparing deployment
Task 12566 | 14:41:56 | Warning: DNS address not available for the link provider instance: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979
Task 12566 | 14:41:57 | Warning: DNS address not available for the link provider instance: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979
Task 12566 | 14:41:57 | Warning: DNS address not available for the link provider instance: pivotal-container-service/d5bfedbe-ae18-4f39-98f0-4b4c94550979
Task 12566 | 14:42:08 | Preparing deployment: Preparing deployment (00:00:13)
Task 12566 | 14:43:08 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 12566 | 14:43:08 | Updating instance master: master/71697842-061b-450a-ac0b-73f04012a22a (0) (canary) (00:01:20)
Task 12566 | 14:44:28 | Updating instance master: master/4902c248-fd28-4129-8bca-5094c423fc73 (2) (00:01:09)
Task 12566 | 14:45:37 | Updating instance master: master/de2fda96-f396-4f8d-8bcb-c40306d4d88e (1) (00:01:18)
Task 12566 | 14:46:55 | Updating instance worker: worker/dfa10f94-e690-4249-8463-dc7d9fc3efe6 (0) (canary) (00:01:25)
Task 12566 | 14:48:20 | Updating instance worker: worker/39449084-e393-4fc4-a7b5-1ba613227012 (3)

 

If you use bosh ssh to connect to the worker node and check the kubelet drain logs, you can confirm that the kubelet drain script is unable to evict a pod from the worker node.

bosh -d service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81 ssh worker/39449084-e393-4fc4-a7b5-1ba613227012

worker/39449084-e393-4fc4-a7b5-1ba613227012:~$ sudo -i

worker/39449084-e393-4fc4-a7b5-1ba613227012:~# ps -ef | grep drain
root 18428 724 0 14:48 ? 00:00:00 bash /var/vcap/jobs/kubelet/bin/drain job_changed hash_changed docker
root 18476 18428 0 14:48 ? 00:00:00 kubectl --kubeconfig /var/vcap/jobs/kubelet/config/kubeconfig-drain drain -l bosh.id=39449084-e393-4fc4-a7b5-1ba613227012 --grace-period 10 --force --delete-local-data --ignore-daemonsets

worker/39449084-e393-4fc4-a7b5-1ba613227012:~# cd /var/vcap/sys/log/kubelet/
worker/39449084-e393-4fc4-a7b5-1ba613227012:/var/vcap/sys/log/kubelet# ls -l drain.stderr.log
-rw-r--r-- 1 root root 65019 Mar 1 15:24 drain.stderr.log
worker/39449084-e393-4fc4-a7b5-1ba613227012:/var/vcap/sys/log/kubelet# tail -f drain.stderr.log
error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

 

You can also use --debug to get more information about the task that is running:

bosh task <task number> --debug | grep INFO

If the details of the offending pod (workload) are not clear from drain.stderr.log, the following steps can be run to replicate the drain command and retrieve more information:
 

 

1. The following command lists the IPs of the Kubernetes nodes:

kubectl get nodes -o wide
NAME                                   STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
6d038635-17b5-419b-b94c-f0a72525c66b   Ready    <none>   4d    v1.12.4   10.193.90.98   10.193.90.98   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
77c0e5c7-546b-46a7-8e47-8908687980f5   Ready    <none>   4d    v1.12.4   10.193.90.95   10.193.90.95   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
7e3ecf57-f2ff-4f69-b503-346cc5c93cea   Ready    <none>   4d    v1.12.4   10.193.90.97   10.193.90.97   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
9a107784-4ed5-4557-b6e3-4b43515341b5   Ready    <none>   4d    v1.12.4   10.193.90.96   10.193.90.96   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1
d5bfc2b0-223f-4cdd-a173-f1acba6fd07a   Ready    <none>   4d    v1.12.4   10.193.90.94   10.193.90.94   Ubuntu 16.04.5 LTS   4.15.0-42-generic   docker://18.6.1

 

2. Find the IP that matches the IP of the worker node that is hung:

bosh -d service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81 vms

Deployment 'service-instance_4b8ad40a-6c1a-4a22-9c3c-1330422ddb81'

Instance                                     Process State  AZ   IPs           VM CID                                   VM Type      Active
master/4902c248-fd28-4129-8bca-5094c423fc73  running        az1  10.193.90.92  vm-07eb1792-ec2d-44c6-a3a1-ee2c1a98f514  medium.disk  true
master/71697842-061b-450a-ac0b-73f04012a22a  running        az1  10.193.90.91  vm-cd2918ea-8701-4d2a-87d8-2170f31cf144  medium.disk  true
master/de2fda96-f396-4f8d-8bcb-c40306d4d88e  running        az1  10.193.90.93  vm-2cdb8541-61c0-4c80-8ae4-97251a1a98fc  medium.disk  true
worker/39449084-e393-4fc4-a7b5-1ba613227012  running        az1  10.193.90.95  vm-cb68be3e-cf1f-406e-b786-ad2f31f67937  medium.disk  true
worker/622db2c3-3c01-4ddd-84a3-9e702dc34e54  running        az1  10.193.90.96  vm-fc6b2f3b-b9fc-42f8-be3d-306a0029aa55  medium.disk  true
worker/b026c929-6054-477d-a049-de24ecca0d76  running        az1  10.193.90.97  vm-1ca53832-54bf-4a7e-ac27-20f64ebb3be1  medium.disk  true
worker/cae662e4-6380-4126-a981-b1f0e5837952  running        az1  10.193.90.98  vm-1771a9ab-3104-4c4b-b04c-033a8b6ada42  medium.disk  true
worker/dfa10f94-e690-4249-8463-dc7d9fc3efe6  running        az1  10.193.90.94  vm-31ef83bd-0ea7-476c-8911-659dd3c584ce  medium.disk  true

 

3. Run the drain command directly from kubectl:

kubectl drain 77c0e5c7-546b-46a7-8e47-8908687980f5  --grace-period 10 --force --delete-local-data --ignore-daemonsets

node/77c0e5c7-546b-46a7-8e47-8908687980f5 already cordoned
WARNING: Ignoring DaemonSet-managed pods: fluent-bit-mbvtg
error when evicting pod "nginx-9cbcd98fd-lb7hj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

 

4. The output confirms that the pod nginx-9cbcd98fd-lb7hj cannot be evicted because the eviction would violate the pod's disruption budget.
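
5. If it is not clear which PodDisruptionBudget is blocking the eviction, you can list the PDBs in the cluster and inspect the one whose selector matches the pod named in the error. A minimal sketch, assuming the pod runs in the default namespace and the PDB is named nginx-pdb (both names are illustrative, not taken from the output above):

kubectl get pdb --all-namespaces

kubectl describe pdb nginx-pdb -n default

The describe output shows the selector, the minAvailable or maxUnavailable setting, and the allowed disruptions; an allowed disruptions value of 0 means every eviction attempt will be refused.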

Environment


Cause

An application owner can create a PodDisruptionBudget (PDB) object for each application. A PDB limits the number of pods of a replicated application that can be down simultaneously due to voluntary disruptions.
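
For example, a minimal PodDisruptionBudget like the following (the name, label, and value are illustrative, not taken from the environment above) requires at least one matching pod to stay available, so evicting the last running replica is refused:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx

Note: on older Kubernetes releases, such as the 1.12 cluster shown above, the apiVersion is policy/v1beta1.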

This is a known issue: a Kubernetes PDB can conflict with a TKGI (formerly PKS) upgrade and prevent the kubelet job from being drained.

Resolution

First, see whether the PDB can be changed or even deleted to allow the upgrade to continue. If that does not resolve the issue, the following are possible workarounds:

  1. Configure .spec.replicas for the workload to be greater than the minimum required by the PodDisruptionBudget object. When the number of replicas in .spec.replicas is greater than the number of pods the PodDisruptionBudget requires to stay available, voluntary disruptions can occur and the drain can proceed (see the sketch after this list).
  2. For more information, see How Disruption Budgets Work in the Kubernetes documentation, and see Prepare to Upgrade in Upgrading PKS for workload capacity and uptime requirements.
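
As a sketch of workaround 1, assume the blocking budget is the illustrative nginx-pdb above with minAvailable: 1 and that the nginx Deployment runs a single replica. Scaling the Deployment above the budget's minimum gives the eviction API room to honor the PDB so the drain can proceed:

kubectl scale deployment nginx --replicas=2

Then confirm that the budget now allows disruptions (ALLOWED DISRUPTIONS greater than 0):

kubectl get pdb nginx-pdb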