Updating the master instance of a TKGI cluster gets stuck at the post-start step

Article ID: 400311

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The BOSH task got stuck at the post-start step when updating the master instance of a TKGI cluster. As shown below, the task hung at the post-start step for a long time without completing, until it eventually timed out.

Task 4201 | 05:54:22 | Preparing deployment: Preparing deployment (00:00:08)
Task 4201 | 05:54:22 | Preparing deployment: Rendering templates (00:00:04)
Task 4201 | 05:54:26 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 4201 | 05:54:27 | Updating instance master: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:54:29 | L executing pre-stop: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:54:30 | L executing drain: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:54:31 | L stopping jobs: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:54:58 | L executing post-stop: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:55:16 | L installing packages: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:55:19 | L configuring jobs: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:55:19 | L executing pre-start: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:55:20 | L starting jobs: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:56:16 | L executing post-start: master/f029c082-####-####-####-3f037b37732c (0) 

A successful update should look like:

......
Task 4201 | 05:55:19 | L executing pre-start: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:55:20 | L starting jobs: master/f029c082-####-####-####-3f037b37732c (0) (canary)
Task 4201 | 05:56:16 | L executing post-start: master/f029c082-####-####-####-3f037b37732c (0) (canary) (00:01:57)

However, when checking the state of this master instance with either the "bosh instances --ps" or the "monit summary" command, all jobs on the instance were shown as running.
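
For reference, the instance state can be checked with commands like the ones below. The deployment name is a placeholder for the TKGI cluster's service-instance deployment, which can be found with "bosh deployments":

# From a VM with the BOSH CLI configured (for example, the Ops Manager VM)
bosh -d service-instance_<cluster-uuid> instances --ps

# Or directly on the master VM
sudo /var/vcap/bosh/bin/monit summary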

Environment

  • Tanzu Kubernetes Grid Integrated Edition

Cause

When checking the running processes on the problematic master instance, it was found that the post-start script, and the "kubectl delete pod -n kube-system metrics-server" command it executed, never completed.

root       14592     884  0 04:54 ?        00:00:00 /bin/bash -e /var/vcap/jobs/kube-apiserver/bin/post-start

root       14855   14592  0 04:54 ?        00:00:00 xargs /var/vcap/packages/kubernetes/bin/kubectl --kubeconfig=/var/vcap/jobs/kube-controller-manager/config/admin-kubeconfig delete pod -n kube-system
root       14868   14855  0 04:54 ?        00:00:00 /var/vcap/packages/kubernetes/bin/kubectl --kubeconfig=/var/vcap/jobs/kube-controller-manager/config/admin-kubeconfig delete pod -n kube-system metrics-server-####-dmfw5 metrics-server-####-gkgs6 metrics-server-####-lbwvt metrics-server-####-sb5xm
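
The stuck processes above can be located, for example, by SSHing into the master instance and filtering the process list. The deployment name below is a placeholder:

# SSH to the master instance from the BOSH CLI environment
bosh -d service-instance_<cluster-uuid> ssh master/0

# On the master VM, look for the post-start script and its kubectl child processes
sudo ps -ef | grep -E 'post-start|kubectl'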

And "kubectl -n kube-system get pods" command also showed several metrics-server-####pods in Terminating state. "kubectl describe" those pods returned FailedKillPod error.

Events:
  Type     Reason         Age                     From     Message
  ----     ------         ----                    ----     -------
  Warning  FailedKillPod  4m49s (x570 over 127m)  kubelet  error killing pod: failed to "KillPodSandbox" for "42bb2a35-####-####-####-155956d6538e" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"bff337142e1a59af475####8de33e2c64108c9c0e78448adab5cc2d0f485\": plugin type=\"antrea\" failed (delete): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/antrea/cni.sock: connect: no such file or directory\""

Since there had previously been a network issue with the worker nodes hosting the pod containers, the FailedKillPod error was expected. However, it blocked the deletion of the pods from the Kubernetes platform and consequently prevented the post-start script from ever completing.

Resolution

The metrics-server-#### pods in the kube-system namespace are created by a Deployment resource, so they are recreated automatically once they are deleted. If for some reason the pods cannot be deleted successfully (as shown above), the pods stuck in Terminating state can be forcefully deleted with the command "kubectl -n kube-system delete pod metrics-server-#### --force". After the pods have been deleted, initiate the BOSH task to update the master instance again.
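
An illustrative sequence is shown below; the pod name is a placeholder and must be replaced with the actual pod names observed in the cluster:

# Identify the metrics-server pods stuck in Terminating state
kubectl -n kube-system get pods | grep metrics-server

# Force delete each stuck pod (adding --grace-period=0 is commonly done when pods remain stuck in Terminating)
kubectl -n kube-system delete pod metrics-server-####-dmfw5 --force --grace-period=0

# Verify the Deployment has recreated the pods and they reach Running state
kubectl -n kube-system get pods | grep metrics-server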