Supervisor cluster upgrade hung: CoreDNS and kube-proxy pods are stuck in ContainerCreating trying to mount their volumes.


Article ID: 319305


Updated On:

Products

VMware vSphere ESXi, VMware vSphere with Tanzu

Issue/Introduction

This KB article helps resolve this issue so that the cluster upgrade can continue.

Symptoms:
  • The customer is upgrading the WCP (Supervisor) cluster from version 1.20 to 1.21. 
  • CoreDNS and kube-proxy pods are observed stuck in ContainerCreating while trying to mount their volumes.
  • wcp.log shows the service just waiting for the API server to come back up:
2022-03-09T10:52:19.128Z warning wcp [kubelib/retry.go:93] [opID=622988b2] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/tnmada-sd01-nr7t8/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:19.128Z warning wcp [kubelib/retry.go:93] [opID=622988c5] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/cbgt-c360-poc-6qz5x/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:19.129Z warning wcp [kubelib/retry.go:93] [opID=622988b8] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/vk8s-td01-w32ek/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:19.129Z warning wcp [kubelib/retry.go:93] [opID=622987fa] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/itt-tmws-01/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:19.129Z warning wcp [kubelib/retry.go:93] [opID=622988b4] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/ts-datahub-01/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:19.129Z warning wcp [kubelib/retry.go:93] [opID=622988b0] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/vk8sm-sd01-rq11n/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:19.129Z warning wcp [kubelib/retry.go:93] [opID=622988c1] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/stdbld-td01-26ny4/resourcequotas?timeout=2m0s. Will be retried.
2022-03-09T10:52:20.086Z warning wcp [kubelib/retry.go:93] [opID=622988ec] Request to apiserver failed. Err <nil>, Endpoint https://10.91.184.220:6443/api/v1/namespaces/vmware-system-csi/secrets/vsphere-config-secret?timeout=2m0s. Will be retried.
  • Running `kubectl get secret -A` returns the below error:

   Error from server: illegal base64 data at input byte 3


Environment

VMware vSphere 7.0 with Tanzu

Cause

This issue happens if the `kubeadm-config` ConfigMap contains the `enable-aggregator-routing` flag in the kube-apiserver spec. This flag was only included in the GA and Day 1 releases.
This can be verified in the support bundle file commands/kubectl_describe_configmaps.txt from the cluster where the upgrade is stuck:
Name:         kubeadm-config
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
ClusterConfiguration:
----
apiServer:
  certSANs:
  - 127.0.0.1
  - x.x.x.x
  - supervisor.default.svc
  extraArgs:
    admission-control-config-file: /etc/vmware/wcp/admission-control.yaml
    anonymous-auth: "false"
    audit-log-maxage: "30"
    audit-log-maxbackup: "10"
    audit-log-maxsize: "100"
    audit-log-path: /var/log/vmware/audit/kube-apiserver.log
    audit-policy-file: /etc/vmware/wcp/audit-policy.yaml
    enable-admission-plugins: DenyEscalatingExec,NamespaceLifecycle,ServiceAccount,NodeRestriction,EventRateLimit,LimitRanger,PersistentVolumeLabel,DefaultStorageClass,DefaultTolerationSeconds,ResourceQuota,ValidatingAdmissionWebhook,PodSecurityPolicy
    enable-aggregator-routing: "true"
    enable-bootstrap-token-auth: "true"
    experimental-encryption-provider-config: /etc/vmware/wcp/encryption-config.yaml
    insecure-port: "0"
    kubelet-https: "true"
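
On a live cluster, the same ConfigMap can also be inspected directly from a CPVM. A minimal sketch:

   # Check whether the enable-aggregator-routing flag is present
   kubectl -n kube-system describe configmap kubeadm-config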

Therefore, this issue will only occur if the WCP cluster was originally enabled on a GA or Day 1 release and has been upgraded all the way up to 1.21.

Resolution

1. Add `--experimental-encryption-provider-config=/etc/vmware/wcp/encryption-config.yaml` to the manifest /etc/kubernetes/manifests/kube-apiserver.yaml on all the CPVMs that are at 1.21, as sketched below.
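
A minimal sketch of the edit (back up the manifest to a directory outside /etc/kubernetes/manifests first):

   # On each 1.21 CPVM, back up and edit the kube-apiserver static pod manifest
   cp /etc/kubernetes/manifests/kube-apiserver.yaml /root/kube-apiserver.yaml.bak
   vi /etc/kubernetes/manifests/kube-apiserver.yaml
   # Under the kube-apiserver container's command list, add:
   #   - --experimental-encryption-provider-config=/etc/vmware/wcp/encryption-config.yaml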

2. All kube-apiserver pods will be restarted automatically after step (1).
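
To confirm the restart completed on each node, one quick check (a sketch; `component=kube-apiserver` is the standard label on the kubeadm static pod):

   # All kube-apiserver pods should return to Running after the manifest edit
   kubectl -n kube-system get pods -l component=kube-apiserver -o wide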

3. After steps (1) and (2), run `kubectl get secrets -A` to check whether any error remains. Usually there will be an error like:
- kubectl get secret -A
 Error from server (InternalError): Internal error occurred: unable to transform key "/registry/secrets/kube-system/bootstrap-token-jd95rl": no matching prefix found.

This happens because the secret was not encrypted correctly before being stored in etcd.
On any CPVM, run `etcdctl del <secret-key>`, where <secret-key> is `/registry/secrets/kube-system/bootstrap-token-jd95rl` in the above example.
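
On the CPVMs, etcdctl typically needs the etcd client certificates. A minimal sketch, assuming the standard kubeadm certificate paths (adjust to the paths actually present on the CPVM):

   # Delete the bad secret key directly from etcd (run on any CPVM)
   ETCDCTL_API=3 etcdctl \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
     --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
     del /registry/secrets/kube-system/bootstrap-token-jd95rl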

4. After step (3), `kubectl get secrets -A` and all the other commands should work without any failures.

5. If you are still getting errors with `kubectl get secrets -A`, run the below command to list all bootstrap-token keys:

   etcdctl get / --prefix --keys-only | grep '/registry/secrets/kube-system/bootstrap-token'
 

6. Remove all the bootstrap-token keys listed in step (5):
   etcdctl del <secret-key>   (where <secret-key> is /registry/secrets/kube-system/bootstrap-token-xxxx)
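
If several tokens are listed, they can all be removed in one pass. A sketch (the same etcdctl certificate flags shown in step 3 apply here as well):

   # Delete every bootstrap-token secret key returned by step 5
   for key in $(etcdctl get /registry/secrets/kube-system/bootstrap-token --prefix --keys-only); do
       etcdctl del "$key"
   done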

7. Apart from adding the line to kube-apiserver.yaml, you also need to run a script (see the attachment to this KB) to remove the flag from the kubeadm-config ConfigMap. The script fixes the underlying root cause.
This script only needs to be run on one of the three CPVMs.
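
Once the script has run, a quick check (a sketch) should confirm the flag is gone:

   # No output means enable-aggregator-routing has been removed from kubeadm-config
   kubectl -n kube-system get configmap kubeadm-config -o yaml | grep enable-aggregator-routing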


Additional Information

Impact/Risks:
The WCP upgrade will be stuck.

Attachments

remove_enable_aggregator_routing_flag