The Supervisor Cluster upgrade from 1.24.9 to 1.25.6 is stuck and does not progress.
In the vCenter GUI, the cluster remains in the Configuring status.
The Kubernetes status also shows warnings.
The csi and auth pods are in CrashLoopBackOff status.
Open an SSH session to the Supervisor Control Plane and run:
root@xxx [ ~ ]# kubectl get pods -A | egrep "NAMESPACE|CrashLoopBackOff"
NAMESPACE           NAME                                     READY   STATUS             RESTARTS         AGE
vmware-system-csi   vsphere-csi-controller-xxx-62572         5/7     CrashLoopBackOff   28 (86d ago)     51m
vmware-system-csi   vsphere-csi-controller-xxx-hmjkz         6/7     CrashLoopBackOff   27 (4m44s ago)   51m
vmware-system-csi   vsphere-csi-controller-xxx-lbh86         6/7     CrashLoopBackOff   29 (4m1s ago)    51m
vmware-system-tkg   tanzu-auth-controller-managerxxx-pflgt   0/1     CrashLoopBackOff   13 (2m29s ago)   47m
The LB certificates on the Supervisor Control Plane nodes are not valid.
To list the expiry date of each certificate on a node, run:
find / -type f \( -name "*.cert" -o -name "*.crt" \) -print 2>/dev/null | egrep -iv 'ca.crt$|ca-bundle.crt$|kubelet\/pods|var\/lib\/containerd|run\/containerd|backup' | xargs -L 1 -t -i bash -c 'openssl x509 -noout -text -in {}|grep After'
bash -c 'openssl x509 -noout -text -in /storage/core/software-update/updates/8.0.3.00100/scripts/patches/payload/components-script/vcdb_vmodl/currentPyVpx/tests/connectionLimit.crt|grep After'
Not After : Feb 17 14:11:03 2022 GMT
bash -c 'openssl x509 -noout -text -in /storage/core/software-update/updates/8.0.2.00400/scripts/patches/payload/components-script/vcdb_vmodl/currentPyVpx/tests/connectionLimit.crt|grep After'
Not After : Feb 17 14:11:03 2022 GMT
bash -c 'openssl x509 -noout -text -in /storage/core/software-update/updates/8.0.3.00000/scripts/patches/payload/components-script/vcdb_vmodl/currentPyVpx/tests/connectionLimit.crt|grep After'
Not After : Feb 17 14:11:03 2022 GMT
bash -c openssl x509 -noout -text -in /etc/vmware/wcp/tls/.ncp/lb-default.cert|grep After
Not After : Apr 10 18:55:22 2024 GMT
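Reading many "Not After" lines by eye is error-prone, so the comparison against the current time can be scripted. The following is a minimal sketch, not part of the original procedure. It assumes GNU date (for the -d option), and is_expired is a hypothetical helper name.

```shell
# Sketch: decide whether an openssl "Not After" line lies in the past.
# Assumption: GNU date is available; is_expired is a hypothetical helper.
is_expired() {
    # $1: a line such as "Not After : Apr 10 18:55:22 2024 GMT"
    # $2: optional reference time in epoch seconds (defaults to now)
    local line="$1"
    local now="${2:-$(date +%s)}"
    local stamp expiry
    # Drop everything up to and including "Not After : ".
    stamp="${line#*Not After : }"
    # Convert the timestamp to epoch seconds; return 2 if it cannot be parsed.
    expiry="$(date -u -d "$stamp" +%s)" || return 2
    # Exit status 0 means the certificate has expired.
    [ "$expiry" -lt "$now" ]
}

# Example use against the LB certificate path shown above:
#   line="$(openssl x509 -noout -text -in /etc/vmware/wcp/tls/.ncp/lb-default.cert | grep After)"
#   is_expired "$line" && echo "expired"
```

As an alternative, openssl x509 -checkend 0 -noout -in <file> reports through its exit status whether a certificate has already expired.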
Steps to resolve the issue:
1. Rotate the certificates on the Supervisor Control Plane nodes
Open an SSH session to the Supervisor Control Plane master node. Use /usr/lib/vmware-wcp/decryptK8Pwd.py to get the credentials.
Run kubectl get nodes -o wide to get the IP addresses of the other nodes.
Run the certmgr tool from the KB article "Replace vSphere with Tanzu Supervisor Certificates" to rotate the certificates.
# ./certmgr certificates rotate
Repeat this command on each of the other Supervisor Control Plane nodes to rotate their certificates.
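To collect the addresses of the remaining nodes for step 1, the INTERNAL-IP column of the kubectl get nodes -o wide output can be extracted with awk. This is a small sketch under that assumption. node_internal_ips is a hypothetical helper name, and the column is located by its header rather than by a fixed position.

```shell
# Sketch: print the INTERNAL-IP column from `kubectl get nodes -o wide` output.
# Assumption: the header row is present; node_internal_ips is a hypothetical helper.
node_internal_ips() {
    awk '
        # Header row: remember which field holds INTERNAL-IP.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "INTERNAL-IP") col = i; next }
        # Data rows: print that field once the column is known.
        col { print $col }
    '
}

# Example use on a Supervisor Control Plane node:
#   kubectl get nodes -o wide | node_internal_ips
```

Each printed address is a node on which the certmgr command still has to be run by hand over SSH.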
2. Restart auth pods
Run kubectl get pods -A to get the names of the auth pods.
# kubectl get pods -A | grep auth
Then restart the pods from the previous command's output one by one. Wait until each pod is in the Running state before restarting the next one.
# kubectl delete pod -n kube-system wcp-authproxy-xxx
3. Restart csi pods
Run kubectl get pods -A to get the names of the csi pods.
# kubectl get pods -A | grep csi
Then restart the pods from the previous command's output one by one. Wait until each pod is in the Running state before restarting the next one.
# kubectl delete pod -n vmware-system-csi vsphere-csi-controller-xxx-xxx
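Steps 2 and 3 both require waiting for a pod to return to the Running state before deleting the next one. The sketch below shows one possible way to automate that check by polling kubectl get pods -A. The helper names, the 10-second polling interval and the 300-second default timeout are assumptions, not values from the original procedure.

```shell
# Sketch: succeed only if every pod whose name matches a pattern is Running
# with all of its containers ready. Reads `kubectl get pods -A` output on stdin.
# Assumption: pods_all_running and wait_pods_running are hypothetical helpers.
pods_all_running() {
    # $1: regular expression matched against the NAME column
    awk -v pat="$1" '
        $1 == "NAMESPACE" { next }            # skip the header row
        $2 ~ pat {
            seen = 1
            split($3, r, "/")                 # READY column, e.g. 6/7
            if ($4 != "Running" || r[1] != r[2]) bad = 1
        }
        # Fail when no pod matched or when any matching pod is not ready.
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# Poll until the matching pods are Running or the timeout expires.
wait_pods_running() {
    # $1: name pattern, $2: timeout in seconds (assumed default: 300)
    local pat="$1"
    local deadline=$(( $(date +%s) + ${2:-300} ))
    until kubectl get pods -A | pods_all_running "$pat"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep 10
    done
}

# Example use after deleting one pod:
#   kubectl delete pod -n vmware-system-csi vsphere-csi-controller-xxx-xxx
#   wait_pods_running 'vsphere-csi-controller' || echo "pods are still not Running"
```

Because a deleted pod is recreated under a new name, the check matches on a name pattern rather than on the exact pod name.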
After the pods have restarted and are in the Running state, the upgrade process continues as expected.