Supervisor Cluster upgrade is stuck in Configuring status

Article ID: 377662

Updated On:

Products

VMware vSphere with Tanzu

Issue/Introduction

   The upgrade of the Supervisor Cluster from 1.24.9 to 1.25.6 is stuck and does not progress.

   In the vCenter GUI, the cluster remains in Configuring status.

 

   The Kubernetes status for the cluster also shows warnings.

   The csi and auth pods are in CrashLoopBackOff status. To confirm, open an SSH session to a Supervisor Control Plane node and run:

   root@xxx [ ~ ]# kubectl get pods -A | egrep "NAMESPACE|CrashLoopBackOff"
   NAMESPACE                   NAME                                     READY   STATUS                  RESTARTS         AGE

   vmware-system-csi           vsphere-csi-controller-xxx-62572         5/7     CrashLoopBackOff        28 (86d ago)     51m
   vmware-system-csi           vsphere-csi-controller-xxx-hmjkz         6/7     CrashLoopBackOff        27 (4m44s ago)   51m
   vmware-system-csi           vsphere-csi-controller-xxx-lbh86         6/7     CrashLoopBackOff        29 (4m1s ago)    51m
   vmware-system-tkg           tanzu-auth-controller-managerxxx-pflgt   0/1     CrashLoopBackOff        13 (2m29s ago)   47m
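
The check above can also be reduced to just the namespace and pod name, which are the two values needed later when restarting pods. The following is a minimal sketch, not part of the original procedure; the function name crashloop_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A (STATUS in the fourth column).

```shell
# Hypothetical helper: print "namespace pod-name" for every pod whose
# STATUS column is CrashLoopBackOff.
# Reads the output of `kubectl get pods -A` on stdin and assumes the
# default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
crashloop_pods() {
  awk '$4 == "CrashLoopBackOff" { print $1, $2 }'
}

# Usage on a Supervisor Control Plane node:
#   kubectl get pods -A | crashloop_pods
```

The header line is skipped automatically because its fourth field is the literal word STATUS.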

 

Cause

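The load balancer (LB) certificates on the Supervisor Control Plane nodes are no longer valid. To list the expiry date ("Not After") of the certificates found on a node, run: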

find / -type f \( -name "*.cert" -o -name "*.crt" \) -print 2>/dev/null | egrep -iv 'ca.crt$|ca-bundle.crt$|kubelet\/pods|var\/lib\/containerd|run\/containerd|backup' | xargs -L 1 -t -i bash -c 'openssl x509 -noout -text -in {}|grep After'

bash -c 'openssl x509 -noout -text -in /storage/core/software-update/updates/8.0.3.00100/scripts/patches/payload/components-script/vcdb_vmodl/currentPyVpx/tests/connectionLimit.crt|grep After'
            Not After : Feb 17 14:11:03 2022 GMT
bash -c 'openssl x509 -noout -text -in /storage/core/software-update/updates/8.0.2.00400/scripts/patches/payload/components-script/vcdb_vmodl/currentPyVpx/tests/connectionLimit.crt|grep After'
            Not After : Feb 17 14:11:03 2022 GMT
bash -c 'openssl x509 -noout -text -in /storage/core/software-update/updates/8.0.3.00000/scripts/patches/payload/components-script/vcdb_vmodl/currentPyVpx/tests/connectionLimit.crt|grep After'
            Not After : Feb 17 14:11:03 2022 GMT

bash -c openssl x509 -noout -text -in /etc/vmware/wcp/tls/.ncp/lb-default.cert|grep After

               Not After : Apr 10 18:55:22 2024 GMT
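
Comparing each "Not After" value with the current time by eye is error-prone, so the comparison can be scripted. The following is a minimal sketch under stated assumptions: the function name not_after_expired is hypothetical, and it assumes GNU date (for the -d option) is available on the node.

```shell
# Hypothetical helper: succeed (exit 0) if the given "Not After"
# timestamp is already in the past, fail (exit 1) if it is still in
# the future, and exit 2 if the timestamp cannot be parsed.
# Assumes GNU date, which accepts strings such as
# "Apr 10 18:55:22 2024 GMT" with the -d option.
not_after_expired() {
  expiry=$(date -u -d "$1" +%s) || return 2
  now=$(date -u +%s)
  [ "$expiry" -lt "$now" ]
}

# Usage with the LB certificate path shown above:
#   end=$(openssl x509 -noout -enddate -in /etc/vmware/wcp/tls/.ncp/lb-default.cert | cut -d= -f2)
#   not_after_expired "$end" && echo "expired: $end"
```

openssl x509 -noout -enddate prints a single notAfter=... line, so cut -d= -f2 leaves only the timestamp.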

 

Resolution

 

Steps to resolve the issue:

1. Rotate the certificates on the Supervisor Control Plane nodes

   Open an SSH session to the Supervisor Control Plane master node. Use /usr/lib/vmware-wcp/decryptK8Pwd.py on the vCenter Server appliance to get the credentials.

   Run kubectl get nodes -o wide to find the IP addresses of the other nodes.

   Run the certmgr tool from the KB "Replace vSphere with Tanzu Supervisor Certificates" to rotate the certificates.

   # ./certmgr certificates rotate

 

   Repeat this command on the other Supervisor Control Plane nodes to rotate their certificates.


2. Restart auth pods

   Run kubectl get pods -A to get the names of the auth pods.

    # kubectl get pods -A | grep auth

   Then restart the pods listed by the previous command one by one, using the namespace and pod name from its output. Wait until each pod is in Running state before restarting the next one.

    # kubectl delete pod -n kube-system wcp-authproxy-xxx

 

3. Restart csi pods

   Run kubectl get pods -A to get the names of the csi pods.

    # kubectl get pods -A | grep csi

   Then restart the pods listed by the previous command one by one. Wait until each pod is in Running state before restarting the next one.

    # kubectl delete pod -n vmware-system-csi vsphere-csi-controller-xxx-xxx
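
The "delete one pod, wait, then delete the next" pattern from steps 2 and 3 can be scripted. The following is a minimal sketch, not part of the original procedure. It assumes the pods are managed by a Deployment, so that kubectl rollout status can be used to wait for the replacement pod; the function name restart_and_wait and the 300s timeout are hypothetical choices, and the Deployment name must be checked on the cluster (for example with kubectl get deployments -n <namespace>).

```shell
# Hypothetical helper: delete the given pods one at a time and wait
# for the owning Deployment to report all replicas available again
# before moving on to the next pod.
#   $1      namespace
#   $2      Deployment name (an assumption; verify it on the cluster)
#   $3...   pod names to delete
restart_and_wait() {
  ns="$1"
  deploy="$2"
  shift 2
  for pod in "$@"; do
    # kubectl delete waits for the pod to be removed by default.
    kubectl delete pod -n "$ns" "$pod" || return 1
    # Block until the Deployment's replacement pod is available,
    # or give up after the timeout.
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=300s || return 1
  done
}

# Usage (pod names are the placeholders from the steps above; the
# Deployment name is assumed from the pod-name prefix):
#   restart_and_wait vmware-system-csi vsphere-csi-controller vsphere-csi-controller-xxx-xxx
```

If the wait times out, the function stops without touching the remaining pods, so the remaining replicas keep serving while the failing pod is investigated.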

 

After the pods have restarted and are in Running state, the upgrade process continues as expected.