When selecting "Open Terminal" for some clusters under Virtual Infrastructure, the issue below is seen. It affects both vSphere and Kubernetes clusters.
When the cluster terminal is launched, pods are created in the backend on the respective TCA-CP appliance, but they remain stuck in the Pending state.
kubectl get pods -n tca-cp-cn | grep -i pending
tca-cp-cn 1234####-####-####-####-c7e5ba8f#### 0/1 Pending 0 17m
tca-cp-cn 5678####-####-####-####-6dc4015d#### 0/1 Pending 0 9h
tca-cp-cn 1234####-####-####-####-662bc696#### 0/1 Pending 0 6h
tca-cp-cn 12ab####-####-####-####-dc53d2a5#### 0/1 Pending 0 8h
tca-cp-cn 34cd####-####-####-####-d7b2a0f4#### 0/1 Pending 0 11m
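To see why a pod is stuck, its Events section can be inspected. The following is a minimal sketch, assuming kubectl is already configured for the TCA-CP cluster and that the namespace is tca-cp-cn as in the output above. The helper names pending_pod_names and describe_pending are hypothetical and not part of TCA. If no scheduling events appear at all, that may point to the scheduler itself rather than to resource constraints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of pods in the Pending phase.
# Assumes kubectl is configured for the TCA-CP cluster.
pending_pod_names() {
  local ns="$1"
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Hypothetical helper: print only the Events section of each pending pod.
describe_pending() {
  local ns="$1" pod
  pending_pod_names "$ns" | while read -r pod; do
    [ -n "$pod" ] || continue
    echo "== $pod =="
    # Keep everything from the "Events:" line to the end of the output.
    kubectl describe pod "$pod" -n "$ns" </dev/null | sed -n '/^Events:/,$p'
  done
}

# Usage: describe_pending tca-cp-cn
```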
TCA 3.x
The pods remain in the Pending state because the Kubernetes scheduler was unable to communicate with the other control plane components: it was still using expired certificates, even though certificate rotation had taken place long before.
The war-machine-agent service within the TCA appliance is responsible for automatically renewing the Kubernetes control plane component certificates when they are within 60 days of expiry.
Due to a race condition between two threads during the certificate rotation process, the control plane components do not restart and continue to use the old, expired certificates.
Renew the certificates using the steps below:
1. Clean up all pending pods using the following commands.
tcaNamespace=$(kubectl get namespace tca-mgr >/dev/null 2>&1 && echo "tca-mgr" || echo "tca-cp-cn")
kubectl get job -n $tcaNamespace | grep platform-mgr-tmp-cleanup-cronjob | awk '{print $1}' | xargs -I {} kubectl -n $tcaNamespace delete job {}
2. Restart the control plane components so that the rotated certificates take effect.
mkdir -p /home/admin/manifests-bk/
mv /etc/kubernetes/manifests/* /home/admin/manifests-bk/
# Wait up to 30 seconds for kubelet to remove the control plane pod containers.
# Confirm by checking that "kubectl get pods -A" fails with "connection refused".
mv /home/admin/manifests-bk/* /etc/kubernetes/manifests/
# Wait for the control plane pod containers to come up (maximum wait: 20 seconds).
# Check with the command below; once the control plane is up, it returns "ok".
kubectl get --raw=/readyz --kubeconfig=/home/admin/.kube/config
3. Update the kubeconfig secret with the rotated kubeconfig file, which contains the new certificates.
KUBECONFIG_B64=$(base64 -w 0 /etc/kubernetes/admin.conf)
kubectl apply --kubeconfig /etc/kubernetes/admin.conf -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: kubeconfig-secret
  namespace: ${tcaNamespace}
type: Opaque
data:
  kubeconfig: ${KUBECONFIG_B64}
EOF
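The readiness wait in step 2 and a check of the result of step 3 could be scripted roughly as follows. This is a sketch only, reusing the paths and secret name from the steps above. The helper names wait_for_readyz and secret_matches are hypothetical, and the one-second polling interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll /readyz until it returns "ok" or the timeout expires.
# $1: timeout in seconds (default 20, matching the wait described in step 2)
wait_for_readyz() {
  local timeout="${1:-20}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ "$(kubectl get --raw=/readyz --kubeconfig=/home/admin/.kube/config 2>/dev/null)" = "ok" ]; then
      echo "control plane ready after ${waited}s"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "control plane not ready after ${timeout}s" >&2
  return 1
}

# Hypothetical helper: succeed if the secret's kubeconfig matches the file on disk.
# $1: namespace (tca-mgr or tca-cp-cn)
# $2: kubeconfig path (default /etc/kubernetes/admin.conf)
secret_matches() {
  local ns="$1" conf="${2:-/etc/kubernetes/admin.conf}"
  kubectl get secret kubeconfig-secret -n "$ns" --kubeconfig "$conf" \
    -o jsonpath='{.data.kubeconfig}' | base64 -d | cmp -s - "$conf"
}

# Usage: wait_for_readyz 20 && secret_matches "$tcaNamespace" && echo "secret is current"
```

Both helpers only read cluster state, so they can be rerun safely if the first attempt times out.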