TCA Kubelet not functioning correctly after internal certificate rotation

Article ID: 423202


Products

VMware Telco Cloud Automation, VMware Telco Cloud Platform

Issue/Introduction

The following combinations of symptoms indicate that this KB applies:

  • 'Open Terminal' feature does not function correctly.

    • When a user opens a terminal on a Cluster or CNF, the terminal opens but times out after a while without functioning correctly.

  • Pods in Pending state on TCA appliances

    • In relation to the symptom above, any new pods spawned within the TCA appliance end up in a Pending state.

  • The kube-scheduler logs contain scheduling errors and report a large number of "Unauthorized" messages.

    • Run the following command to see the kube-scheduler logs:

      # kubectl logs kube-scheduler-photon -n kube-system

  • "/logs" partition within the TCA appliance is full or is filling up fast

    • Typically, the /logs/retained-logs/kubelet.service folder ends up consuming almost all of the space. Check disk usage with:

      # df -h

  • The kube-apiserver logs state clearly that a certificate has expired.

    • Run the following command to see the kube-apiserver logs:

      # kubectl logs kube-apiserver-photon -n kube-system

    • The following message is logged repeatedly:

      "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time ...

  • Many platform-mgr-tmp-cleanup-cronjob pods are stuck or in an error state.

      # kubectl get pods -A | grep platform-mgr-tmp-cleanup-cronjob
      # kubectl get jobs -A | grep platform-mgr-tmp-cleanup-cronjob
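When the /logs symptom above is present, the largest consumers can be pinpointed before anything is deleted. A minimal sketch; `largest_dirs` is a hypothetical helper, and the depth and count values are arbitrary choices:

```shell
# Show the ten largest directories under a path, biggest last
largest_dirs() {
    du -xh --max-depth=2 "$1" 2>/dev/null | sort -h | tail -n 10
}

# On the appliance (the kubelet.service folder named above is the usual culprit):
# largest_dirs /logs
```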

Environment

TCA: 3.0.x, 3.1.x, 3.2.x, 3.3.0.1

TCP: 3.1, 4.0, 4.0.1, 5.0, 5.0.2

Cause

  • TCA appliances contain an internal Kubernetes cluster whose internal certificates are valid for a maximum of 1 year.
  • These certificates are automatically rotated 60 days prior to expiration. This happens internally, without any notification to the end user.
  • As part of the certificate rotation process, the kubelet service and pods must also be restarted so that they pick up the new certificates. If they are not restarted, they continue to run on the old certificates (which typically remain valid for another 2 months).
  • During this period, the system appears completely functional, with all certificates rotated.

    Note: To check certificate expiration, run:

    # kubeadm certs check-expiration

  • If the pods and the kubelet service were not restarted, the symptoms stated above appear immediately after the original certificates expire (typically 60 days after the certificates were rotated).
  • On investigation, the certificates appear fine and rotated long ago; the failure occurs because the kubelet and the corresponding pods were not restarted correctly to pick up the new certificates.

    Note: The symptoms could also be observed if the certificate rotation itself failed and did not happen.
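The mismatch described above can be spotted by reading the end date of the certificate the kubelet is actually using, not just the rotated files reported by kubeadm. A minimal sketch; `check_cert_expiry` is a hypothetical helper, and GNU `date` (as on the appliance) is assumed:

```shell
# Report whether a certificate file has already expired (GNU date assumed)
check_cert_expiry() {
    end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
    if [ "$(date -d "$end" +%s)" -le "$(date +%s)" ]; then
        echo "EXPIRED: certificate ended on $end"
    else
        echo "OK: certificate valid until $end"
    fi
}

# The certificate the kubelet is actually using (standard appliance path):
# check_cert_expiry /var/lib/kubelet/pki/kubelet-client-current.pem
```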

Resolution

  • Resolved in TCA 3.3.0.1

Workaround:

If the certificates were renewed correctly, follow the steps below:

  1. SSH to the TCA appliance as the admin user
  2. Switch to the root user

    # su

  3. Verify certificate expiry for the appliance cluster control plane and kubelet. If these certificates are expired, KB 382787 must be applied first before proceeding.

    # kubeadm certs check-expiration
    # openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate

  4. Browse to /logs/retained-logs/kubelet.service and delete old logs from there to reclaim space.
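This cleanup can be done in one pass with find. A sketch; `prune_old_logs` is a hypothetical helper, and the 7-day cutoff is only an example, not a TCA recommendation (review what will be deleted first):

```shell
# Delete files older than a cutoff (in days) under a directory
prune_old_logs() {
    find "$1" -type f -mtime "+$2" -delete
}

# Reclaim space from retained kubelet logs (7-day cutoff as an example):
# prune_old_logs /logs/retained-logs/kubelet.service 7
# df -h /logs
```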
  5. Clean up pending platform-mgr-tmp-cleanup-cronjob jobs

    # tcaNamespace=$(kubectl get namespace tca-mgr >/dev/null 2>&1 && echo "tca-mgr" || echo "tca-cp-cn")
    # kubectl get job -n $tcaNamespace | grep platform-mgr-tmp-cleanup-cronjob | awk '{print $1}' | xargs -I {} kubectl -n $tcaNamespace delete job {}

  6. If this doesn't free up space on the /logs volume (/dev/mapper/vg_logs-lv_logs), run the command below:

    find /logs -type f \( -size +40M -o -iname "*.tar.gz" \) -delete

  7. Restart the kubelet components to ensure that they pick up the new certificates

    # mkdir -p /home/admin/manifests-bk/

    # mv /etc/kubernetes/manifests/* /home/admin/manifests-bk/

    Note: Wait a maximum of 30 seconds for the kubelet to remove the control plane pod containers. Confirm by checking that the kubectl get pods -A command fails with "connection refused".

    # mv /home/admin/manifests-bk/* /etc/kubernetes/manifests/

    Note: Wait for the control plane pod containers to come back up (maximum wait of 20 seconds). You can check with the command below; once the API server is up, it outputs "ok":

    # kubectl get --raw=/readyz --kubeconfig=/home/admin/.kube/config
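The two timed waits in this step can be scripted rather than counted by hand. A sketch; `wait_for` is a hypothetical helper that polls any command until it succeeds or the timeout elapses:

```shell
# Poll a command every 2 seconds until it succeeds or the timeout (seconds) expires
wait_for() {
    cmd=$1; timeout=$2; waited=0
    until eval "$cmd" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then return 1; fi
        sleep 2
        waited=$((waited + 2))
    done
    return 0
}

# Wait (max 30s) for the control plane to stop: kubectl must FAIL
# wait_for '! kubectl get pods -A' 30
# Wait (max 20s) for the API server to report ready
# wait_for 'kubectl get --raw=/readyz --kubeconfig=/home/admin/.kube/config' 20
```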

  8. Update the kubeconfig secret with the rotated kubeconfig file containing the new certificates

    KUBECONFIG_B64=$(base64 -w 0 /etc/kubernetes/admin.conf)
    kubectl apply --kubeconfig /etc/kubernetes/admin.conf -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: kubeconfig-secret
      namespace: ${tcaNamespace}
    type: Opaque
    data:
      kubeconfig: ${KUBECONFIG_B64}
    EOF
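To confirm the secret now matches the rotated file, the stored value can be decoded and diffed against admin.conf. A sketch assuming tcaNamespace is still set from step 5; `get_stored_kubeconfig` is a hypothetical helper, and no output from diff means the two match:

```shell
# Read and decode the kubeconfig stored in the secret
get_stored_kubeconfig() {
    kubectl get secret kubeconfig-secret -n "$tcaNamespace" \
        -o jsonpath='{.data.kubeconfig}' | base64 -d
}

# No diff output means the secret matches the rotated file:
# get_stored_kubeconfig | diff - /etc/kubernetes/admin.conf
```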

  9. If there are any such pods still in a Pending state, delete them.

    Example of a Pending pod (output of kubectl get pods -A):
    tca-cp-cn                  1234####-####-####-####-a4c1eb2c####             0/1     Pending     0             22m
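Pending pods can also be found with a field selector instead of visual inspection. A sketch; `list_pending` is a hypothetical helper, and its output should be reviewed before anything is deleted:

```shell
# Print "namespace pod" for every Pending pod, filtered server-side
list_pending() {
    kubectl get pods -A --field-selector=status.phase=Pending --no-headers \
        | awk '{print $1, $2}'
}

# Review the list first, then delete each one:
# list_pending
# list_pending | while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done
```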

  10. Ensure all the pods are up and running.