TCA Kubelet not functioning correctly after internal certificate rotation

Article ID: 423202


Products

VMware Telco Cloud Automation, VMware Telco Cloud Platform

Issue/Introduction

The following combinations of symptoms indicate that this KB applies:

  • 'Open Terminal' feature does not function correctly.

    • When a user opens a terminal on a Cluster or CNF, the terminal opens but times out after a while without functioning correctly.

  • Pods in Pending state on TCA appliances

    • In relation to the symptom above, any new pods spawned within the TCA appliance end up in a Pending state.

  • The kube-scheduler logs contain scheduling errors and report a large number of "Unauthorized" messages.

    • Run the following command to see the kube-scheduler logs:

      # kubectl logs kube-scheduler-photon -n kube-system

  • "/logs" partition within the TCA appliance is full or is filling up fast

    • Typically, the /logs/retained-logs/kubelet.service folder ends up consuming almost all of the space. Check disk usage with:

      # df -h

  • The kube-apiserver logs state clearly that a certificate has expired.

    • Run the following command to see the kube-apiserver logs:

      # kubectl logs kube-apiserver-photon -n kube-system

    • The following message is logged repeatedly:

      "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time ...

  • Many platform-mgr-tmp-cleanup-cronjob pods are stuck or in an error state.

      # kubectl get pods -A | grep platform-mgr-tmp-cleanup-cronjob
      # kubectl get jobs -A | grep platform-mgr-tmp-cleanup-cronjob
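When the /logs symptom above is present, the largest consumers can be pinpointed before anything is deleted. A minimal sketch; `largest_dirs` is a hypothetical helper, and the depth and count values are arbitrary choices:

```shell
# Show the ten largest directories under a path, biggest last
largest_dirs() {
    du -xh --max-depth=2 "$1" 2>/dev/null | sort -h | tail -n 10
}

# On the appliance (the kubelet.service folder named above is the usual culprit):
# largest_dirs /logs
```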

Environment

TCA: 3.0.x, 3.1.x, 3.2.x, 3.3.0.1

TCP: 3.1, 4.0, 4.0.1, 5.0, 5.0.2

Cause

  • TCA appliances contain an internal Kubernetes cluster whose internal certificates are valid for a maximum of 1 year.
  • These certificates are automatically rotated 60 days prior to expiration. This happens internally, without any notification to the end user.
  • As part of the certificate rotation process, the kubelet service and pods must also be restarted so that they pick up the new certificates. If they are not restarted, they continue to run on the old certificates (which typically remain valid for another 2 months).
  • During this period, the system appears completely functional, with all certificates rotated.

    Note: To check certificate expiration, run:

    # kubeadm certs check-expiration

  • If the pods and the kubelet service were not restarted, the symptoms stated above appear immediately after the original certificates expire (typically 60 days after the certificates were rotated).
  • On investigation, the certificates appear fine and rotated long ago; the failure occurs because the kubelet and the corresponding pods were not restarted correctly to pick up the new certificates.

    Note: The symptoms could also be observed if the certificate rotation itself failed and did not happen.
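The mismatch described above can be spotted by reading the end date of the certificate the kubelet is actually using, not just the rotated files reported by kubeadm. A minimal sketch; `check_cert_expiry` is a hypothetical helper, and GNU `date` (as on the appliance) is assumed:

```shell
# Report whether a certificate file has already expired (GNU date assumed)
check_cert_expiry() {
    end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
    if [ "$(date -d "$end" +%s)" -le "$(date +%s)" ]; then
        echo "EXPIRED: certificate ended on $end"
    else
        echo "OK: certificate valid until $end"
    fi
}

# The certificate the kubelet is actually using (standard appliance path):
# check_cert_expiry /var/lib/kubelet/pki/kubelet-client-current.pem
```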

Resolution

  • Resolved in TCA 3.3.0.1

Workaround:

If the certificates were renewed correctly, follow the steps below:

  1. SSH to the TCA appliance as the admin user
  2. Switch to the root user

    # su

  3. Verify certificate expiry for the appliance cluster control plane and kubelet. If these certificates are expired, KB 382787 must be applied first before proceeding.

    # kubeadm certs check-expiration
    # openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate

  4. Browse to /logs/retained-logs/kubelet.service and delete old logs from there to reclaim space.
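This cleanup can be done in one pass with find. A sketch; `prune_old_logs` is a hypothetical helper, and the 7-day cutoff is only an example, not a TCA recommendation (review what will be deleted first):

```shell
# Delete files older than a cutoff (in days) under a directory
prune_old_logs() {
    find "$1" -type f -mtime "+$2" -delete
}

# Reclaim space from retained kubelet logs (7-day cutoff as an example):
# prune_old_logs /logs/retained-logs/kubelet.service 7
# df -h /logs
```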
  5. Clean up pending platform-mgr-tmp-cleanup-cronjob jobs

    # tcaNamespace=$(kubectl get namespace tca-mgr >/dev/null 2>&1 && echo "tca-mgr" || echo "tca-cp-cn")
    # kubectl get job -n $tcaNamespace | grep platform-mgr-tmp-cleanup-cronjob | awk '{print $1}' | xargs -I {} kubectl -n $tcaNamespace delete job {}

  6. If this doesn't free up space on the /logs volume (/dev/mapper/vg_logs-lv_logs), run the command below:

    find /logs -type f \( -size +40M -o -iname "*.tar.gz" \) -delete

  7. Restart the kubelet components to ensure that they pick up the new certificates

    # mkdir -p /home/admin/manifests-bk/

    # mv /etc/kubernetes/manifests/* /home/admin/manifests-bk/

    Note: Wait a maximum of 30 seconds for the kubelet to remove the control plane pod containers. Confirm by checking that the kubectl get pods -A command fails with "connection refused".

    # mv /home/admin/manifests-bk/* /etc/kubernetes/manifests/

    Note: Wait for the control plane pod containers to come back up (maximum wait of 20 seconds). You can check with the command below; once the API server is up, it outputs "ok":

    # kubectl get --raw=/readyz --kubeconfig=/home/admin/.kube/config
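The two timed waits in this step can be scripted rather than counted by hand. A sketch; `wait_for` is a hypothetical helper that polls any command until it succeeds or the timeout elapses:

```shell
# Poll a command every 2 seconds until it succeeds or the timeout (seconds) expires
wait_for() {
    cmd=$1; timeout=$2; waited=0
    until eval "$cmd" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then return 1; fi
        sleep 2
        waited=$((waited + 2))
    done
    return 0
}

# Wait (max 30s) for the control plane to stop: kubectl must FAIL
# wait_for '! kubectl get pods -A' 30
# Wait (max 20s) for the API server to report ready
# wait_for 'kubectl get --raw=/readyz --kubeconfig=/home/admin/.kube/config' 20
```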

  8. Update the kubeconfig secret with the rotated kubeconfig file containing the new certificates

    KUBECONFIG_B64=$(base64 -w 0 /etc/kubernetes/admin.conf)
    kubectl apply --kubeconfig /etc/kubernetes/admin.conf -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: kubeconfig-secret
      namespace: ${tcaNamespace}
    type: Opaque
    data:
      kubeconfig: ${KUBECONFIG_B64}
    EOF
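To confirm the secret now matches the rotated file, the stored value can be decoded and diffed against admin.conf. A sketch assuming tcaNamespace is still set from step 5; `get_stored_kubeconfig` is a hypothetical helper, and no output from diff means the two match:

```shell
# Read and decode the kubeconfig stored in the secret
get_stored_kubeconfig() {
    kubectl get secret kubeconfig-secret -n "$tcaNamespace" \
        -o jsonpath='{.data.kubeconfig}' | base64 -d
}

# No diff output means the secret matches the rotated file:
# get_stored_kubeconfig | diff - /etc/kubernetes/admin.conf
```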

  9. If there are any such pods still in a Pending state, delete them.

    Example of a Pending pod (output of kubectl get pods -A):
    tca-cp-cn                  1234####-####-####-####-a4c1eb2c####             0/1     Pending     0             22m
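Pending pods can also be found with a field selector instead of visual inspection. A sketch; `list_pending` is a hypothetical helper, and its output should be reviewed before anything is deleted:

```shell
# Print "namespace pod" for every Pending pod, filtered server-side
list_pending() {
    kubectl get pods -A --field-selector=status.phase=Pending --no-headers \
        | awk '{print $1, $2}'
}

# Review the list first, then delete each one:
# list_pending
# list_pending | while read -r ns pod; do kubectl delete pod -n "$ns" "$pod"; done
```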

  10. Ensure all the pods are up and running.