Recover SSPI/SSP setups after certificates used by kubernetes clusters expire

Products

VMware vDefend Firewall VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

After approximately one year of continuous operation, an SSPI/SSP deployment's management and workload Kubernetes clusters can stop functioning because internal certificates - including kubeadm control-plane certificates and kubelet server certificates - expire without being renewed.

Once expired, attempts to interact with either cluster from the SSPI appliance will fail with x509 certificate errors or "no route to host" errors:

Log in to the SSPInstaller via cli using Sysadmin credentials if the SSP version is greater than 5.0; otherwise, use root credentials for SSPI version 5.0 to execute below commands

sspi:~$ kubectl get node
Error: x509: certificate has expired or is not yet valid

sspi:~$ k get node
Error: x509: certificate has expired or is not yet valid

This article explains how to recover from this state using the provided recovery script, and how to install the preventive script to avoid recurrence.

Environment

SSPI/SSP 5.0.0
SSPI/SSP 5.1.0
SSPI/SSP 5.1.1

Cause

Earlier versions of SSPI/SSP did not proactively renew internal Kubernetes certificates for the management or workload clusters. After one year of continuous operation, both clusters may stop functioning due to expired certificates.

Resolution

The recovery process has two phases:

Run the recovery script to restore the management cluster, VC connections, and workload control plane.
Run the preventive script (covered in the Preventive Script KB) to complete the recovery and prevent recurrence.
Important: The recovery is not complete until the preventive script is also installed. This is intentional.

To recover the setup, download the following script recover_ssp_with_expired_k8s_certs.sh to the HOME directory of SSPI root/sysadmin. Execute the script as root. The recovery script performs the following 3 steps to recover the setup to the point where the management cluster, VC connections and workload control plane are back in working state. To fully recover SSPI and SSP, we must also install the preventive script.

Step-by-Step: Running the Recovery Script

Download recover_ssp_with_expired_k8s_certs.sh from this KB article to your local device.
Transfer the script to the /tmp directory on the SSPI appliance.
SSH into the SSPI appliance:
- SSPI 5.1.0 and above: connect as sysadmin. ( and run sudo su to login as root)
- SSPI 5.0.0: connect as root.
Navigate to the /tmp directory.
In a separate terminal, create the file recovery_setup.sh with your environment credentials. Using a separate terminal avoids exposing the password in your session history. Delete this file after use.

vi recovery_setup.sh

##Add below two lines in it and save the file.

export RECOVER_SECOP_FQDN="<SSPI IP or FQDN>"
export RECOVER_SECOP_PASSWORD='<SSPI admin password>'

Make the script executable, inject the required environment variables, and run it.

chmod +x recover_ssp_with_expired_k8s_certs.sh
source recovery_setup.sh
./recover_ssp_with_expired_k8s_certs.sh

Below are the steps performed by the script:

Step 1: Management Cluster Recovery:

This step renews the certificates used by the management Kubernetes cluster and restarts the necessary services. After this step, kubectl get node -A should succeed.

Step 2: VC Connection Verification:

This step verifies that the vCenter (VC) is reachable and that its server certificate has at least 7 days of remaining validity. The script also confirms that the management cluster sees the VC connection as healthy. If any condition is not met, the script exits with an explanation of what needs to be corrected.

Common remediation actions at this stage:

Ensure vCenter is up and reachable.
Replace the VC certificate if it is expired or near expiry.
Re-establish the VC connection via the UI if needed.

Note: When re-establishing the VC connection through the UI, it may initially report a failure with "unable to connect to workload cluster." This is expected — the workload cluster has not been recovered yet. On a subsequent script run, the VC connection should show as HEALTHY, and the script will proceed to Step 3.

Step 3: CP Patchup In-Situ:

In this step, the script SSHs into each of the control plane nodes and patches each node by renewing the certificates and restarting local services. It starts this process on all the control plane nodes.

It then SSHs back into each of these nodes and waits for the node to return to a working state.

When the script exits successfully, the user should be able to perform k get node if the certificate used by the workload kubeconfig file is still valid.

CP nodes will have the same age as before the script was executed.

After the recovery script,

Regardless of the approach, the script should complete with the following message:

  *+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+*

  *  S S P I   |   r e c o v e r _ s e t u p   +   K C P  *

  *  ~~  c e r t i f i c a t e s E x p i r y D a y s   a l r e a d y   s e t   ( n o   p a t c h )  ~~  *

  *+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+**+==+*

  >> * *   n s = nwssp2   |   k c p = nwssp2-controller   * * <<

  // management KUBECONFIG | no rollout wait on this run //

  ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~~

  *=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+*

  *  r e c o v e r _ s e t u p   -   C O M P L E T E  *

  *  local kubeadm renew + KCP rollout (default) or --force SSH CP renew  *

  *=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+=*=+*

  >> recover_setup: all steps finished OK. Next: run cert-hotfix/install_sspi_remediate_and_sync_kubeconfig.sh (cert-hotfix) to complete kubeconfig sync / remediate. <<

  // next: cert-hotfix/install_sspi_remediate_and_sync_kubeconfig.sh (cert-hotfix) //

Completion - Follow up Step:

Once the script completes successfully, the management cluster and the control plane of the workload cluster are recovered and working with new certificates. However, user MUST follow Preventive Script KB to complete the recovery steps and setup the environment to prevent this issue from recurring. The preventive script installation ensures the workload kubeconfig file is updated with a working copy so that the "k" command would work properly. Without installing the preventive script, the workload cluster recovery is not complete. This is intentional to make sure that users don't forget to run this final step.

Cautions:

After successful completion of the above steps, the workload cluster often experiences node rollout. All 4 worker nodes can be deleted and a new set gets created. The control nodes often also experience rolling out even after their certificates have been renewed. This is because during the period of certificate expiry, these nodes have been marked in Cluster-API as either unhealthy workers or certificate-expired control plane nodes. As a result, once they become healthy again, they get rollout due to errors logged during certificate expiry period.

It may take extended period of time for the SSP cluster to become functional again.

Attachments

recover_ssp_with_expired_k8s_certs.sh get_app