Recover SSPI/SSP setups after certificates used by kubernetes clusters expire

Article ID: 439459

Products

  • VMware vDefend Firewall
  • VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

After approximately one year of continuous operation, the management and workload Kubernetes clusters of an SSPI/SSP deployment can stop functioning because internal certificates, including the kubeadm control-plane certificates and kubelet server certificates, expire without being renewed.

Once the certificates have expired, attempts to interact with either cluster from the SSPI appliance fail with x509 certificate errors or "no route to host" errors.

To confirm the issue, log in to the SSPInstaller CLI (with sysadmin credentials on versions later than 5.0, or with root credentials on SSPI 5.0) and run the following commands:

sspi:~$ kubectl get node
Error: x509: certificate has expired or is not yet valid

sspi:~$ k get node
Error: x509: certificate has expired or is not yet valid

This article explains how to recover from this state using the provided recovery script, and how to install the preventive script to avoid recurrence.

Environment

  • SSPI/SSP 5.0.0
  • SSPI/SSP 5.1.0
  • SSPI/SSP 5.1.1

Cause

The SSPI/SSP versions listed above do not proactively renew the internal Kubernetes certificates of the management or workload clusters. After approximately one year of continuous operation, both clusters may stop functioning because these certificates have expired.

Resolution

The recovery process has two phases:

  1. Run the recovery script to restore the management cluster, VC connections, and workload control plane.
  2. Run the preventive script (covered in the Preventive Script KB) to complete the recovery and prevent recurrence.
    Important: The recovery is not complete until the preventive script is also installed. This is intentional.

To recover the setup, download the attached script recover_ssp_with_expired_k8s_certs.sh and run it as root on the SSPI appliance, as described in the step-by-step section below. The script performs three steps that restore the management cluster, the VC connections, and the workload control plane to a working state. To fully recover SSPI and SSP, you must also install the preventive script.

 

Step-by-Step: Running the Recovery Script

  1. Download recover_ssp_with_expired_k8s_certs.sh from this KB article to your local device.
  2. Transfer the script to the /tmp directory on the SSPI appliance.
  3. SSH into the SSPI appliance:
    • SSPI 5.1.0 and above: connect as sysadmin, then run sudo su to become root.
    • SSPI 5.0.0: connect as root.
  4. Navigate to the /tmp directory.
  5. In a separate terminal, create the file recovery_setup.sh containing your environment credentials. Using a separate terminal avoids exposing the password in your session history. Delete this file after use.

    vi recovery_setup.sh

    Add the following two lines and save the file:

    export RECOVER_SECOP_FQDN="<SSPI IP or FQDN>"
    export RECOVER_SECOP_PASSWORD='<SSPI admin password>'

  6. Make the script executable, load the environment variables into the current shell, and run the script:

    chmod +x recover_ssp_with_expired_k8s_certs.sh
    source recovery_setup.sh
    ./recover_ssp_with_expired_k8s_certs.sh
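The instruction to delete the credentials file after use can be sketched as a small helper. This is only an illustration: the function name cleanup_recovery_env is hypothetical and not part of the recovery script, and it assumes the variables were sourced into the current shell.

```shell
# Hypothetical helper (not part of the KB script): delete the credentials
# file and clear the exported variables from the current shell.
cleanup_recovery_env() {
    local setup_file="${1:-recovery_setup.sh}"
    rm -f -- "$setup_file"
    unset RECOVER_SECOP_FQDN RECOVER_SECOP_PASSWORD
}
```

Calling cleanup_recovery_env after the recovery script finishes removes recovery_setup.sh from the current directory. Restricting the file to its owner beforehand (chmod 600 recovery_setup.sh) is a reasonable extra precaution.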

 

Below are the steps performed by the script:

Step 1: Management Cluster Recovery:

This step renews the certificates used by the management Kubernetes cluster and restarts the necessary services. After this step, kubectl get node -A should succeed.
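To double-check the renewal independently of the script, the expiry dates of the certificate files can be listed with openssl. The sketch below assumes the certificates live in /etc/kubernetes/pki, which is the kubeadm default; the actual location on the SSPI appliance may differ, and list_cert_expiry is an illustrative name.

```shell
# Print the expiry date (notAfter) of every .crt file in a directory.
# The default path is the kubeadm default and is an assumption here.
list_cert_expiry() {
    local pki_dir="${1:-/etc/kubernetes/pki}" cert
    for cert in "$pki_dir"/*.crt; do
        [ -e "$cert" ] || continue   # skip when the glob matched nothing
        printf '%s: ' "$cert"
        openssl x509 -noout -enddate -in "$cert"
    done
}
```

Running list_cert_expiry as root prints one line per certificate. On control planes managed by kubeadm, kubeadm certs check-expiration reports similar information, if that command is available on the node.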

Step 2: VC Connection Verification:

This step verifies that the vCenter (VC) is reachable and that its server certificate has at least 7 days of remaining validity. The script also confirms that the management cluster sees the VC connection as healthy. If any condition is not met, the script exits with an explanation of what needs to be corrected.

Common remediation actions at this stage:

  • Ensure vCenter is up and reachable.
  • Replace the VC certificate if it is expired or near expiry.
  • Re-establish the VC connection via the UI if needed.

Note: When re-establishing the VC connection through the UI, it may initially report a failure with "unable to connect to workload cluster." This is expected because the workload cluster has not been recovered yet. On a subsequent script run, the VC connection should show as HEALTHY, and the script will proceed to Step 3.
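The 7-day validity requirement on the vCenter certificate can be checked by hand before re-running the script. The following is a rough sketch, not the script's own implementation: the function names are illustrative, and port 443 is assumed for vCenter.

```shell
# Succeeds if the PEM certificate on stdin stays valid for at least N days.
pem_valid_for_days() {
    local days="$1"
    openssl x509 -noout -checkend "$(( days * 86400 ))" >/dev/null
}

# Fetch the leaf certificate a TLS server presents (port 443 assumed).
fetch_server_cert() {
    local host="$1" port="${2:-443}"
    openssl s_client -connect "${host}:${port}" </dev/null 2>/dev/null \
        | openssl x509 -outform PEM
}
```

For example, fetch_server_cert "<vCenter IP or FQDN>" | pem_valid_for_days 7 && echo "VC certificate OK" prints the message only when the certificate has at least 7 days left. An unreachable vCenter also makes the check fail, because no certificate is returned.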

Step 3: CP Patchup In-Situ:

In this step, the script connects over SSH to each workload control plane (CP) node and patches it in place by renewing its certificates and restarting local services. The script starts this process on every control plane node.

It then reconnects to each node over SSH and waits for the node to return to a working state.

When the script exits successfully, k get node should succeed, provided the certificate used by the workload kubeconfig file is still valid.

Because the nodes are patched in place, the CP nodes keep the same age they had before the script was executed.
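If k get node still fails after a successful run, the client certificate in the workload kubeconfig file may itself have expired. A minimal way to inspect it, assuming the certificate is embedded inline as client-certificate-data rather than referenced by file path (the function name is illustrative, and the kubeconfig location is not specified in this article):

```shell
# Print the expiry date of the client certificate embedded in a kubeconfig.
# Assumes an inline client-certificate-data field (base64-encoded PEM).
kubeconfig_cert_enddate() {
    local kubeconfig="$1"
    awk '$1 == "client-certificate-data:" {print $2; exit}' "$kubeconfig" \
        | base64 -d \
        | openssl x509 -noout -enddate
}
```

Invoke it as kubeconfig_cert_enddate "<path to workload kubeconfig>" and compare the reported notAfter date with the current date.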

 

Attachments

recover_ssp_with_expired_k8s_certs.sh