Recovering TCA from NodeConfig-Operator Webhook Service Certificate Expiration
search cancel

Recovering TCA from NodeConfig-Operator Webhook Service Certificate Expiration

book

Article ID: 325397

calendar_today

Updated On:

Products

VMware VMware Telco Cloud Automation

Issue/Introduction


Provide a workaround to update the certificate and restore all operations.

Symptoms:


Operations like cluster creation, node pool creation and node customizations will fail.
 
To check if the TKG cluster currently has an expired certificate, SSH into the TKG management cluster master node as the capv user, and download the diagnostic script :

ssh capv@[TkgMgmtIP]
curl -kfsSL 'https://vmwaresaas.jfrog.io/artifactory/generic-registry/run-diagnosis' -o run-diagnosis.sh

 
Set the correct permissions on the script after necessary verification of the downloaded script

chmod +x run-diagnosis.sh 

 
Execute the script from node:

./run-diagnosis.sh
 

Check the HTML test reports for detailed results and tests information.

In addition, the bootstrap logs will indicate an expired certificate:

 Sep 2 18:07:29 apiserverd[7521] : [Err-controller] : Failed to handle plugins by NodeConfig CR in cluster 180f5152-064b-49f2-b207-4dfa09c8a9e1, err: Error from server (InternalError): error when creating "/opt/vmware/k8s-bootstrapper/180f5152-064b-49f2-b207-4dfa09c8a9e1/np_addon-nodeprofile-mgmt-cl1.yaml" Internal error occurred: failed calling webhook "validator.nodeconfig.acm.vmware.com": Post "https://nodeconfigvalidator.tca-system.svc:443/validate-nodeconfig?timeout=5s": x509: certificate has expired or is not yet valid: current time 2021-09-02T18:07:18Z is after 2021-09-02T07:37:00Z


Environment

VMware Telco Cloud Automation 1.9
VMware Telco Cloud Automation 1.8

Cause


This scenario happens when the webhook service certificate has expired for a cluster where nodeconfig-operator is enabled. As of September 2nd, the certificate in all nodeconfig operators in both management and workload clusters has expired. Operations like cluster creation, node pool creation and node customizations will fail due to this issue.

Resolution


1. SSH into the TCA-CP appliance and switch to root user.
 
2. Create a temporary directory and change directory to it. Run steps 3 and 4 within this directory. 
 
3. Download the script for updating certs by running the following command: 
4. Verify that md5sum for downloaded script is correct.
[root@tca-cp-c101 ~/certs]# md5sum update_cert.sh
9e8ef63e156124d2cc98354bc67bc5cb  update_cert.sh

5. Set the correct permissions on the script after necessary verification of the downloaded script:
chmod +x update_cert.sh 
 
6. Execute the script from node:
./update_cert.sh
 
You will see output stating that certificates have changed for every management cluster and workload cluster provisioned via this TCA-CP.  If the clusters are not in a healthy state, changing the certificates may fail.
current cluster is cluster: mgmt01
secret/nodeconfig-certs changed
update cluster 1e0c2114-6d71-497e-b478-a253b12e45b1 succeed
/opt/vmware/k8s-bootstrapper/e8c910f5-c650-4669-895f-ae27cac0265d/kubeconfig exist
current cluster is cluster: wrk01
secret/nodeconfig-certs changed
update cluster e8c910f5-c650-4669-895f-ae27cac0265d succeed
 
Copying the new secret may fail for clusters with an API endpoint in an unreachable state. This is expected and not to be considered as a failure in applying cert extension. The above script will proceed with the next cluster and copy nodeconfig secret to all operational clusters.
 
7. Repeat on all TCA-CP appliances.

Affected Versions: TCA 1.8,TCA 1.9
Supported Versions : TCA 1.8 & TCA 1.9