Recovering TCA from NodeConfig-Operator Webhook Service Certificate Expiration

search cancel

Recovering TCA from NodeConfig-Operator Webhook Service Certificate Expiration

book

Article ID: 325397

calendar_today

Updated On:

Products

VMware Telco Cloud Automation

Issue/Introduction

Operations like cluster creation, node pool creation and node customizations will fail.

Validate TKG cluster certificates

SSH into the TKG management cluster master node as the capv user
Download the attached run-diagnosis.sh diagnostic script.
Set the correct permissions on the script after necessary verification of the downloaded script
chmod +x run-diagnosis.sh
Execute the script from node:
./run-diagnosis.sh

Check the HTML test reports for detailed results and tests information.

Also see bootstrap log expired certificate error:

 Sep  2 18:07:29 apiserverd[7521] : [Err-controller] : Failed to handle plugins by NodeConfig CR in cluster 180f5152-064b-49f2-b207-4dfa09c8a9e1, err: Error from server (InternalError): error when creating "/opt/vmware/k8s-bootstrapper/180f5152-064b-49f2-b207-4dfa09c8a9e1/np_addon-nodeprofile-mgmt-cl1.yaml" Internal error occurred: failed calling webhook "validator.nodeconfig.acm.vmware.com": Post "https://nodeconfigvalidator.tca-system.svc:443/validate-nodeconfig?timeout=5s": x509: certificate has expired or is not yet valid: current time 2021-09-02T18:07:18Z is after 2021-09-02T07:37:00Z

Environment

1.x

Cause

This scenario happens when the webhook service certificate has expired for a cluster where nodeconfig-operator is enabled. As of September 2nd, the certificate in all nodeconfig operators in both management and workload clusters has expired. Operations like cluster creation, node pool creation and node customizations will fail due to this issue.

Resolution

1. SSH into the TCA-CP appliance and switch to root user.

2. Create a temporary directory and change directory to it. Run steps 3 and 4 within this directory.

3. Download the attached update_cert.sh script.

4. Verify that md5sum for downloaded script is correct.

# md5sum update_cert.sh
9e8ef63e156124d2cc98354bc67bc5cb update_cert.sh

5. Set the correct permissions on the script after necessary verification of the downloaded script:

chmod +x update_cert.sh

6. Execute the script from node:
./update_cert.sh

You will see output stating that certificates have changed for every management cluster and workload cluster provisioned via this TCA-CP. If the clusters are not in a healthy state, changing the certificates may fail.

current cluster is cluster: mgmt01
secret/nodeconfig-certs changed
update cluster 1e0c2114-6d71-497e-b478-a253b12e45b1 succeed
/opt/vmware/k8s-bootstrapper/e8c910f5-c650-4669-895f-ae27cac0265d/kubeconfig exist
current cluster is cluster: wrk01
secret/nodeconfig-certs changed
update cluster e8c910f5-c650-4669-895f-ae27cac0265d succeed

Copying the new secret may fail for clusters with an API endpoint in an unreachable state. This is expected and not to be considered as a failure in applying cert extension.
The above script will proceed with the next cluster and copy nodeconfig secret to all operational clusters.

7. Repeat on all TCA-CP appliances.

Attachments

update_cert.sh get_app

run-diagnosis get_app

Feedback

thumb_up Yes

thumb_down No