Symptoms:
The Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster is in a failing state and the NCP process status is "Does not exist":
Process 'blackbox' running
Process 'ncp' Does not exist
Process 'bosh-dns' running
You see messages similar to the following in /var/vcap/sys/log/ncp/ncp.stdout.log
2020-08-10T20:07:40.025Z #########-####-####-####-########### NSX 31475 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="WARNING"] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[DOWN] https://<NSX_MANAGER_FQDN>' due to: HTTPSConnectionPool(host='<NSX_MANAGER_FQDN>', port=443): Max retries exceeded with url: /api/v1/operational/application/status (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_read_bytes', 'sslv3 alert certificate unknown')],)",),))
2020-08-10T20:07:40.107Z #########-####-####-####-########### NSX 31475 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="ERROR"] vmware_nsxlib.v3.lib Unable to read tag limits. Reason: Service cluster: 'https://<NSX_MANAGER_FQDN>' is unavailable. Please, check NSX setup and/or configuration
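To confirm this symptom quickly, the log can be searched for the handshake failure quoted above. This is a minimal sketch, assuming the log path shown in this article; the function name ncp_log_has_cert_error is hypothetical and not part of any TKGI tooling.

```shell
# Hypothetical helper: succeed when an NCP log contains the TLS handshake
# failure quoted in the excerpt above. The default path is the one named
# in this article; pass another path as the first argument if needed.
ncp_log_has_cert_error() {
  ncp_log_file="${1:-/var/vcap/sys/log/ncp/ncp.stdout.log}"
  # -q: print nothing, -F: match the text literally rather than as a regex
  grep -qF 'sslv3 alert certificate unknown' "$ncp_log_file"
}

# Example usage on the master node (needs read access to the log):
#   ncp_log_has_cert_error && echo "NCP is failing the TLS handshake with NSX Manager"
```

A match only shows that the handshake is being rejected; the certificate dates still need to be checked as described in the procedure.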
When the tls-nsx-t certificate has expired, Apply Changes in the Ops Manager UI fails while updating the master node, with messages similar to the following:
Task 884651 | 23:29:14 | Updating instance master: master/#########-####-####-####-########### (0) (canary) (00:01:48)
L Error: Action Failed get_task: Task #########-####-####-####-########### result: 1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp, pks-nsx-t-ncp.
You see messages similar to the following in the /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log file on the cluster's master node:
Registering client certificate
#########-####-####-####-###########
Registration of client certificate is successful
Checking if client certificate is ready to be used
timeout: client certificate is not working after 60 seconds
Note: The preceding log excerpts are only examples. Dates, times, and environment-specific values vary depending on your environment.
Note: The method below applies to older, unsupported TKGI versions. From TKGI 1.10 onward, use the tkgi rotate-certificate CLI instead, as described in: Tanzu Kubernetes Grid Integrated Edition Certificates
Identify the tls-nsx-t certificate that needs to be rotated and match it with the corresponding certificate in NSX-T Manager:
export BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=<bosh-client-secret> BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=<Bosh-ip-address>
export CREDHUB_CLIENT=$BOSH_CLIENT CREDHUB_SECRET=$BOSH_CLIENT_SECRET && credhub api --server $BOSH_ENVIRONMENT:8844 --ca-cert $BOSH_CA_CERT && credhub login
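Because the credhub login above reuses the BOSH variables, an unset value tends to surface as a confusing authentication or TLS error. A minimal pre-flight sketch, assuming only the four variable names used in the export command above; the function name require_bosh_env is hypothetical.

```shell
# Hypothetical pre-flight check: fail early if any variable used by the
# export and credhub login commands above is unset or empty.
require_bosh_env() {
  missing_vars=""
  for var_name in BOSH_CLIENT BOSH_CLIENT_SECRET BOSH_CA_CERT BOSH_ENVIRONMENT; do
    # eval reads the variable whose name is stored in $var_name
    eval "var_value=\${$var_name:-}"
    [ -n "$var_value" ] || missing_vars="$missing_vars $var_name"
  done
  if [ -n "$missing_vars" ]; then
    echo "missing:$missing_vars" >&2
    return 1
  fi
}

# Example usage before logging in to CredHub:
#   require_bosh_env && credhub login
```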
bosh deployments --column=name | grep service-instance
service-instance_c#########-####-####-####-###########
credhub get -n /p-bosh/<service-instance_UUID>/tls-nsx-t --output-json | jq '.value | .certificate' -r | openssl x509 -startdate -enddate -noout
notBefore=Jul 23 22:19:50 2018 GMT
notAfter=Jul 23 22:19:50 2020 GMT >>>>> The certificate is expired
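Reading the notAfter date by eye is easy to get wrong across time zones. The sketch below compares it with the current time; it assumes GNU date (for the -d option), and the function name cert_is_expired is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) when the notAfter= line printed by
# `openssl x509 -enddate -noout` lies in the past. Assumes GNU date (-d).
cert_is_expired() {
  cert_end_date="${1#notAfter=}"                     # strip the "notAfter=" prefix
  cert_end_epoch=$(date -u -d "$cert_end_date" +%s) || return 2
  now_epoch=$(date -u +%s)
  [ "$cert_end_epoch" -le "$now_epoch" ]
}

# Example usage with the value from the output above:
#   cert_is_expired "notAfter=Jul 23 22:19:50 2020 GMT" && echo "expired"
```

Alternatively, openssl x509 -checkend 0 -noout performs the same comparison directly on the certificate and exits non-zero when it has expired.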
In NSX-T Manager, look for the matching certificate named pks-<cluster-UUID>
mkdir cluster-<The_Last_Numbers_of_Cluster-UUID>
cd cluster-<The_Last_Numbers_of_Cluster-UUID>
credhub get -n /p-bosh/<service-instance_UUID>/tls-nsx-t --output-json | jq '.value | .certificate' -r > old-tls-nsx-t.crt
credhub regenerate -n /p-bosh/<service-instance_UUID>/tls-nsx-t
id: ######-####-####-####-##########
name: /p-bosh/service-instance_######-####-####-####-##########/tls-nsx-t
type: certificate
value: <redacted>
version_created_at: "2020-08-10T19:20:50Z"
credhub get -n /p-bosh/<service-instance_UUID>/tls-nsx-t --output-json | jq '.value | .certificate' -r | openssl x509 -startdate -enddate -noout
notBefore=Aug 10 19:20:50 2020 GMT
notAfter=Aug 10 19:20:50 2022 GMT >>>>>> The new certificate expires in 2 years
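Since the old certificate was saved to old-tls-nsx-t.crt before regeneration, a byte comparison can confirm that CredHub really issued a new one. This is a minimal sketch; the function name cert_was_regenerated and the file name new-tls-nsx-t.crt are hypothetical.

```shell
# Hypothetical helper: succeed when the regenerated certificate differs
# from the saved copy. Returns 2 if either file is missing or empty.
cert_was_regenerated() {
  old_cert_file="$1"
  new_cert_file="$2"
  [ -s "$old_cert_file" ] && [ -s "$new_cert_file" ] || return 2
  # cmp -s exits 0 when the files are identical, so negate it
  ! cmp -s "$old_cert_file" "$new_cert_file"
}

# Example usage (new-tls-nsx-t.crt is an assumed file name):
#   credhub get -n /p-bosh/<service-instance_UUID>/tls-nsx-t --output-json | jq '.value | .certificate' -r > new-tls-nsx-t.crt
#   cert_was_regenerated old-tls-nsx-t.crt new-tls-nsx-t.crt && echo "new certificate issued"
```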
Register the certificate with NSX-T and push it to Kubernetes VMs to replace the old certificate:
You can use the pksnsxcli tool present on the Kubernetes master nodes to register the tls-nsx-t certificate with NSX-T Manager. The advantage of this procedure is that it creates a locked certificate object that cannot be deleted through the UI.
bosh ssh -d service-instance_<cluster-uuid> master
sudo su -
alias pksnsxcli=/var/vcap/packages/pks-nsx-t-cli/bin/pksnsxcli
pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> --username <username> --password <password> --insecure
pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> --username <username> --password <password> --insecure --api-type Policy
curl -X DELETE -sku 'admin:<password>' "https://<nsx manager>/policy/api/v1/infra/certificates/<policy-cert-id>" --header "X-Allow-Overwrite: true"
Issue commands similar to the following to redeploy the cluster from its BOSH manifest, which registers this certificate with NSX-T and pushes it to the Kubernetes VMs:
bosh manifest -d service-instance_instance-id > service-instance_instance-id.yml
bosh deploy -d service-instance_instance-id service-instance_instance-id.yml
Note: This activity restarts only master VMs in order to update the tls-nsx-t certificate. There is no impact on worker nodes.
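The two bosh commands above can be wrapped so that a failed or empty manifest download never reaches the deploy step. This is a sketch under that assumption only; the function name redeploy_service_instance is hypothetical, and it uses no bosh options beyond those shown in this article.

```shell
# Hypothetical wrapper around the two bosh commands above: it refuses to
# deploy if the manifest download failed or produced an empty file.
redeploy_service_instance() {
  deployment_name="$1"                      # e.g. service-instance_<instance-id>
  manifest_file="${deployment_name}.yml"
  bosh manifest -d "$deployment_name" > "$manifest_file" || return 1
  if [ ! -s "$manifest_file" ]; then
    echo "empty manifest for $deployment_name, not deploying" >&2
    return 1
  fi
  bosh deploy -d "$deployment_name" "$manifest_file"
}

# Example usage:
#   redeploy_service_instance service-instance_<instance-id>
```

The deploy step stays interactive, so the proposed changes can still be reviewed before they are confirmed.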