Symptoms:
The Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster is in a failing state, and the NCP process status is "Does not exist":
Process 'blackbox' running
Process 'ncp' Does not exist
Process 'bosh-dns' running
You see messages similar to the following in /var/vcap/sys/log/ncp/ncp.stdout.log:
2020-08-10T20:07:40.025Z #########-####-####-####-########### NSX 31475 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="WARNING"] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[DOWN] https://<NSX_MANAGER_FQDN>' due to: HTTPSConnectionPool(host='<NSX_MANAGER_FQDN>', port=443): Max retries exceeded with url: /api/v1/operational/application/status (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_read_bytes', 'sslv3 alert certificate unknown')],)",),))
2020-08-10T20:07:40.107Z #########-####-####-####-########### NSX 31475 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="ERROR"] vmware_nsxlib.v3.lib Unable to read tag limits. Reason: Service cluster: 'https://<NSX_MANAGER_FQDN>' is unavailable. Please, check NSX setup and/or configuration
When the tls-nsx-t certificate has expired, Apply Changes in the Ops Manager UI fails while updating the master node, with messages similar to the following:
Task 884651 | 23:29:14 | Updating instance master: master/#########-####-####-####-########### (0) (canary) (00:01:48)
L Error: Action Failed get_task: Task #########-####-####-####-########### result: 1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp, pks-nsx-t-ncp.
You see messages similar to the following in the /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log file on the cluster's master node:
Registering client certificate
#########-####-####-####-###########
Registration of client certificate is successful
Checking if client certificate is ready to be used
timeout: client certificate is not working after 60 seconds
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
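Both log symptoms can be checked in one pass on a master node. The following is a minimal sketch, not an official tool: the function name tkgi_cert_symptoms is hypothetical, and the match strings are taken from the example excerpts above, so they may need adjusting for your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the example symptom strings appear in the
# NCP and pre-start logs. Paths default to the locations named in this article.
tkgi_cert_symptoms() {
  local ncp_log="${1:-/var/vcap/sys/log/ncp/ncp.stdout.log}"
  local prestart_log="${2:-/var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log}"
  local found=1   # return 1 unless at least one symptom matches

  # NCP cannot complete the TLS handshake with NSX Manager.
  if grep -q 'alert certificate unknown' "$ncp_log" 2>/dev/null; then
    echo "ncp: SSL handshake rejected (certificate unknown)"
    found=0
  fi
  # The pre-start script registered the certificate but could not use it.
  if grep -q 'client certificate is not working' "$prestart_log" 2>/dev/null; then
    echo "pre-start: client certificate is not working"
    found=0
  fi
  return "$found"
}
```

Run it as root on the master node with no arguments to use the default paths. A return status of 0 means at least one of the two symptoms was found.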
NOTE: The method below is for older, unsupported TKGI versions. From TKGI 1.10 onward, use the tkgi rotate-certificate CLI as described in Tanzu Kubernetes Grid Integrated Edition Certificates. If the tkgi CLI fails to rotate the certificates, the solution below is also acceptable for newer versions where the documentation notes it (for example, for NSX certificate rotation).
Identify the tls-nsx-t certificate that needs to be rotated and match it with the certificate in NSX-T Manager:
export BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=################### BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=<Bosh-ip-address>
export CREDHUB_CLIENT=$BOSH_CLIENT CREDHUB_SECRET=$BOSH_CLIENT_SECRET && credhub api --server $BOSH_ENVIRONMENT:8844 --ca-cert $BOSH_CA_CERT && credhub login
# bosh deployments --column=name | grep service-instance
service-instance_c#########-####-####-####-###########
credhub get -n /p-bosh/service-instance_UUID/tls-nsx-t --output-json | jq '.value | .certificate' -r | openssl x509 -startdate -enddate -noout
notBefore=Jul 23 22:19:50 2018 GMT
notAfter=Jul 23 22:19:50 2020 GMT >>>>> The certificate is expired
In NSX-T Manager, the matching certificate is named pks-<cluster-UUID>.
mkdir cluster-<The_Last_Numbers_of_Cluster-UUID>
cd cluster-<The_Last_Numbers_of_Cluster-UUID>
credhub get -n /p-bosh/<service-instance_UUID>/tls-nsx-t --output-json | jq '.value | .certificate' -r > old-tls-nsx-t.crt
credhub regenerate -n /p-bosh/<service-instance_UUID>/tls-nsx-t
id: ######-####-####-####-##########
name: /p-bosh/service-instance_######-####-####-####-##########/tls-nsx-t
type: certificate
value: <redacted>
version_created_at: "2020-08-10T19:20:50Z"
credhub get -n /p-bosh/service-instance_UUID/tls-nsx-t --output-json | jq '.value | .certificate' -r | openssl x509 -startdate -enddate -noout
notBefore=Aug 10 19:20:50 2020 GMT
notAfter=Aug 10 19:20:50 2022 GMT >>>>>> The certificate now expires in 2 years
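Reading the notAfter date by eye is error-prone, so the expiry check can be scripted. The following is a minimal sketch under stated assumptions: cert_status is a hypothetical helper name, it expects the output of openssl x509 -enddate -noout on standard input, and it relies on GNU date (the -d option) to parse the timestamp.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `openssl x509 -enddate -noout` output on stdin and
# report whether the certificate has expired. Assumes GNU date (-d option).
cert_status() {
  local now="${1:-$(date -u +%s)}"   # optional epoch override, useful for testing
  local line not_after end
  line="$(grep '^notAfter=' | head -n 1)"
  [ -n "$line" ] || { echo "unknown"; return 2; }
  not_after="${line#notAfter=}"
  end="$(date -u -d "$not_after" +%s 2>/dev/null)" || { echo "unknown"; return 2; }
  if [ "$end" -le "$now" ]; then
    echo "expired"
  else
    echo "valid for $(( (end - now) / 86400 )) more days"
  fi
}
```

For example, append "| cert_status" to the credhub get ... | openssl x509 -enddate -noout pipeline shown above. The helper prints "expired", a remaining-days message, or "unknown" if no notAfter line could be parsed.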
Register the certificate with NSX-T and push it to Kubernetes VMs to replace the old certificate:
You can leverage pksnsxcli present on the Kubernetes master nodes to register the tls-nsx-t certificate with NSX-T Manager. The advantage of following this procedure is that it creates a locked certificate object which can’t be deleted via the UI.
bosh ssh -d service-instance_<cluster-uuid> master
sudo su -
alias pksnsxcli=/var/vcap/packages/pks-nsx-t-cli/bin/pksnsxcli
pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> --username <username> --password <password> --insecure
or
pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> -c "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.crt" -k "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.key" --nsx-ca-cert-path="/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_ca.crt" --insecure
pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> --username <username> --password <password> --insecure --api-type Policy
curl -X DELETE -sku 'admin:<password>' "https://<nsx manager>/policy/api/v1/infra/certificates/<policy-cert-id>" --header "X-Allow-Overwrite: true"
Issue commands similar to the following to redeploy the cluster from its BOSH manifest. This registers the new certificate with NSX-T and pushes it to the Kubernetes VMs:
bosh manifest -d service-instance_instance-id > service-instance_instance-id.yml
bosh deploy -d service-instance_instance-id service-instance_instance-id.yml
Note: This activity restarts only master VMs in order to update the tls-nsx-t certificate. There is no impact on worker nodes.
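The manifest download and redeploy differ only in the deployment name, so a small wrapper can reduce typing mistakes. This is a sketch, not part of any TKGI tooling: redeploy_cluster and the RUN variable are hypothetical names, and the deployment is assumed to be called service-instance_<cluster-uuid>, as in the bosh ssh command earlier in this article.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the manifest download and redeploy commands.
# Prints the commands by default; set RUN=1 to execute them.
redeploy_cluster() {
  local uuid="$1"
  [ -n "$uuid" ] || { echo "usage: redeploy_cluster <cluster-uuid>" >&2; return 2; }
  local dep="service-instance_${uuid}"
  local manifest="${dep}.yml"
  if [ "${RUN:-0}" = "1" ]; then
    # Save the current manifest, then redeploy from it.
    bosh manifest -d "$dep" > "$manifest" && bosh deploy -d "$dep" "$manifest"
  else
    echo "bosh manifest -d $dep > $manifest"
    echo "bosh deploy -d $dep $manifest"
  fi
}
```

Calling redeploy_cluster <cluster-uuid> first, without RUN=1, lets you review the exact commands before anything is changed.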