How to rotate Tanzu Kubernetes Grid Integrated Edition tls-nsx-t cluster certificate

Article ID: 330615


Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

Symptoms:
  • The Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster is in a failing state and the status of the NCP process is "Does not exist":
Process 'blackbox'                  running
Process 'ncp'                       Does not exist
Process 'bosh-dns'                  running

 
  • You see messages similar to the following in /var/vcap/sys/log/ncp/ncp.stdout.log
 2020-08-10T20:07:40.025Z d732fce2-1443-4726-9c03-5ac2ccc68bc9 NSX 31475 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="WARNING"] vmware_nsxlib.v3.cluster Failed to validate API cluster endpoint '[DOWN] https://nsxmanager.corp.local' due to: HTTPSConnectionPool(host='nsxmanager.corp.local', port=443): Max retries exceeded with url: /api/v1/operational/application/status (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_read_bytes', 'sslv3 alert certificate unknown')],)",),))
1 2020-08-10T20:07:40.107Z d732fce2-1443-4726-9c03-5ac2ccc68bc9 NSX 31475 - [nsx@6876 comp="nsx-container-ncp" subcomp="ncp" level="ERROR"] vmware_nsxlib.v3.lib Unable to read tag limits. Reason: Service cluster: 'https://nsxmanager.corp.local' is unavailable. Please, check NSX setup and/or configuration

 
  • When the tls-nsx-t certificate has expired, Apply Changes from the Ops Manager UI fails while updating the master node, with messages similar to the following:
Task 884651 | 23:29:14 | Updating instance master: master/37998116-88cd-4724-ae39-d2d6fd20c9da (0) (canary) (00:01:48)
L Error: Action Failed get_task: Task 038235af-36ce-4455-59b2-c70bbece4e17 result: 1 of 7 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: etcd, bpm, bosh-dns, syslog_forwarder, ncp, pks-nsx-t-ncp.
  • You see messages similar to the following in the /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log file on the cluster's master node:
Registering client certificate
9edc5054-376c-466d-ba30-dd6c80d014f8
Registration of client certificate is successful
Checking if client certificate is ready to be used
timeout: client certificate is not working after 60 seconds

 
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
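
The log signatures listed above can be checked for mechanically. The following is a minimal sketch, not part of the official procedure. The function name scan_nsx_cert_symptoms is hypothetical, and it only searches for the two message fragments quoted in this article, so it assumes the wording in your logs matches these examples.

```shell
#!/usr/bin/env bash
# Sketch: look for the certificate-failure signatures quoted in this article.
# scan_nsx_cert_symptoms is a hypothetical helper name, not a TKGI tool.
# Usage: scan_nsx_cert_symptoms <ncp.stdout.log> <pre-start.stdout.log>
scan_nsx_cert_symptoms() {
  local ncp_log="$1" prestart_log="$2" status=1

  # SSL alert logged by NCP (see the ncp.stdout.log excerpt).
  if [ -r "$ncp_log" ] && grep -q "sslv3 alert certificate unknown" "$ncp_log"; then
    echo "ncp log: SSL certificate alert found"
    status=0
  fi

  # Timeout logged by the pks-nsx-t-prepare-master-vm pre-start script.
  if [ -r "$prestart_log" ] && grep -q "client certificate is not working" "$prestart_log"; then
    echo "pre-start log: client certificate timeout found"
    status=0
  fi

  # 0 if at least one signature matched, 1 otherwise.
  return "$status"
}
```

On a master node this could be called as scan_nsx_cert_symptoms /var/vcap/sys/log/ncp/ncp.stdout.log /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log. A match only suggests an expired or unknown certificate; confirm the expiration date as described under Resolution.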


Environment

VMware PKS 1.x

Resolution

NOTE: The method below applies to old, unsupported TKGI versions. From TKGI 1.10 onward, use the tkgi rotate-certificate CLI as described in: Tanzu Kubernetes Grid Integrated Edition Certificates

Identify the tls-nsx-t certificate that needs to be rotated and match with it in NSX-T Manager:

  1. Run commands similar to the following to set the environment variables that the CredHub CLI needs:

    export BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=gu_v9OiwFmDjDnrQ-9Kpwca121lTYzxx BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=<Bosh-ip-address>

    export CREDHUB_CLIENT=$BOSH_CLIENT CREDHUB_SECRET=$BOSH_CLIENT_SECRET && credhub api --server $BOSH_ENVIRONMENT:8844 --ca-cert $BOSH_CA_CERT && credhub login

  2. Run the following command to get the Service Deployment UUID

    bosh deployments --column=name | grep service-instance

    Note: You will see output similar to the following:

    service-instance_c18dd0e4-00b6-4054-9f86-b1a82aebac1e

  3. Run a command similar to the following to verify the validity dates of the tls-nsx-t cluster certificate:

    credhub get -n /p-bosh/service-instance_UUID/tls-nsx-t --output-json | jq '.value | .certificate' -r | openssl x509 -startdate -enddate -noout

    Note: Replace the UUID in the command with the cluster UUID. You will see output similar to the following:

    notBefore=Jul 23 22:19:50 2018 GMT
    notAfter=Jul 23 22:19:50 2020 GMT    >>>>> Certificate is expired

  4. Verify the tls-nsx-t certificate in NSX-T Manager by going to System > Certificates and looking for the certificate named pks-<cluster-UUID>.

  5. Run a command similar to the following to create a cluster directory:

    mkdir cluster-<The_Last_Numbers_of_Cluster-UUID>

  6. Issue a command similar to the following to navigate to the cluster directory

    cd cluster-b1a82aebac1e

  7. Issue a command similar to the following to back up the old cluster tls-nsx-t certificate:

    credhub get -n /p-bosh/<service-instance_UUID>/tls-nsx-t --output-json | jq '.value | .certificate' -r > old-tls-nsx-t.crt

  8. Issue a command similar to the following to generate the new certificate:

    credhub r -n /p-bosh/<service-instance_UUID>/tls-nsx-t

    Note: You will see output similar to the following:

    id: c4bf1cdb-af06-4a88-b25e-4f3123c939db
    name: /p-bosh/service-instance_c18dd0e4-00b6-4054-9f86-b1a82aebac1e/tls-nsx-t
    type: certificate
    value: <redacted>
    version_created_at: "2020-08-10T19:20:50Z"

  9. Issue a command similar to the following to validate that the certificate was created:

    credhub get -n /p-bosh/service-instance_UUID/tls-nsx-t --output-json | jq '.value | .certificate' -r | openssl x509 -startdate -enddate -noout

    Note: You will see output similar to the following:

    notBefore=Aug 10 19:20:50 2020 GMT
    notAfter=Aug 10 19:20:50 2022 GMT    >>>>> The certificate expires in 2 years
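
The CredHub path and the working directory in the steps above are both derived from the BOSH deployment name. The following is a minimal sketch of that derivation using plain shell parameter expansion. The function names are hypothetical helpers, not part of the bosh or credhub CLIs, and the deployment name is the example value from step 2.

```shell
#!/usr/bin/env bash
# Sketch: derive the names used in the steps above from a BOSH deployment name.
# All function names are hypothetical helpers, not part of bosh or credhub.

# service-instance_<uuid> -> <uuid>
cluster_uuid() {
  printf '%s\n' "${1#service-instance_}"
}

# CredHub path of the cluster's tls-nsx-t certificate (steps 3, 7, 8 and 9).
tls_nsx_t_path() {
  printf '/p-bosh/%s/tls-nsx-t\n' "$1"
}

# Working directory named after the last group of the cluster UUID (steps 5 and 6).
cluster_dir() {
  local uuid
  uuid=$(cluster_uuid "$1")
  printf 'cluster-%s\n' "${uuid##*-}"
}

deployment="service-instance_c18dd0e4-00b6-4054-9f86-b1a82aebac1e"
cluster_uuid "$deployment"     # c18dd0e4-00b6-4054-9f86-b1a82aebac1e
tls_nsx_t_path "$deployment"   # /p-bosh/service-instance_c18dd0e4-00b6-4054-9f86-b1a82aebac1e/tls-nsx-t
cluster_dir "$deployment"      # cluster-b1a82aebac1e
```

Deriving these values once helps avoid mixing up the full deployment name (used in CredHub paths and bosh commands) with the bare cluster UUID (used for the NSX-T certificate name pks-<cluster-UUID>).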

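
Instead of reading the notAfter date by eye, the expiration check and the comparison against the backup from step 7 can be scripted. The following is a minimal sketch that assumes openssl is installed and that both certificates have been saved as PEM files. The function names cert_status and certs_differ, and the file name new-tls-nsx-t.crt, are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check certificate expiry and compare old and new certificates.
# cert_status and certs_differ are hypothetical helper names.

# cert_status <pem-file> [seconds]
# Prints "valid" if the certificate does not expire within [seconds] (default 0,
# meaning "not expired right now"), "expiring" if it does, or "unreadable" if
# openssl cannot parse the file.
cert_status() {
  local pem="$1" window="${2:-0}"
  if ! openssl x509 -in "$pem" -noout >/dev/null 2>&1; then
    echo "unreadable"
    return 2
  fi
  # -checkend exits 0 when the certificate will NOT expire within the window.
  if openssl x509 -in "$pem" -noout -checkend "$window" >/dev/null 2>&1; then
    echo "valid"
  else
    echo "expiring"
    return 1
  fi
}

# certs_differ <old-pem> <new-pem>
# Succeeds when the SHA-256 fingerprints differ, that is, when the
# certificate was actually regenerated.
certs_differ() {
  local old new
  old=$(openssl x509 -in "$1" -noout -fingerprint -sha256) || return 2
  new=$(openssl x509 -in "$2" -noout -fingerprint -sha256) || return 2
  [ "$old" != "$new" ]
}
```

For example, after saving the regenerated certificate with the same credhub get pipeline used in step 7 (redirected to new-tls-nsx-t.crt), cert_status new-tls-nsx-t.crt should print valid and certs_differ old-tls-nsx-t.crt new-tls-nsx-t.crt should succeed.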

Register the certificate with NSX-T and push it to Kubernetes VMs to replace the old certificate:

You can leverage pksnsxcli present on the Kubernetes master nodes to register the tls-nsx-t certificate with NSX-T Manager. The advantage of following this procedure is that it creates a locked certificate object which can’t be deleted via the UI.

  1. Issue a command similar to the following to ssh to the master vm:

    bosh ssh -d service-instance_<cluster-uuid> master
  2. Switch to the root user by running sudo su -
  3. Set the pksnsxcli alias by running alias pksnsxcli=/var/vcap/packages/pks-nsx-t-cli/bin/pksnsxcli
  4. Delete the old certificate by running a command similar to the following:

    For manager API:
    pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> --username <username> --password <password> --insecure

    For policy API:
    pksnsxcli delete principal --instance-id <cluster-instance-id> --nsx-manager-host <nsx-manager-hostname> --username <username> --password <password> --insecure --api-type Policy

    Note: Use only the cluster ID in this command (for example, --instance-id c18dd0e4-00b6-4054-9f86-b1a82aebac1e). Do not use the complete service instance name.

    Note: All control plane operations will fail after deleting the principal identity but there will be no impact on the running workloads. New cluster creation will also work at this time. The control plane impact is just isolated to this cluster. Also, this certificate is no longer visible under NSX-T System → Certificates.

    Note: There is a known issue in TKGI versions earlier than 1.14: NSX certificate rotation imports the new certificate into NSX without setting the display name to the TKGI convention pks-<cluster-uuid>, so the rotated certificate appears in NSX under an auto-generated string such as XXXXXxxx-XXXX-XXXX-XXXX-XXXXXXXXXXXX. If pksnsxcli does not work because the display name is not pks-<cluster-uuid>, delete the certificate by ID through the policy API instead:

    curl -X DELETE -sku 'admin:<password>' "https://<nsx manager>/policy/api/v1/infra/certificates/<policy-cert-id>" --header "X-Allow-Overwrite: true"
  5. Issue commands similar to the following to redeploy the cluster from its BOSH manifest, which registers the new certificate with NSX-T and pushes it to the Kubernetes VMs:

    bosh manifest -d service-instance_instance-id > service-instance_instance-id.yml

    bosh deploy -d service-instance_instance-id service-instance_instance-id.yml

    Note: This activity restarts only master VMs in order to update the tls-nsx-t certificate. There is no impact on worker nodes.
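
The delete principal step above has two variants (manager API and policy API) and expects the bare cluster ID rather than the full service instance name. The following is a minimal sketch of a helper that assembles the command line for review before it is run. The function name delete_principal_cmd is hypothetical. It uses only the pksnsxcli flags shown in this article and deliberately leaves the <password> placeholder in place.

```shell
#!/usr/bin/env bash
# Sketch: print the pksnsxcli delete principal command for review.
# delete_principal_cmd is a hypothetical helper; it does not run pksnsxcli.
# Usage: delete_principal_cmd <cluster-id-or-deployment-name> <nsx-host> <username> [manager|policy]
delete_principal_cmd() {
  # Strip the service-instance_ prefix if the full deployment name was given,
  # because --instance-id expects only the cluster ID.
  local id="${1#service-instance_}"
  local host="$2" user="$3" api="${4:-manager}"
  local cmd="pksnsxcli delete principal --instance-id $id --nsx-manager-host $host --username $user --password <password> --insecure"

  case "$api" in
    manager) ;;                                  # manager API: no extra flag
    policy)  cmd="$cmd --api-type Policy" ;;     # policy API variant
    *) echo "unknown API type: $api" >&2; return 1 ;;
  esac

  printf '%s\n' "$cmd"
}

delete_principal_cmd service-instance_c18dd0e4-00b6-4054-9f86-b1a82aebac1e nsxmanager.corp.local admin policy
# pksnsxcli delete principal --instance-id c18dd0e4-00b6-4054-9f86-b1a82aebac1e --nsx-manager-host nsxmanager.corp.local --username admin --password <password> --insecure --api-type Policy
```

Printing the command instead of executing it is a deliberate choice in this sketch: deleting the principal identity stops control plane operations for the cluster until the redeploy completes, so the operator should review the command and supply the password manually.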



