2 Year NATS Certificate Rotation

Article ID: 293688

Products

Operations Manager
VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

On platforms upgraded from an installation that predates Operations Manager (Ops Manager) 2.2, the originally created NATS certificates expire after 2 years. An expired NATS certificate can cause VM downtime.

The recommended rotation procedure requires a full Root CA rotation. If the platform is on Ops Manager 2.2 or above, the newly created NATS certificate will be a 4 year certificate to match the 4 year Root CA certificate.


Environment

Opsman: 2.7
Opsman: 3.x

Resolution

This article describes the repair path for rotating a NATS certificate with a 2 year expiration. It uses the API certificate rotation method shown in this documentation: Rotate the Root CA and Leaf Certificates.

The NATS client certificate (nats_client_ca) is paired with the currently active Root CA certificate. To create a new NATS certificate (NATS2) you must also create a new Root CA certificate (CA2).

In production settings, this procedure is best broken up into 3 rounds:

  1. Create a new Root CA / NATS Client pair (CA2/NATS2).
  2. Activate the new Root CA / NATS pair and regenerate all non-configurable certificates.
  3. Delete the original Root CA and the soon-to-expire NATS CA (CA1/NATS1).

 

Each round ends with an Apply Changes. Every one of these Apply Changes has the same requirements:

  • BOSH Director Tile > Director Config > "Recreate VMs deployed by the BOSH Director" must be checked. This checkbox resets after every successful Apply Changes and must be checked again every round. Make sure to save this change (at the very bottom of the page).
  • The Apply Changes must be run on all tiles.
  • Any service instance tile should be run with the errands "Recreate all service instances" and "Update all service instances" enabled.
  • Any remaining VM should be manually recreated after the Apply Changes with the command "bosh -d {service_instance} recreate".
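When several service instance deployments remain after an Apply Changes, the manual recreate step can be prepared in bulk. The following is a minimal sketch rather than part of the official procedure: print_recreate_commands is a hypothetical helper name, and it only prints the commands so that they can be reviewed before anything is run.

```shell
# Hypothetical helper: read deployment names from stdin (one per line)
# and print, without running, the recreate command for each.
print_recreate_commands() {
  while IFS= read -r deployment || [ -n "$deployment" ]; do
    # Skip blank lines in the input.
    [ -n "$deployment" ] || continue
    printf 'bosh -d %s recreate\n' "$deployment"
  done
}
```

For example, piping two placeholder names through it with printf 'alpha\nbeta\n' | print_recreate_commands prints one "bosh -d ... recreate" line per name. The list of names could come from "bosh deployments --column=name", assuming the BOSH CLI in use supports the --column flag.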

 

EXCEPTION for Recreate VMs step:

  • On newer BOSH and Stemcell versions, the "Recreate VMs deployed by the BOSH Director" checkbox may not be required.
    • This applies ONLY IF: every Linux VM is deployed with stemcell Xenial 621.171 or later or Jammy 1.8 or later, and every Windows VM is deployed with stemcell Windows 2019.41 or later
    • AND IF: BOSH Director is on version 271.12 or later and BOSH Agent is on 2.388.0 or later, as detailed in the BOSH nats-ca-rotation documentation
    • See the Operations Manager Release Notes to determine which BOSH Director version is in use
    • See the Stemcell Release Notes to check whether the Stemcell in use was released after the 1.8 release date
  • If your environment meets these requirements, a simple upgrade of the service instances and an Apply Changes on all tiles will suffice.
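The version thresholds above can be compared in a script instead of by eye. This is a minimal sketch, assuming GNU sort with the -V (version sort) option is available; version_at_least is a hypothetical helper name, and the versions in use still have to be looked up in the release notes as described.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal
# to the minimum $2. Assumes GNU sort, which provides -V (version sort).
version_at_least() {
  # The minimum must sort first (or tie) for the check to pass.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_at_least 621.200 621.171 succeeds, while version_at_least 1.7 1.8 fails.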

 

Note: In the upcoming links you will see "https://OPS-MANAGER-FQDN" which stands for "Ops Manager Fully Qualified Domain Name".

 

Prerequisites:


Get your Admin level Token

Before you can use the "curl" commands, you need to authenticate with "uaac" and get your UAA Bearer Token. For more information, refer to the following documentation: Using the Ops Manager API.

 

Target the UAAC Implementation:

uaac target https://OPS-MANAGER-FQDN/uaa


Authenticate your UAAC:

$ uaac token owner get
#Example Output
Client ID: opsman
Client secret:
User name: admin <--- Your Opsman Login with Administrator scopes
Password: {Password}


Grab your Bearer token and store it in a variable named token:

export token=$(uaac contexts | grep access_token | awk '{print $2}')
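If the export produced an empty variable (for example because the uaac login expired), every later curl call fails with an authorization error. A small guard can catch this early. This is a sketch only; require_token is a hypothetical helper name and not part of uaac or the Ops Manager tooling.

```shell
# Hypothetical guard: fail with a message when $token is unset or empty.
require_token() {
  if [ -z "${token:-}" ]; then
    echo "token is empty: run 'uaac token owner get' and export it again" >&2
    return 1
  fi
}
```

Calling require_token before a curl command, as in require_token && curl ..., stops the sequence before an unauthenticated request is sent.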


Note: Add -v to any of the curl commands for more verbose output. This platform is operating under a self-signed certificate, so "-k" (--insecure) is used to skip SSL certificate validation.

 


Checking your Certificates

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=TIME" \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"

 
The Ops Manager web interface uses this query for the warning banner:

curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=3m" \
-H "Authorization: Bearer $token" 

 
Example:

curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=24m" \
-H "Authorization: Bearer $token" \
-kv | jq '.'


Note: Pipe the output through "jq", which comes installed on all Ops Manager VMs, to format the JSON response.
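The full JSON response can be long, so it may help to reduce it to one line per certificate, ordered by expiry. The following is a sketch under assumptions: it expects the response to hold a top-level "certificates" array whose entries carry "valid_until" and "property_reference" fields, so verify those names against your own output. list_certs_by_expiry is a hypothetical helper name.

```shell
# Hypothetical helper: read a deployed/certificates response on stdin and
# print "valid_until<TAB>property_reference", earliest expiry first.
# Assumed field names: certificates[], valid_until, property_reference.
list_certs_by_expiry() {
  jq -r '.certificates[] | "\(.valid_until)\t\(.property_reference)"' | sort
}
```

It is meant to replace the final | jq '.' of the example query, for instance curl ... -k | list_certs_by_expiry. ISO 8601 timestamps sort correctly as plain text, which is why a plain sort is enough here.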

 


Replacement:

 

Round 1

The new Root Certificate (CA2) will now be generated. Use the Ops Manager API to "generate" the new Root CA (CA2).

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/generate" \
-X POST \
-H "Authorization: Bearer UAA-ACCESS-TOKEN" \
-H "Content-Type: application/json" \
-d '{}'


Example:

curl "https://opsmgr.####.domain.com/api/v0/certificate_authorities/generate" \
-X POST \
-H "Authorization: Bearer $token" \
-H "Content-Type: application/json" \
-d '{}' \
-kv

 


Check Certificates

Check that the command returned a Status 200. This is a good point to confirm that the change was made.
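With -v the status code is buried in the verbose output. As an alternative, curl can print only the status code through its -w '%{http_code}' option. The sketch below wraps that in two hypothetical helpers, http_status and expect_200. Use it on read-only GET queries such as the one below; wrapping a POST in it would send that POST again.

```shell
# Hypothetical helper: run curl with the given arguments and print only
# the HTTP status code (-s hides progress, -o /dev/null drops the body).
http_status() {
  curl -sk -o /dev/null -w '%{http_code}' "$@"
}

# Hypothetical helper: succeed only when the status code is exactly 200.
expect_200() {
  if [ "$1" != "200" ]; then
    echo "unexpected HTTP status: $1" >&2
    return 1
  fi
}
```

For example: expect_200 "$(http_status "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" -H "Authorization: Bearer $token")".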

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"


Example: 

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'

 


Current Status

There should now be 2 pairs of Root CA certificate / NATS certificate. The original Root CA (CA1) should still be active.

1. CA1 Pair

          • CA1 (active:true)
          • NATS1

2. CA2 Pair

          • CA2 (active:false)
          • NATS2
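To compare the query output against this expected state, the response can be condensed to one line per Root CA. This is a sketch under assumptions: it expects a top-level "certificate_authorities" array whose entries carry "guid", "active" and "expires_on" fields, which should be checked against your own output. summarize_cas is a hypothetical helper name.

```shell
# Hypothetical helper: read a certificate_authorities response on stdin
# and print one tab-separated line per CA.
# Assumed field names: certificate_authorities[], guid, active, expires_on.
summarize_cas() {
  jq -r '.certificate_authorities[]
         | "\(.guid)\tactive=\(.active)\texpires=\(.expires_on)"'
}
```

It takes the place of the final | jq '.' in the certificate authorities query. After Round 1, two lines are expected, one with active=true (CA1) and one with active=false (CA2).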

 

Apply Changes

This Apply Changes propagates the new Root CA certificate out to each VM. Some VM types (such as service instances) will need to be manually recreated.

        • BOSH Director Tile > Director Config > "Recreate All VMs" must be checked. This checkbox resets after every successful Apply Changes and must be checked again every round. Make sure to save this change.
        • The Apply Changes must be run on all tiles.
        • Any service instance tile should be run with the errands "Recreate all service instances" and "Update all service instances" enabled.
        • Any remaining VM should be manually recreated after the Apply Changes with the command "bosh -d {service_instance} recreate".

 


Round 2

The new Root Certificate (CA2) has now been propagated out to every VM. Because the older Root Certificate (CA1) is still active, we need to activate the newer Root Certificate (CA2) and then "regenerate" the non-configurable certificates from the newly active Root Certificate (CA2). This round is broken down into two distinct stages.

 

Stage 1

Activate the new Root CA (CA2). This command uses the "certificate guid" returned by the "certificate authorities" query.

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/CERTIFICATE-GUID/activate" \
-X POST \
-H "Authorization: Bearer UAA-ACCESS-TOKEN" \
-H "Content-Type: application/json" \
-d '{}'


Example:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/#####/activate" \
-X POST \
-H "Authorization: Bearer $token" \
-H "Content-Type: application/json" \
-d '{}' \
-kv
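The GUID used in the activate call can be read from the certificate authorities response rather than copied by hand. At this point CA2 is the only CA with active set to false, so the sketch below selects on that. It assumes a top-level "certificate_authorities" array with "guid" and "active" fields; inactive_ca_guid is a hypothetical helper name. It fails instead of guessing when it does not find exactly one inactive CA.

```shell
# Hypothetical helper: read a certificate_authorities response on stdin
# and print the GUID of the single inactive CA. Exits non-zero when there
# is not exactly one. Assumed field names: guid, active.
inactive_ca_guid() {
  jq -r '[.certificate_authorities[] | select(.active == false) | .guid]
         | if length == 1 then .[0]
           else error("expected exactly one inactive CA") end'
}
```

The printed GUID would take the place of CERTIFICATE-GUID in the activate URL. Compare it with the full query output before using it.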


Check Certificates

Check that the command returned a Status 200. This is a good point to confirm that the change was made.

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"

 
Example: 

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'

 


Current Status

The status output should now show that the active flag has switched between the two CAs.

1. CA1 Pair

        • CA1 (active:false)
        • NATS1

2. CA2 Pair

        • CA2 (active:true)
        • NATS2


Stage 2

Now that the new Root Certificate (CA2) has been "activated", we need to "regenerate" the non-configurable certificates so that they validate against CA2.

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate" \
-X POST \
-H "Authorization: Bearer UAA-ACCESS-TOKEN" \
-H "Content-Type: application/json" \
-d '{}'


Example:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate" \
-X POST \
-H "Authorization: Bearer $token" \
-H "Content-Type: application/json" \
-d '{}' \
-kv

 


Apply Changes

We now have a full set of certificates to propagate. The Original Root Certificate (CA1) has not expired and still validates the base communication. The new NATS and DNS certificate components now need to be propagated, which requires an Apply Changes on ALL tiles at once.

        • BOSH Director Tile > Director Config > "Recreate All VMs" must be checked. This checkbox resets after every successful Apply Changes and must be checked again every round. Make sure to save this change (at the very bottom of the page).
        • The Apply Changes must be run on all tiles.
        • Any service instance tile should be run with the errands "Recreate all service instances" and "Update all service instances" enabled.
        • Any remaining VM should be manually recreated after the Apply Changes with the command "bosh -d {service_instance} recreate".

If you would like to use a script, we developed one that checks most VM types for the location of the NATS certificate: KB 297976. It is a good idea to at least spot check that your platform VMs have been updated correctly. Any VM that was missed will become apparent in Round 3 and will need the manual "recreate" step to be completed with the --fix option.

 


Round 3

At this point you should still see two "nats_client_ca" entries in your "deployed certificate" query:

curl "https://opsmgr.####.domain.com/api/v0/deployed/certificates" \
-H "Authorization: Bearer $token" \
-kv | jq '.'
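Counting the entries by eye is error-prone in a long response. The sketch below counts the certificate entries that mention nats_client_ca anywhere in their fields, so it does not depend on which field holds the name. It does assume a top-level "certificates" array, and count_nats_client_ca is a hypothetical helper name.

```shell
# Hypothetical helper: read a deployed/certificates response on stdin and
# print how many entries mention "nats_client_ca" in any field.
# Assumed structure: a top-level "certificates" array.
count_nats_client_ca() {
  jq '[.certificates[] | select(tostring | contains("nats_client_ca"))] | length'
}
```

It takes the place of the final | jq '.' in the query above. A result of 2 matches the state described here.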


This is because the Original Root CA is still present, as seen in the "certificate authorities" query.

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"


Example: 

curl "https://opsmgr.####.domain.com/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'


Using the above query, grab the "certificate guid" of the Original Root CA (CA1) so that it can be deleted. This is the certificate showing "active:false".

 


*Warning: After this step the original root certificate is not recoverable!*

*Deleting the Original Root CA (CA1) is required to remove the "Expiring certificates" warning banner when that banner is warning only about nats_client_ca. You can instead leave the certificate in place until NATS1 expires, but any VM that has not been recreated by then will show the same symptom as it would after CA1 is deleted. This is why we recommend deleting it under controlled circumstances.*

Because the expiring NATS certificate is flagged as an "Expiring certificate" by the "3m" (3 month) query featured at the start of this article, the Original Root CA / NATS CA pair (CA1/NATS1) must be deleted to clear the warning.

This step is not recoverable, and any VM that has not yet been recreated will go into an "unresponsive agent" state after it. Running "bosh -d {service_instance} recreate --fix" should repair the affected VM.
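Because the deletion cannot be undone, it is worth confirming that the GUID about to be deleted belongs to an inactive CA. The following is a sketch of such a guard, assuming the certificate authorities response has a top-level "certificate_authorities" array with "guid" and "active" fields; is_inactive_ca is a hypothetical helper name.

```shell
# Hypothetical guard: succeed only when GUID $1 is present in the
# certificate_authorities response on stdin AND is marked inactive.
# Assumed field names: guid, active.
is_inactive_ca() {
  jq -e --arg guid "$1" \
    'any(.certificate_authorities[]; .guid == $guid and .active == false)' \
    >/dev/null
}
```

Feed it the output of the certificate authorities query and pass the GUID chosen for deletion as the argument. It fails for the active CA and for a GUID that is not listed, which covers the two most likely copy-and-paste mistakes.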

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/OLD-CERTIFICATE-GUID" \
-X DELETE \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"


Example:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/#####" \
-X DELETE \
-H "Authorization: Bearer $token" \
-kv

 


Check Certificates

This is a good point to spot check the certificates. You should now see only CA2 listed.

Command from Documentation:

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"


Example: 

curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'

 


Current Status

    1. CA2 Pair
      • CA2 (active:true)
      • NATS2

 

Apply Changes

This is now a good time to run an Apply Changes under controlled conditions. This way, any VM that was missed during the Round 2 recreation of all VMs can be repaired.

      • BOSH Director Tile > Director Config > "Recreate All VMs" must be checked. This checkbox resets after every successful Apply Changes and must be checked again every round. Make sure to save this change (at the very bottom of the page).
      • The Apply Changes must be run on all tiles.
      • Any service instance tile should be run with the errands "Recreate all service instances" and "Update all service instances" enabled.
      • Any remaining VM should be manually recreated after the Apply Changes with the command "bosh -d {service_instance} recreate".

If the "Warning, Certificates about to expire" banner is still present on your Ops Manager web interface, run the query to see which other certificates are expiring. These may be "configurable" (configurable:true) certificates set in a tile, or certificates managed by other tiles, as indicated by the "property type" ("property_type": "{certificate_name}").

The Ops Manager web interface uses this query for the warning banner:

curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=3m" \
-H "Authorization: Bearer $token"