On platforms that have been upgraded from versions installed prior to Operations Manager (Ops Manager) 2.2, the originally created NATS certificates expire after 2 years. If a NATS certificate expires, it can cause VM downtime.
The recommended rotation procedure requires a full Root CA rotation. If the platform is Ops Manager 2.2 or above, the newly created NATS certificate will be a 4-year certificate to match the 4-year Root CA certificate.
The purpose of this article is to highlight the repair path to rotate a NATS certificate with a 2 year expiration and utilize the API Certificate Rotation method shown in this documentation: Rotate the Root CA and Leaf Certificates.
The NATS client certificate (nats_client_ca) is paired with the currently active Root CA certificate. To create a new NATS certificate (NATS2), you must also create a new Root CA certificate (CA2).
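In production settings, this procedure is best broken up into 3 rounds:
1. Generate the new Root CA (CA2) and Apply Changes.
2. Activate CA2, regenerate the non-configurable certificates, and Apply Changes.
3. Delete the original Root CA (CA1) and Apply Changes.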
After each stage there will be an Apply Changes. This apply change will feature a few similarities each round:
EXCEPTION for Recreate VMs step:
Note: In the upcoming links you will see "https://OPS-MANAGER-FQDN" which stands for "Ops Manager Fully Qualified Domain Name".
Before you can use the "curl" commands, you need to authenticate with "uaac" and get your UAA Bearer Token. For more information, refer to the following documentation: Using the Ops Manager API.
Target the UAAC Implementation:
uaac target https://OPS-MANAGER-FQDN/uaa
Authenticate your UAAC:
$ uaac token owner get
#Example Output
Client ID: opsman
Client secret:
User name: admin <--- Your Opsman Login with Administrator scopes
Password: {Password}
Grab your Bearer token and make a variable named $token:
export token=`uaac contexts | grep access_token | awk '{print $2}'`
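As a minimal sketch, the token extraction above can be wrapped in a small helper that fails loudly when no token is found, so that later curl calls do not silently send an empty Bearer header. The function name extract_token is hypothetical, and the sketch assumes "uaac contexts" prints a line of the form "access_token: VALUE", as the grep/awk pipeline above implies.

```shell
# Hypothetical helper: reads `uaac contexts` output on stdin and prints
# the first access token. Assumes a line shaped like "access_token: VALUE".
# Exits non-zero when no such line is present.
extract_token() {
  awk '$1 == "access_token:" { print $2; found = 1; exit }
       END { if (!found) exit 1 }'
}

# Usage (requires a prior `uaac token owner get`):
#   token=$(uaac contexts | extract_token) || echo "no token found" >&2
#   export token
```

If the helper reports that no token was found, re-run the uaac authentication step before continuing.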
Note: The examples below add -v to the curl commands for more verbose output. This platform is operating under a self-signed certificate, so "-k" (curl's "--insecure" option) is used to skip SSL certificate validation.
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=TIME" \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"
The Ops Manager web interface uses this query for the warning banner:
curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=3m" \
-H "Authorization: Bearer $token"
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=24m" \
-H "Authorization: Bearer $token" \
-kv | jq '.'
Note: Pipe the output through "jq", which comes installed on all Ops Manager VMs, to get readable JSON.
The new Root Certificate (CA2) will now be generated. Use the Ops Manager API to "generate" the new Root CA (CA2).
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/generate" \
-X POST \
-H "Authorization: Bearer UAA-ACCESS-TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
Example:
curl "https://opsmgr.####.domain.com/api/v0/certificate_authorities/generate" \
-X POST \
-H "Authorization: Bearer $token" \
-H "Content-Type: application/json" \
-d '{}' \
-kv
Check that the command returned a status 200. This is a good point to verify that the change was made.
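One way to make the status check explicit is sketched below. It relies on curl's -w '%{http_code}' write-out option to capture the response code. The helper name require_200 and the output file name generate_response.json are hypothetical.

```shell
# Hypothetical helper: succeeds only when the given HTTP status is 200.
require_200() {
  if [ "$1" = "200" ]; then
    echo "OK: received status 200"
  else
    echo "ERROR: expected status 200, got '$1'" >&2
    return 1
  fi
}

# Usage: save the response body to a file, capture only the status code,
# then check it.
#   status=$(curl -sk -o generate_response.json -w '%{http_code}' \
#     "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/generate" \
#     -X POST \
#     -H "Authorization: Bearer $token" \
#     -H "Content-Type: application/json" \
#     -d '{}')
#   require_200 "$status"
```

The same check can be applied to the activate, regenerate, and delete calls later in this article.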
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'
There should now be 2 pairs of Root CA certificate / NATS certificate. The original Root CA (CA1) should still be active.
1. CA1 Pair
2. CA2 Pair
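To compare the two CAs at a glance, the "certificate authorities" output can be reduced to one line per CA. This is a sketch only: it assumes the response has a top-level "certificate_authorities" array whose entries carry "guid", "active", and "expires_on" fields, so verify the field names against your own output. The helper name summarize_cas is hypothetical.

```shell
# Hypothetical helper: reads the certificate_authorities JSON on stdin and
# prints one line per CA. Field names are assumptions; check your own output.
summarize_cas() {
  jq -r '.certificate_authorities[] | "\(.guid) active=\(.active) expires_on=\(.expires_on)"'
}

# Usage:
#   curl -sk "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
#     -H "Authorization: Bearer $token" | summarize_cas
```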
The following Apply Changes propagates out the new Root CA certificate to each VM. Some VM types (such as service instances) will need to be manually recreated.
At this point, the new Root Certificate (CA2) has been propagated out to every VM. Because the older Root Certificate (CA1) is still active, we need to activate the newer Root Certificate (CA2) and then "regenerate" the non-configurable certificates off the new "active" Root Certificate (CA2). This round is broken down into two distinct stages.
Activate the New Root CA (CA2). This command utilizes the "certificate guid" featured on our "certificate authorities" query.
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/CERTIFICATE-GUID/activate" \
-X POST \
-H "Authorization: Bearer UAA-ACCESS-TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/#####/activate" \
-X POST \
-H "Authorization: Bearer $token" \
-H "Content-Type: application/json" \
-d '{}' \
-kv
Check that the command returned a status 200. This is a good point to verify that the change was made.
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'
The output should now show that the active flag has switched between the two pairs: CA1 is no longer active and CA2 is active.
1. CA1 Pair
2. CA2 Pair
Now that the new Root Certificate (CA2) has been "activated", we need to "regenerate" the non-configurable certificates so that they validate against CA2.
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate" \
-X POST \
-H "Authorization: Bearer UAA-ACCESS-TOKEN" \
-H "Content-Type: application/json" \
-d '{}'
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate" \
-X POST \
-H "Authorization: Bearer $token" \
-H "Content-Type: application/json" \
-d '{}' \
-kv
We now have a full set of certificates to populate out. We still have a non-expired original Root Certificate (CA1) to validate the base communication with. We need to populate out the new NATS and DNS certificate components. This requires an Apply Changes on ALL tiles at once.
We developed a script that checks most VM types for the location of the NATS certificate (see KB 297976). It is a good idea to at least spot check that your platform VMs have been updated correctly. Any VMs that are missed will become apparent in round 3 and will need the manual "recreate" step to be completed with the --fix option.
At this point you should still see two "nats_client_ca" entries in your "deployed certificate" query:
curl "https://opsmgr.####.domain.com/api/v0/deployed/certificates" \
-H "Authorization: Bearer $token" \
-kv | jq '.'
This is because the Original Root CA is still present, as seen on the "certificate authorities" query.
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"
Example:
curl "https://opsmgr.####.domain.com/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'
Using the above query, grab the "certificate guid" for the Original Root CA (CA1) so that it can be deleted. This is the certificate featuring "active:false".
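Because the delete step that follows is not recoverable, it can help to select the GUID programmatically and refuse to continue unless exactly one inactive CA is found. The sketch below assumes the response has a top-level "certificate_authorities" array with "guid" and "active" fields. The helper name inactive_ca_guid is hypothetical.

```shell
# Hypothetical helper: reads the certificate_authorities JSON on stdin and
# prints the GUID of the single inactive CA. Fails if there is not exactly
# one, so a delete is never attempted on an ambiguous result.
inactive_ca_guid() {
  guids=$(jq -r '.certificate_authorities[] | select(.active == false) | .guid') || return 1
  # Count non-empty lines; `|| true` keeps a zero count from aborting.
  count=$(printf '%s\n' "$guids" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "ERROR: expected exactly 1 inactive CA, found $count" >&2
    return 1
  fi
  printf '%s\n' "$guids"
}

# Usage:
#   old_guid=$(curl -sk "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
#     -H "Authorization: Bearer $token" | inactive_ca_guid) || exit 1
#   echo "CA to delete: $old_guid"
```

Compare the printed GUID against the full query output by eye before using it in the DELETE call.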
*Deleting the Original Root CA (CA1) is required to remove the "Expiring certificates warning banner" if said banner is warning only on nats_client_ca. You can leave the cert until NATS expires. Any VM that has yet to be recreated will experience the same symptom as the CA1 being deleted when the Original NATS (NATS1) expires. This is why we recommend deleting under controlled circumstances.*
As the expiring NATS certificate is being flagged as an "Expiring certificate" via the "3m" (3 month) query featured at the start of this article, we must delete the Original Root CA / NATS CA pair (CA1/NATS1).
This step is not recoverable. Any VM that has yet to be recreated will go into an "unresponsive agent" state after this step. Running a "bosh -d {service instance} recreate --fix" should repair the affected VM.
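For platforms with many service instances, a dry-run listing of the recreate commands can help plan the repair. The sketch below only prints the commands and does not run them. It assumes that service instance deployments are named with a "service-instance_" prefix and that your BOSH CLI version supports "bosh deployments --column=name"; verify both. The helper name print_recreate_commands is hypothetical.

```shell
# Hypothetical helper: reads deployment names on stdin (one per line) and
# prints a recreate command for each service instance deployment.
# Assumes service instance deployments start with "service-instance_".
print_recreate_commands() {
  awk '$1 ~ /^service-instance_/ { print "bosh -d " $1 " recreate --fix" }'
}

# Usage (prints commands only; review before running them):
#   bosh deployments --column=name | print_recreate_commands
```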
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/OLD-CERTIFICATE-GUID" \
-X DELETE \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/#####" \
-X DELETE \
-H "Authorization: Bearer $token" \
-kv
This is a good point to spot check the certificates. You should now only see CA2 listed.
Command from Documentation:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer UAA-ACCESS-TOKEN"
Example:
curl "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
-X GET \
-H "Authorization: Bearer $token" \
-kv | jq '.'
It is now a good time to do an Apply Changes in a controlled environment. This way, any VM that was missed in the Round 2 recreate step can be repaired.
If you still see the "Warning, Certificates about to expire" banner on your Ops Manager web interface, run the query to view which other certificates are expiring. These may be "configurable" (configurable:true) certificates that are in a tile, or ones managed by other tiles as seen by the "property type" ("property_type": "{certificate_name}").
The Ops Manager web interface uses this query for the warning banner:
curl "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=3m" \
-H "Authorization: Bearer $token"
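To narrow the result down to the configurable certificates, the query output can be filtered with jq. This is a sketch only: it assumes the response has a top-level "certificates" array whose entries carry the "configurable" and "property_type" fields mentioned above, so verify against your own output. The helper name list_configurable_certs is hypothetical.

```shell
# Hypothetical helper: reads the deployed certificates JSON on stdin and
# prints the property_type of each configurable certificate.
# Assumes a top-level "certificates" array; verify against your own output.
list_configurable_certs() {
  jq -r '.certificates[] | select(.configurable == true) | .property_type'
}

# Usage:
#   curl -sk "https://OPS-MANAGER-FQDN/api/v0/deployed/certificates?expires_within=3m" \
#     -H "Authorization: Bearer $token" | list_configurable_certs
```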