Hard Operations Manager Root/NATS CA Rotation

Article ID: 298006

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The Operations Manager root CA and the BOSH Director NATS CA are managed as a single set; they expire and are rotated at the same time. When the CA has expired:
  • In the Operations Manager certificate view, no certificates are listed.
  • All BOSH-deployed instances are in the "unresponsive agent" state.
  • Any "Apply Changes" fails with the error below:
Deploying:
Creating instance 'bosh/0'
Post "https://vcap:<redacted>@x.x.x.x:6868/agent": tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2025-XX-XXTXX:XX:XXZ is after 2025-XX-XXTXX:XX:XXZ
Exit code 1
This Knowledge Base (KB) article provides instructions on how to perform a hard root CA rotation on a platform when the Operations Manager root/NATS CA has expired.

Environment

  • Operations Manager
  • Elastic Application Runtime
  • Service Tiles

Resolution

This method is called a "hard" rotation because it causes some downtime on the platform while the VMs are recreated. Rotating through the normal method is the "soft" rotation because it avoids downtime. The steps below recreate all unresponsive BOSH-deployed VMs so that they use the currently active root/NATS CA.

Run these commands on the Ops Manager VM using a newly created /var/tempest/workspaces/default/deployments/cf-#####.yml deployment manifest. This differs from "bosh recreate --fix" because it deploys the new manifest containing the newly created root CA.

These commands reference OPS-MANAGER-FQDN, which stands for the Ops Manager Fully Qualified Domain Name. For example: https://ops-manager-url.com.


Prepare New Root/NATS CA Certificate

The first task is to create the new root CA. The steps to create a new CA can be found in this documentation.

1. Disable the BOSH resurrector by running "bosh update-resurrection off".

 - As we bring VMs back online, we do not want the system to automatically recreate any VM.

2. Before you can use "curl" against the Ops Manager API, you will need to get a UAA token with uaac.

Target the Ops Manager UAA with uaac:

uaac target https://OPS-MANAGER-FQDN/uaa

Authenticate with Operations Manager admin account:

$ uaac token owner get
#Example Output
Client ID: opsman
Client secret:
User name: admin <--- Your Opsman Login with Administrator scopes
Password: {Password}

Grab your access token and store it in a variable named token:

export token=`uaac context | grep access_token | awk '{print $2}'`


3. Check expired certificates

  • Either run curl against "OPS-MANAGER-FQDN/api/v0/certificate_authorities" or view that URL in a browser; a curl sketch follows this list.
  • Repeat this step after each remaining step to verify its completion.
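
A minimal curl sketch, assuming the $token variable exported in step #2 (the -k flag is only needed if Ops Manager presents a self-signed or otherwise untrusted certificate):

curl -k "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities" \
  -X GET \
  -H "Authorization: Bearer $token"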

4. Generate a new root CA

  • This step uses a curl command with your bearer token from step #2; a sketch of the call follows this list.
  • The URL for the curl is "OPS-MANAGER-FQDN/api/v0/certificate_authorities/generate".
  • Verify the certificate was created using step #3 ("OPS-MANAGER-FQDN/api/v0/certificate_authorities").
  • The new CA will show "active": false.
  • As part of this procedure, skip the "Apply Changes" portion of the rotation instructions.
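
A minimal sketch of the generate call, assuming the $token variable from step #2; the empty JSON body and Content-Type header follow the Ops Manager API convention:

curl -k "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/generate" \
  -X POST \
  -H "Authorization: Bearer $token" \
  -H "Content-Type: application/json" \
  -d '{}'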

5. Mark the new root CA as Active.

  • The curl command uses the bearer token from step #2; a sketch of the call follows this list.
  • The command "activates" the newly created certificate by using that entry's guid from the step #3 listing.
  • The URL for the curl is "OPS-MANAGER-FQDN/api/v0/certificate_authorities/CERTIFICATE-GUID/activate".
  • Verify the certificate was activated using step #3.
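
A minimal sketch of the activate call, assuming $token and substituting the guid of the new CA for CERTIFICATE-GUID:

curl -k "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/CERTIFICATE-GUID/activate" \
  -X POST \
  -H "Authorization: Bearer $token" \
  -H "Content-Type: application/json" \
  -d '{}'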

6. Regenerate the non-configurable certificates using Active CA.

  • The curl command uses the bearer token from step #2; a sketch of the call follows this list.
  • The command "regenerates" all of the non-configurable certificates for the platform so they are signed by the newly activated CA.
  • The URL for the curl is "OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate".
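
A minimal sketch of the regenerate call, again assuming $token:

curl -k "https://OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate" \
  -X POST \
  -H "Authorization: Bearer $token" \
  -H "Content-Type: application/json" \
  -d '{}'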

The newly created root CA certificate needs to be incorporated into a new deployment manifest. This manifest will be used by BOSH to redeploy.

Deploy the new CA to Elastic Application Runtime (EAR) 

1. Select the "BOSH Director" tile and, in "Director Config", select the "Recreate All VMs" checkbox.
Note: The "Recreate All VMs" checkbox resets after every successful "Apply Changes".

2. Select the "Elastic Application Runtime" tile and, in "Resource Config", scale the Diego BBS (diego_database) instance group down to one. The high availability orchestration of the diego_database cluster, which uses a "locket" layer to "gossip" status updates between nodes, is similar to the "galera" layer of the MySQL cluster. It differs, however, in that it pulls its information from the MySQL cluster when scaled back up, so no state needs to be preserved. Scaling down at the start removes extra steps later on. If you forget this step and the manual BOSH deploy below fails, you can repair it by editing the manifest (cf-####.yml) from "instances: 3" to "instances: 1" for diego_database and then running "deploy" again (see the sketch after this step).
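
If you need to make that manifest edit by hand, a minimal sketch for locating the line (assuming the manifest has been fetched into /var/tempest/workspaces/default/deployments/ as described in steps 5 and 6 below):

grep -n -A 6 "name: diego_database" /var/tempest/workspaces/default/deployments/cf-####.yml
# find the "instances:" line under the diego_database instance group and change 3 to 1 with a text editor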

3. Click on "Review Pending Changes" and then "Apply Changes".

4. This Apply Changes will fail on the first EAR deployment VM recreate, which is expected due to the invalid CA certificate.

5. Fetch the EAR deployment manifest with "bosh -d cf-#### manifest > cf-####.yml".

Note: Skip step 6 if you have a high availability MySQL cluster (three or more nodes) and instead use the following section, "If EAR has a 3-node MySQL Cluster".

6. In the deployments directory on the Ops Manager VM where the manifest resides (/var/tempest/workspaces/default/deployments/), run "bosh -d cf-#### deploy cf-####.yml --fix". This will recreate each VM and deploy the new CA certificate.

7. After the completion of the manual BOSH deployment you must run an Apply Changes. This will confirm all components are up to date. 

If EAR has a 3-node MySQL Cluster

If you have a high availability MySQL cluster in the EAR deployment, the first "deploy" run will fail at the EAR MySQL cluster: the "monit" process startup fails and resolves to a single process, "localhost". The solution is to combine the "deploy" method with selectively "ignoring" VMs between deploys (each deploy recreates the non-ignored VMs). Each recreated VM will carry the new root CA certificate. This method provides a safety net in case something goes wrong with VM recreation.

Steps:

  • In a separate terminal, SSH into the "mysql-monitor" VM and run "mysql-diag" to verify the cluster status before starting, as we want to avoid any other complications. Please use the following documentation: https://techdocs.broadcom.com/us/en/vmware-tanzu/platform/tanzu-platform-for-cloud-foundry/6-0/tpcf/mysql-diag.html. Leave this terminal open to verify VM state.
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". This will fail immediately on mysql/0.
  • Make note of the guid for mysql/0. We will refer to the guids in bosh commands as "0_guid", "1_guid", and "2_guid".
  • After the failure, ignore the first VM with "bosh -d cf-#### ignore mysql/0_guid".
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". This will fail immediately on mysql/1.
  • After the failure, ignore the second VM with "bosh -d cf-#### ignore mysql/1_guid".
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". This will fail immediately on mysql/2.
  • After the failure, ignore the third VM with "bosh -d cf-#### ignore mysql/2_guid".

All 3 should now be in a "failed" state. We need to find the cluster's leader.

  • Use "bosh -d cf-#### ssh mysql/0_guid" (and likewise for the other two) to log into each MySQL VM and determine which VM to bootstrap as the leader of the cluster. It is best to open 3 terminals so you know which VM is array number "/0", "/1", or "/2". These commands are covered in the Bootstrap documentation (https://techdocs.broadcom.com/us/en/vmware-tanzu/platform/tanzu-platform-for-cloud-foundry/6-0/tpcf/bootstrap-mysql.html#bootstrap-manually-6).
  • Verify all 3 VMs report the same seqno of "-1" by running "cat /var/vcap/store/pxc-mysql/grastate.dat | grep 'seqno:'" in all 3 terminals.
  • Find which of the 3 VMs has the highest sequence number by running "/var/vcap/packages/pxc/bin/mysqld --defaults-file=/var/vcap/jobs/pxc-mysql/config/my.cnf --wsrep-recover" to write the number to a log.
  • Look at the end of that log for the last set of numbers using "grep 'Recovered position' /var/vcap/sys/log/pxc-mysql/mysql.err.log | tail -1" and make note of the final number following the colon ":".
  • The VM with the highest sequence number is our best VM and the one we wish to use: this is the cluster's leader. The other two VMs will sync to it during their recreation. Make note of this VM's guid and array number.
  • On the MySQL VM with the highest sequence number, run "echo -n 'NEEDS_BOOTSTRAP' | sudo tee /var/vcap/store/pxc-mysql/state.txt" (tee is used so the file is written with root privileges; a plain "sudo echo ... >" would apply the redirection without sudo).
  • Run "bosh -d cf-#### unignore mysql/{#_guid}" using the leader's guid, as this is the first VM we are going to modify: the cluster leader.
  • Verify that the other 2 VMs are still ignored by running "bosh -d cf-#### instances --details | grep mysql" and checking the ignore column.
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". 
  • Upon completion of this manual deploy you should have 1 "running" MySQL VM.
  • Run "bosh unignore mysql/{other 2 vms}". This will allow the 2 VMs to be re-created with the new root CA and then join the cluster. 
  • Run "bosh -d cf-#### deploy cf-####.yml --fix" to create the remaining 2 VMs. 

All 3 should now be in the "running" state. This can be verified using the same mysql-diag command from earlier. Now resume with step 7 above (run Apply Changes).

Warning: Do not modify the manifest (at any point) to have 1 MySQL VM instead of the high availability 3. This carries much higher risk because it deletes the disks of 2 VMs prior to performing the recreate, and we have seen issues with the IaaS attaching the remaining disk to the first MySQL VM, so we do not recommend that approach. It is mentioned primarily so you know why you should not reduce to 1 VM; we prefer to perform deletes under controlled circumstances.

Deploy the new CA to On-Demand Services

Once the EAR deployment completes, we move on to the service instances. This applies to tiles such as MySQL, Redis, RabbitMQ, TKGI, and Spring. There may be other tiles not listed; any tile that requires communication with the EAR deployment will need to be redone.

1. Run Apply Changes with the service tile selected and "Upgrade All On-Demand Service Instances" turned on. Some service tiles have an errand for this, such as MySQL and RabbitMQ. Make sure the errand is selected on all tiles that have a BOSH deployment.

2. The deployment is expected to fail when the NATS CA has expired because the VMs are in an unresponsive state; follow steps similar to those used for EAR. "bosh deploy" with --fix will resolve the unresponsive VMs.

  1. "bosh -d SERVICE_TILE_DEPLOYMENT manifest >SERVICE_TILE_DEPLOYMENT.yml"
  2. "bosh -d SERVICE_TILE_DEPLOYMENT deploy --fix SERVICE_TILE_DEPLOYMENT.yml"

"Upgrade All On-Demand Service Instances" may fail due to the same reason with service instance VMs.  Please resolve it by using "bosh deploy" with --fix as well. 

  1. "bosh -d service-intance_#### manifest >service-intance_####.yml"
  2. "bosh -d service-intance_#### deploy --fix service-intance_####.yml"

Finally, turn resurrection back on once the CA has been successfully rotated: "bosh update-resurrection on".


If you encounter any problems, please contact Broadcom Support by opening a support request.