This Knowledge Base (KB) article provides a set of instructions on how to perform a Hard Root CA Rotation on a platform where the Root CA has either expired or been deleted. This information applies to Tanzu Application Service for VMs (TAS for VMs) versions 2.7 or above.
The method is called "Hard" because it causes some platform downtime while the VMs are recreated. The normal rotation method is called "Soft" because it avoids downtime.
These steps recreate all VMs in the CF deployment so that they use the current Active Root CA. Run the commands on the Ops Manager VM using a newly created /var/tempest/workspaces/default/deployments/cf-#####.yml deployment manifest. This differs from "bosh recreate --fix" because it deploys the new manifest, which contains the newly created Root CA.
The commands reference OPS-MANAGER-FQDN, which stands for "Ops Manager Fully Qualified Domain Name". For example: ops-manager-url.com.
The first task is to create the new root certificates. The steps to create a new root CA can be found in this documentation:
Rotating Certificates: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/security-pcf-infrastructure-rotate-cas-and-leaf-certs.html
These commands are intended to be run on the Ops Manager VM. Instructions for connecting to the Ops Manager VM via SSH: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/install-ssh-login.html
Steps:
1. Disable the BOSH resurrector by running "bosh update-resurrection off".
- While the VMs are being brought back online, the system must not automatically recreate any VM.
2. Before you can use the "curl" commands, you need to authenticate with "uaac" and get your UAA bearer token. https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/install-ops-man-api.html#log-in-to-ops-manager-5
Get your Admin level Token
Target the UAAC Implementation:
uaac target https://OPS-MANAGER-FQDN/uaa
Authenticate your UAAC:
uaac token owner get
# Example output:
Client ID: opsman
Client secret:
User name: admin <--- Your Ops Manager login with Administrator scopes
Password: {Password}
Grab your Bearer token and make a variable named $token:
export token=`uaac context | grep access_token | awk '{print $2}'`
3. Check for expired certificates: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/security-pcf-infrastructure-check-expiration.html
4. Generate a new root CA. https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/security-pcf-infrastructure-rotate-cas-and-leaf-certs.html#step-1-add-new-cas-3
5. Mark the new root CA as Active.
6. Regenerate the non-configurable certificates using the Active CA.
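Steps 1 and 3 to 6 can also be issued from the command line. The following is a minimal sketch, assuming the certificate endpoints described in the Ops Manager documentation linked above and the $token variable exported in step 2. The helper names (run, opsman_get, opsman_post, and so on), the OPSMAN variable, and the DRY_RUN switch are illustrative and not part of any product. By default the helpers only print the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only. OPSMAN and DRY_RUN are hypothetical variables for this example.
OPSMAN="${OPSMAN:-OPS-MANAGER-FQDN}"   # replace with your Ops Manager FQDN
DRY_RUN="${DRY_RUN:-1}"                # 1 = print the command, 0 = execute it

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Authenticated GET and POST against the Ops Manager API.
# $token is the bearer token exported in step 2.
# If Ops Manager uses a self-signed certificate, curl may also need -k.
opsman_get() {
  run curl -s "https://$OPSMAN$1" -H "Authorization: Bearer $token"
}
opsman_post() {
  run curl -s -X POST "https://$OPSMAN$1" \
    -H "Authorization: Bearer $token" \
    -H "Content-Type: application/json" -d '{}'
}

# Step 1: disable the resurrector.
disable_resurrector() { run bosh update-resurrection off; }

# Step 3: list certificates expiring within the given window (for example 3m).
check_expiring() { opsman_get "/api/v0/deployed/certificates?expires_within=$1"; }

# Step 4: generate a new root CA, then list the CAs to find its GUID.
generate_ca() { opsman_post "/api/v0/certificate_authorities/generate"; }
list_cas()    { opsman_get "/api/v0/certificate_authorities"; }

# Step 5: mark the CA with the given GUID as active.
activate_ca() { opsman_post "/api/v0/certificate_authorities/$1/activate"; }

# Step 6: regenerate the non-configurable certificates from the active CA.
regenerate_certs() { opsman_post "/api/v0/certificate_authorities/active/regenerate"; }
```

To use it, set OPSMAN to your FQDN, call the functions in step order (for example "check_expiring 3m", then "generate_ca"), and set DRY_RUN=0 only once the printed commands look correct.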
The newly created Root CA Cert needs to be incorporated into a new deployment manifest. This manifest will be used by BOSH to re-deploy.
Steps:
Note: Selecting "Recreate All VMs" is not necessary on versions after 2.10.20.
7. Select the "BOSH Director Tile" and in the "Director Config" select the checkbox "Recreate All VMs".
Note: The "Recreate All VMs" checkbox resets after every successful "Apply Changes".
8. Select the "Tanzu Application Service Tile" and in the "Resource Config" scale the Diego BBS (diego_database) down to one instance. The diego_database cluster uses a "locket" layer for high availability orchestration, through which the instances "gossip" status updates to each other, similar to the "galera" layer on the MySQL cluster. Unlike the MySQL cluster, it pulls its information from the MySQL cluster when it is scaled back up, so no resources need to be preserved. Scaling down at the start removes extra steps later on. If you forget this step and the manual BOSH deploy that follows fails, you can repair it by editing the manifest (cf-###.yml) to change "instances:3" to "instances:1" for diego_database and then running "deploy" again.
9. Click on "Review Pending Changes" and then "Apply Changes".
10. This Apply Changes is expected to fail on the first VM recreate in the TAS for VMs deployment because of the invalid CA cert.
11. Make the newly created cf-####.yml manifest readable. The manifest is on the Ops Manager VM, and its permissions can be changed with the command "cd /var/tempest/workspaces/default/deployments/ && sudo chmod a+r *.yml"
Note: Skip step 12 if you have a High Availability MySQL cluster (three or more instances) and follow the section below on TAS for VMs MySQL clusters of three or more instead.
12. In the deployments directory on the Ops Manager VM where the manifest resides (/var/tempest/workspaces/default/deployments/), run "bosh -d cf-#### deploy cf-####.yml --fix". This recreates each VM and deploys the new CA cert.
13. After the completion of the manual BOSH deployment you must run an Apply Changes. This will confirm all components are up to date.
14. Enable the bosh resurrector by running "bosh update-resurrection on"
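Steps 11, 12, and 14 can be sketched as a small script. This is only an illustration under stated assumptions: the deployment name is passed in by hand (cf-abc123 below is a made-up name standing in for your cf-#### deployment), the function names and the DRY_RUN switch are hypothetical, and the chmod is narrowed to the single manifest rather than *.yml. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only. DEPLOYMENTS_DIR is the manifest directory named in the steps above.
DEPLOYMENTS_DIR="${DEPLOYMENTS_DIR:-/var/tempest/workspaces/default/deployments}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command, 0 = execute it

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 11: make the manifest of the given deployment readable.
make_manifest_readable() {
  run sudo chmod a+r "$DEPLOYMENTS_DIR/$1.yml"
}

# Step 12: deploy the manifest that carries the new Root CA.
# Skip this for a High Availability MySQL cluster, as noted above.
deploy_with_new_ca() {
  run bosh -d "$1" deploy "$DEPLOYMENTS_DIR/$1.yml" --fix
}

# Step 14: re-enable the resurrector. Run it only after the Apply Changes
# from step 13 has completed in the Ops Manager UI.
enable_resurrector() {
  run bosh update-resurrection on
}
```

For example, "make_manifest_readable cf-abc123" followed by "deploy_with_new_ca cf-abc123" prints the two commands for review. Setting DRY_RUN=0 executes them.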
If you have a High Availability MySQL cluster in your TAS for VMs deployment, the first "deploy" run will fail on the TAS for VMs MySQL cluster. In TAS version 2.7 and above, the "monit" process startup fails and resolves to a single process, "localhost". The solution is to combine the "deploy" method with selective VM "ignore" / "deploy" (which creates the VM). Each newly created VM will carry the new Root CA certificate. This method is designed to provide a safety net in case something goes wrong with VM recreation.
Steps:
All 3 should now be in a "failed" state. We need to find the cluster's leader.
All 3 should now be in a "running" state. This can be verified using the same mysql-diag command from earlier. Then continue with step 13 above.
Warning: Do not modify the manifest (at any point) to have 1 MySQL VM instead of the High Availability 3. This carries a much higher risk because it deletes the disks of 2 VMs before performing the recreate. We have seen issues with the IaaS attaching the disk to the first MySQL VM, so we do not recommend this approach. It is mentioned primarily so that you know why you should not reduce to 1 VM. We prefer to do the delete in controlled circumstances.
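The "ignore" / "deploy" mechanic mentioned in this section relies on the "bosh ignore" and "bosh unignore" commands, which tell BOSH to skip, or to stop skipping, an instance during a deploy. The sketch below only illustrates that mechanic and does not replace the procedure for this section. The function names, the DRY_RUN switch, and the deployment and instance names in the usage note are hypothetical. By default the commands are printed, not executed.

```shell
#!/bin/sh
# Sketch only: print (or run) bosh ignore / unignore for a list of instances.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command, 0 = execute it

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: ignore_instances DEPLOYMENT INSTANCE...
# Marks each instance as ignored so the next deploy leaves it untouched.
ignore_instances() {
  deployment="$1"; shift
  for instance in "$@"; do
    run bosh -d "$deployment" ignore "$instance"
  done
}

# Usage: unignore_instances DEPLOYMENT INSTANCE...
# Clears the ignore flag so a later deploy recreates the instance.
unignore_instances() {
  deployment="$1"; shift
  for instance in "$@"; do
    run bosh -d "$deployment" unignore "$instance"
  done
}
```

For example, "ignore_instances cf-abc123 mysql/GUID-1 mysql/GUID-2" prints one "bosh ignore" command per instance, where cf-abc123 and the GUIDs are placeholders for the values reported by "bosh instances" in your environment.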
Service instance Tiles
Once the TAS for VMs deployment is healthy, we move on to the service instances. This applies to tiles such as MySQL, Redis, RabbitMQ, and Spring. There may be other tiles not listed here. As a rule, any tile that requires communication with the TAS for VMs deployment will need to be redeployed.
Steps:
Note: In a large TAS environment (100+ VMs), the scan that the BOSH Director performs against all bosh-agents in the "unresponsive_agent" state can take a long time. VMware Support can help to speed up this scan. Please contact VMware Support.