Hard Root CA Rotation for TAS 2.7+

Article ID: 298006


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

When a Root CA expires or is deleted, all the VMs go into an "unresponsive agent" state. Apply Changes fails on the first VM it tries to recreate.

This Knowledge Base (KB) article provides a set of instructions on how to perform a Hard Root CA Rotation on a platform where the Root CA has either expired or been deleted. This version is for Tanzu Application Service versions 2.7 or above.

Environment

Product Version: 2.7

Resolution

This Knowledge Base (KB) article provides a set of instructions on how to perform a Hard Root CA Rotation on a platform where the Root CA has either expired or been deleted. This information applies to Tanzu Application Service for VMs (TAS for VMs) versions 2.7 or above. 

The method is called "Hard" because it causes some downtime on the platform while the VMs are recreated. Rotating through the normal method is the "Soft" rotation because it prevents downtime.

These steps recreate all of the CF deployment's VMs so that they use the current active Root CA. Run these commands on the Ops Manager VM using a newly created /var/tempest/workspaces/default/deployments/cf-#####.yml deployment manifest. This differs from "bosh recreate --fix" because it "deploys" the new manifest with the newly created Root CA.

These commands will reference OPS-MANAGER-FQDN. This stands for "Ops Manager Fully Qualified Domain Name". For example: https://ops-manager-url.com.


Prepare New Root Certificate

The first task is to create the new root certificates. The steps to create a new root CA can be found in this documentation: 

 Rotating Certificates: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/security-pcf-infrastructure-rotate-cas-and-leaf-certs.html 


These commands are intended to be run on the Ops Manager VM. Instructions for connecting to the Ops Manager VM via SSH: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/install-ssh-login.html

Steps:
1. Disable the BOSH Resurrector by running "bosh update-resurrection off".

 - As we bring VMs back online, we do not want the system to automatically recreate any VM.

2. Before you can use the "curl" commands, you will need to authenticate with "uaac" and get your UAA bearer token: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/install-ops-man-api.html#log-in-to-ops-manager-5

Get your Admin level Token

Target the UAAC Implementation:

uaac target https://OPS-MANAGER-FQDN/uaa

Authenticate your UAAC:

$ uaac token owner get
#Example Output
Client ID: opsman
Client secret:
User name: admin <--- Your Opsman Login with Administrator scopes
Password: {Password}

Grab your Bearer token and make a variable named $token:

export token=`uaac context | grep access_token | awk '{print $2}'`
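As a sanity check, the grep/awk pipeline can be exercised against sample `uaac context` output (the sample token below is purely illustrative, not a real token):

```shell
# Illustrative only: text shaped like `uaac context` output; the real
# command above pipes the live `uaac context` output instead.
sample_context='  access_token: eyJhbGciOiJSUzI1NiJ9.sample-token
  token_type: bearer'

# Same grep/awk pipeline as the export command above.
token=$(printf '%s\n' "$sample_context" | grep access_token | awk '{print $2}')
echo "$token"
```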


3. Check for expired certificates: https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/security-pcf-infrastructure-check-expiration.html

  • Either run the curl command or use a browser to view "OPS-MANAGER-FQDN/api/v0/certificate_authorities".
  • Repeat this step after each remaining step to verify its completion.
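A hedged sketch of the step-3 check from the Ops Manager VM ($token comes from step #2; the FQDN below is a placeholder for your own):

```shell
# List all certificate authorities. -k skips TLS verification, which can be
# necessary while the root CA is expired or missing.
OPS_MAN="https://OPS-MANAGER-FQDN"   # placeholder: your Ops Manager FQDN
curl -sk "${OPS_MAN}/api/v0/certificate_authorities" \
  -X GET \
  -H "Authorization: Bearer ${token}"
```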

4. Generate a new root CA. https://docs.vmware.com/en/VMware-Tanzu-Operations-Manager/3.0/vmware-tanzu-ops-manager/security-pcf-infrastructure-rotate-cas-and-leaf-certs.html#step-1-add-new-cas-3

  • This step uses a curl command with your bearer token from step #2.
  • The URL for the curl will be "OPS-MANAGER-FQDN/api/v0/certificate_authorities/generate".
  • Check that the certificate was created using step #3 ("OPS-MANAGER-FQDN/api/v0/certificate_authorities").
  • The new CA will show "active:false".
  • As part of this procedure, skip the "Apply Changes" portion of the rotation instructions.
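The step-4 generate call, sketched with the same placeholders (an empty JSON body is assumed here, following the Ops Manager API documentation linked above):

```shell
# Generate a new root CA; the response should list the new CA with
# "active": false.
OPS_MAN="https://OPS-MANAGER-FQDN"   # placeholder: your Ops Manager FQDN
curl -sk "${OPS_MAN}/api/v0/certificate_authorities/generate" \
  -X POST \
  -H "Authorization: Bearer ${token}" \
  -H "Content-Type: application/json" \
  -d '{}'
```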

5. Mark the new root CA as Active.

  • The curl command will utilize the bearer token from step #2.
  • The command will "activate" the newly created certificate by utilizing the listed entry's GUID.
  • The URL for the curl will be "OPS-MANAGER-FQDN/api/v0/certificate_authorities/CERTIFICATE-GUID/activate".
  • Check the certificate was activated using step #3
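The step-5 activate call, sketched with placeholders (substitute the GUID of the new CA from the step-3 listing):

```shell
# Activate the new root CA by its GUID.
OPS_MAN="https://OPS-MANAGER-FQDN"   # placeholder: your Ops Manager FQDN
CERT_GUID="CERTIFICATE-GUID"         # placeholder: GUID of the new CA
curl -sk "${OPS_MAN}/api/v0/certificate_authorities/${CERT_GUID}/activate" \
  -X POST \
  -H "Authorization: Bearer ${token}" \
  -H "Content-Type: application/json" \
  -d '{}'
```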

6. Regenerate the non-configurable certificates using Active CA.

  • The curl command will utilize the bearer token from step #2.
  • The command will "regenerate" all of the non-configurable certificates for the platform. 
  • The URL for the curl will be "OPS-MANAGER-FQDN/api/v0/certificate_authorities/active/regenerate"
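The step-6 regenerate call, sketched with the same placeholders:

```shell
# Regenerate all non-configurable certificates signed by the active CA.
OPS_MAN="https://OPS-MANAGER-FQDN"   # placeholder: your Ops Manager FQDN
curl -sk "${OPS_MAN}/api/v0/certificate_authorities/active/regenerate" \
  -X POST \
  -H "Authorization: Bearer ${token}" \
  -H "Content-Type: application/json" \
  -d '{}'
```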

The newly created Root CA Cert needs to be incorporated into a new deployment manifest. This manifest will be used by BOSH to re-deploy.

Steps:

Note: "Recreate All VMs" is not necessary after Ops Manager 2.10.20.
7. Select the BOSH Director tile and, in "Director Config", select the "Recreate All VMs" checkbox.
Note: The "Recreate All VMs" checkbox resets after every successful "Apply Changes".

8. Select the Tanzu Application Service tile and, in "Resource Config", scale the Diego BBS (diego_database) down to one instance. The high-availability orchestration of the diego_database cluster, which uses a "locket" layer to gossip status updates between nodes, is similar to the "galera" layer on the MySQL cluster. It differs, however, in that it pulls its information from the MySQL cluster when scaled back up, so we do not need to preserve any resources. Scaling down at the start removes extra steps later on. If you forget this step and the manual BOSH deploy below fails, you can repair it by editing the manifest (cf-####.yml) from "instances: 3" to "instances: 1" for diego_database and then running the deploy again.

9. Click on "Review Pending Changes" and then "Apply Changes".

10. This Apply Changes will fail on the first TAS for VMs VM it tries to recreate. This is expected due to the invalid CA cert.

11. Make the newly created cf-####.yml readable. The manifest is on the Ops Manager VM and can be modified with the command "cd /var/tempest/workspaces/default/deployments/ && sudo chmod a+r *.yml".

Note: Skip step 12 if you have a High Availability MySQL cluster (three or more nodes) and instead use the section "TAS for VMs MySQL Clusters of 3 or more" below.

12. In the deployments directory on the Opsman VM where the manifest resides (/var/tempest/workspaces/default/deployments/) run "bosh -d cf-#### deploy cf-####.yml --fix". This will recreate each VM and deploy the new CA cert. 

13. After the completion of the manual BOSH deployment you must run an Apply Changes. This will confirm all components are up to date. 

14. Enable the BOSH Resurrector by running "bosh update-resurrection on".
 

TAS for VMs MySQL Clusters of 3 or more

If you have a High Availability MySQL cluster in your TAS for VMs deployment, the first "deploy" run will fail on the MySQL cluster. In TAS versions 2.7 and above, the "monit" process startup fails and the cluster resolves to a single "localhost" process. The solution is to combine the "deploy" method with selectively "ignoring" and deploying individual VMs (which recreates them). Each newly created VM will have the new Root CA certificate. This method provides a safety net in case something goes wrong with VM recreation.

Steps:

  • In a separate terminal, SSH into the "mysql-monitor" VM and run "mysql-diag" to verify the cluster status before starting, as we want to avoid any other complications. Please use the following documentation: https://docs.vmware.com/en/VMware-Tanzu-Application-Service/6.0/tas-for-vms/mysql-diag.html. Leave this terminal open to verify VM state.
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". This will fail immediately on mysql/0.
  • Make note of the GUID for mysql/0. We will refer to the GUIDs in bosh commands as "0_guid", "1_guid", and "2_guid".
  • After the failure, ignore the first VM with the command "bosh ignore mysql/0_guid".
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". This will fail immediately on mysql/1.
  • After the failure, ignore the second VM with the command "bosh ignore mysql/1_guid".
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". This will fail immediately on mysql/2.
  • After the failure, ignore the third VM with the command "bosh ignore mysql/2_guid".
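The deploy/ignore cycle above can be sketched as a loop (the deployment name, manifest name, and GUIDs below are placeholders; read the real instance GUIDs from "bosh -d cf-#### instances"):

```shell
# Hedged sketch: each deploy attempt is expected to fail on the next
# un-ignored mysql VM, which we then ignore before retrying.
DEPLOYMENT="cf-12345"     # placeholder: your cf deployment name
MANIFEST="cf-12345.yml"   # placeholder: manifest in /var/tempest/workspaces/default/deployments/
for GUID in MYSQL0_GUID MYSQL1_GUID MYSQL2_GUID; do   # placeholder GUIDs
  bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix || true  # expected to fail on mysql/N
  bosh -d "$DEPLOYMENT" ignore "mysql/${GUID}"
done
```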

All 3 should now be in a "failed" state. We need to find the cluster's leader.

  • Use "bosh ssh mysql/0_guid" (and likewise for the others) to log into each MySQL VM and determine which one to bootstrap as the cluster leader. It is best to open 3 terminals so you know which VM is array number "/0", "/1", or "/2". These commands are featured in the Bootstrap documentation (https://docs.vmware.com/en/VMware-Tanzu-Application-Service/6.0/tas-for-vms/bootstrap-mysql.html#bootstrap-manually-6).
  • Verify all 3 VMs have the same "seqno: -1" by running "cat /var/vcap/store/pxc-mysql/grastate.dat | grep 'seqno:'" in all 3 terminals.
  • Find which of the 3 VMs has the highest sequence number by running "/var/vcap/packages/pxc/bin/mysqld --defaults-file=/var/vcap/jobs/pxc-mysql/config/my.cnf --wsrep-recover" to write the number to a log.
  • Look at the end of that log for the recovered position using "grep "Recovered position" /var/vcap/sys/log/pxc-mysql/mysql.err.log | tail -1" and make note of the final number following the colon ":".
  • The VM with the highest sequence number is our best VM and the one we wish to use: the cluster's leader. The other two VMs will sync to it during their recreation. Make note of this VM's GUID and array number.
  • On the MySQL VM with the highest sequence number, run "echo -n "NEEDS_BOOTSTRAP" | sudo tee /var/vcap/store/pxc-mysql/state.txt". (A plain "sudo echo ... > file" would perform the redirect as the unprivileged user.)
  • Run "bosh unignore mysql/{leader_guid}", as this is the first VM we are going to modify: the cluster leader.
  • Verify that the other 2 VMs are still ignored by running "bosh -d cf-#### instances --details | grep mysql" and checking the "Ignored" column.
  • Run "bosh -d cf-#### deploy cf-####.yml --fix". 
  • Upon completion of this manual deploy you should have 1 "running" MySQL VM.
  • Run "bosh unignore mysql/{guid}" for each of the other 2 VMs. This will allow them to be recreated with the new root CA and then join the cluster.
  • Run "bosh -d cf-#### deploy cf-####.yml --fix" to create the remaining 2 VMs. 
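Put together, the leader bootstrap sequence above looks roughly like this (deployment, manifest, and all GUIDs are placeholders for the values you noted):

```shell
# Hedged sketch of the unignore/deploy sequence for the MySQL cluster.
DEPLOYMENT="cf-12345"     # placeholder: your cf deployment name
MANIFEST="cf-12345.yml"   # placeholder manifest name

# (On the leader VM, via bosh ssh, you have already run:
#   echo -n "NEEDS_BOOTSTRAP" | sudo tee /var/vcap/store/pxc-mysql/state.txt)

bosh -d "$DEPLOYMENT" unignore mysql/LEADER_GUID       # placeholder GUID
bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix         # brings up the leader
bosh -d "$DEPLOYMENT" unignore mysql/FOLLOWER1_GUID    # placeholder GUID
bosh -d "$DEPLOYMENT" unignore mysql/FOLLOWER2_GUID    # placeholder GUID
bosh -d "$DEPLOYMENT" deploy "$MANIFEST" --fix         # recreates the other two
```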

All 3 should now be in the "running" state. This can be verified using the same mysql-diag command from earlier. Then resume at step 13 above.

Warning: Do not modify the manifest (at any point) to have 1 MySQL VM instead of the High Availability 3. This carries much higher risk because it deletes two VMs' disks before performing the recreate. We have seen issues with the IaaS attaching the disk to the first MySQL VM, so we do not recommend this approach. It is mentioned primarily so you know why you should not reduce to 1 VM. We prefer to delete disks under controlled circumstances.


Service instance Tiles

Once the TAS for VMs deployment is healthy, move on to the service instance tiles, such as MySQL, Redis, RabbitMQ, and Spring. There may be other tiles not listed; any tile that requires communication with the TAS for VMs deployment will need to be redone.

Steps:

  • Run an Apply Changes on the BOSH Director and the service tile with "Recreate All On-Demand Service Instances" (not necessary on Ops Manager 2.10.20+) and "Upgrade All On-Demand Service Instances" selected. Some service tiles have an errand for this, such as MySQL and RabbitMQ. Make sure these errands are selected on all tiles that have a BOSH deployment.
  • If the Apply Changes fails, contact VMware Support


Note: In a large TAS environment (100+ VMs), VMware Support can help speed up the scan the BOSH Director performs of all bosh-agents in the "unresponsive_agent" state. Please contact VMware Support.