While rotating the postgres_ca certificate for a Concourse deployment, the deploy operation fails on the web VM.
Symptoms:
The web VM shows as failing:
# bosh -d <Concourse Deployment Name> vms Using environment '10.###.###.###' as client 'ops_manager' Task 42. Done Deployment 'concourse' Instance Process State AZ IPs VM CID VM Type Active Stemcell db/fc208407-####-####-####-ec10e0ec6699 running az1 10.###.##.### vm-40f1a839-8481-4c12-89d1-f49db3c7ccf2 large true bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.376 web/60633910-####-####-####-6b5b06c2f7ab failing az1 10.###.##.### vm-25b0c491-f483-4740-8089-8668e166eb4b medium.disk true bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.376 worker/4cf9c3a3-####-####-####-98a7a74e62d6 running az1 10.###.##.### vm-98d7456a-2d15-4cac-b586-badb53aaa70c large.disk true bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.376 3 vms Succeeded
The web job on the web VM is in an unknown state:
# bosh -d <Concourse Deployment Name> instances --ps Using environment '10.###.###.###' as client 'ops_manager' Task 42. Done Deployment 'concourse' Instance Process Process State AZ IPs Deployment ... web/60633910-####-####-####-6b5b06c2f7ab - failing az1 10.###.###.### concourse ~ bosh-dns running - - - ~ bosh-dns-healthcheck running - - - ~ bosh-dns-resolvconf running - - - ~ credhub running - - - ~ system-metrics-agent running - - - ~ uaa running - - - ~ web unknown - - - ... 3 instances Succeeded
The the log file /var/vcap/sys/log/web/web.stderr.log contains the following error message:
error: failed to connect to database: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "postgresCA")
Cause:
During the rotation procedure for the CA certificates, a deploy was done after the newly generated CA certificate was activated but before the new leaf certificates were regenerated.
Due to how the Concourse BOSH deployment internal secure Postgres operations file is designed, performing a deploy in this state causes the deployment to fail because the Postgres database server is still using a certificate signed by the old, now transitional, CA. Because of how the operations file is written, the web VM will only trust one Postgres CA at a time.
Impacted Versions:
All current version of the Concourse BOSH Deployment for VMware Tanzu can be impacted by this issue.
maestro regenerate leaf --signed-by /p-bosh/<Concourse Deployment Name>/postgres_ca --skip-safety-check
credhub bulk-regenerate --signed-by=/p-bosh/<Concourse Deployment Name>/postgres_ca