Concourse Web Job Fails during rotation of postgres_ca certificate

Article ID: 297237

Products

Concourse for VMware Tanzu

Issue/Introduction

While rotating the postgres_ca certificate for a Concourse deployment, the deploy operation fails on the web VM.

Symptoms:

The web VM shows as failing:

# bosh -d <Concourse Deployment Name> vms
Using environment '10.###.###.###' as client 'ops_manager'

Task 42. Done

Deployment 'concourse'

Instance                                     Process State  AZ   IPs            VM CID                                   VM Type      Active  Stemcell
db/fc208407-####-####-####-ec10e0ec6699      running        az1  10.###.##.###  vm-40f1a839-8481-4c12-89d1-f49db3c7ccf2  large        true    bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.376
web/60633910-####-####-####-6b5b06c2f7ab     failing        az1  10.###.##.###  vm-25b0c491-f483-4740-8089-8668e166eb4b  medium.disk  true    bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.376
worker/4cf9c3a3-####-####-####-98a7a74e62d6  running        az1  10.###.##.###  vm-98d7456a-2d15-4cac-b586-badb53aaa70c  large.disk   true    bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.376

3 vms

Succeeded


The web job on the web VM is in an unknown state:

# bosh -d <Concourse Deployment Name> instances --ps
Using environment '10.###.###.###' as client 'ops_manager'

Task 42. Done

Deployment 'concourse'

Instance                                             Process               Process State  AZ   IPs            Deployment
...
web/60633910-####-####-####-6b5b06c2f7ab             -                     failing        az1  10.###.###.###  concourse
~                                                    bosh-dns              running        -    -              -
~                                                    bosh-dns-healthcheck  running        -    -              -
~                                                    bosh-dns-resolvconf   running        -    -              -
~                                                    credhub               running        -    -              -
~                                                    system-metrics-agent  running        -    -              -
~                                                    uaa                   running        -    -              -
~                                                    web                   unknown        -    -              -
...
3 instances

Succeeded


The log file /var/vcap/sys/log/web/web.stderr.log on the web VM contains the following error message:

error: failed to connect to database: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "postgresCA")
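
To confirm that a failing web job matches this article, the log can be searched for the error text. The helper below is a minimal sketch: the function name has_ca_trust_error is hypothetical, and the default path is the log location quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BOSH or Concourse): succeeds when the given
# log file contains the x509 trust error quoted in this article.
has_ca_trust_error() {
  # Default to the web job's stderr log; pass another path to override.
  local log_file="${1:-/var/vcap/sys/log/web/web.stderr.log}"
  grep -q 'x509: certificate signed by unknown authority' "$log_file"
}
```

On the web VM, running has_ca_trust_error with no argument checks the default log; a zero exit status means the error is present.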


Cause:
During the CA certificate rotation procedure, a deploy was run after the newly generated CA certificate was activated but before the leaf certificates were regenerated.

The internal secure Postgres operations file in the Concourse BOSH deployment configures the web VM to trust only one Postgres CA at a time. A deploy in this state makes the web VM trust only the new CA, while the Postgres database server is still using a certificate signed by the old, now transitional, CA. The web job therefore cannot verify the database certificate, and the deploy fails.
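
The same trust failure can be reproduced outside Concourse with plain OpenSSL. The following is a minimal sketch, not the deployment's actual configuration: the two throwaway CAs, the file names, and the db.example.internal subject are assumptions made for illustration. It signs a leaf certificate with an "old" CA and then verifies it against each CA in turn.

```shell
#!/usr/bin/env bash
set -eu

# Work in a scratch directory so no existing files are touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Two self-signed CAs sharing one subject name, standing in for the previous
# and the newly activated versions of postgres_ca (illustrative only).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=postgresCA" \
  -keyout old_ca.key -out old_ca.pem 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=postgresCA" \
  -keyout new_ca.key -out new_ca.pem 2>/dev/null

# A leaf certificate signed by the OLD CA, like the database server
# certificate that has not been regenerated yet.
openssl req -newkey rsa:2048 -nodes -subj "/CN=db.example.internal" \
  -keyout leaf.key -out leaf.csr 2>/dev/null
openssl x509 -req -in leaf.csr -CA old_ca.pem -CAkey old_ca.key \
  -CAcreateserial -days 1 -out leaf.pem 2>/dev/null

# Verify the leaf against each CA on its own, as a client that trusts
# only a single CA would.
if openssl verify -CAfile old_ca.pem leaf.pem >/dev/null 2>&1; then
  old_result=trusted
else
  old_result=rejected
fi
if openssl verify -CAfile new_ca.pem leaf.pem >/dev/null 2>&1; then
  new_result=trusted
else
  new_result=rejected
fi

echo "verified against old CA: $old_result"
echo "verified against new CA: $new_result"
```

The leaf should verify only against the CA that signed it, which is why the leaf certificates must be regenerated under the new CA before the web VM is switched to trusting it alone.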

Impacted Versions:
All current versions of the Concourse BOSH Deployment for VMware Tanzu can be affected by this issue.

Environment

Product Version: 6.7

Resolution

Solution 1:
For Concourse deployments running under an Ops Manager-managed BOSH director, run the following Maestro CLI command to regenerate the leaf certificates using the new version of the postgres_ca certificate:
maestro regenerate leaf --signed-by /p-bosh/<Concourse Deployment Name>/postgres_ca --skip-safety-check

The --skip-safety-check flag is required because the Concourse deployment VMs are split between trusting the new and the previous versions of the postgres_ca certificate. In that state the Maestro CLI always reports a safety violation.

Solution 2:
If the Maestro CLI cannot be used, for example when the Concourse deployment runs under an open source BOSH director, use the following CredHub CLI command to regenerate the leaf certificates:
credhub bulk-regenerate --signed-by=/p-bosh/<Concourse Deployment Name>/postgres_ca

After applying either solution, redeploy Concourse.