Bosh-health-check interfering with certificate rotation

Article ID: 297427

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

When rotating non-configurable leaf certificates as described in the documentation, you may encounter a "failed to regenerate leaf certificates" error after calling the Ops Manager API /api/v0/certificate_authorities/active/regenerate endpoint. The error may look like this:

{
"certificates": {
"regenerated": [],
"excluded": [],
"regenerate_failed": []
},
"safety_violations": [
{
"violation": "there is more than one signing version of a certificate authority",
"certificate_names": [
"/opsmgr/bosh_dns/tls_ca"
]
}
],
"errors": [
"failed to regenerate leaf certificates"
]
}

 

This output can be misleading. The "there is more than one signing version of a certificate authority" safety violation is well documented in the official documentation, but it does not point to the actual cause when both of the following are true:

  1. You are in the middle of a certificate rotation and have generated a new version of the /opsmgr/bosh_dns/tls_ca certificate or any of its leaf certificates, but have not yet run Apply Changes, which propagates the new version of a certificate to the deployments that use it.
  2. Healthwatch is performing a health check.
Calling the Ops Manager API /api/v0/certificate_authorities/active/regenerate endpoint while both conditions hold results in a race condition.
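
The error response shown earlier can also be checked programmatically. The following is a minimal Python sketch, not part of any Ops Manager tooling, that extracts the certificate names flagged with this particular safety violation. The function name violating_certificates is hypothetical, and the sample body is the response from this article.

```python
import json

# Violation text exactly as it appears in the response shown in this article.
TRANSIENT_VIOLATION = "there is more than one signing version of a certificate authority"


def violating_certificates(response_body, violation=TRANSIENT_VIOLATION):
    """Return the certificate names flagged with the given safety violation."""
    payload = json.loads(response_body)
    names = []
    for entry in payload.get("safety_violations", []):
        if entry.get("violation") == violation:
            names.extend(entry.get("certificate_names", []))
    return names


# Sample response body copied from the error output above.
sample = """
{
  "certificates": {"regenerated": [], "excluded": [], "regenerate_failed": []},
  "safety_violations": [
    {
      "violation": "there is more than one signing version of a certificate authority",
      "certificate_names": ["/opsmgr/bosh_dns/tls_ca"]
    }
  ],
  "errors": ["failed to regenerate leaf certificates"]
}
"""

print(violating_certificates(sample))
```

If the only flagged name is /opsmgr/bosh_dns/tls_ca and both conditions above apply, the violation may be the transient one described in this article rather than a genuine problem with the rotation.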

Resolution

The bosh-health-check app, which is part of the Healthwatch product, runs the following BOSH health check every 10 minutes:

  1. Create a deployment bosh-health
  2. Create a VM bosh-health-check
  3. Create a container on the above VM
  4. Delete the above container
  5. Delete the VM bosh-health-check
  6. Delete the deployment bosh-health

The actual health check takes about 30 seconds. When BOSH creates the bosh-health deployment, it uses the newest version of the /opsmgr/bosh_dns/tls_ca certificate and proceeds with the health check. If you execute maestro topology --name /opsmgr/bosh_dns/tls_ca while the health check is in progress, the output will be the following:

topology:
    - name: /opsmgr/bosh_dns/tls_ca
      certificate_id: 51e5a00a-8678-433f-8691-a43e2829765f
      signed_by: /opsmgr/bosh_dns/tls_ca
      versions:
        - version_id: 17d262db-a3b5-4e16-9dcc-5cb83fded06d
          active: true    <-------------------------------------- notice how the new version is now active
          deployment_names:
            - bosh-health <-------------------------------------- only one deployment using the new version
          signing: true
          certificate_authority: true
          generated: true
          valid_until: 2027-10-10T17:21:39Z
        - version_id: 88cccf62-e915-4424-b49e-bb7a5eb2b055
          active: true <----------------------------------------- notice how the old version is also active
          deployment_names: <------------------------------------ all other deployments are still using the old version
            - appMetrics-a0a1ece9204ce16d4b76
            - cf-0aeaa5a9fd7974c9262d
            - metric-store-d1053e87f58c4c4ca7f6
            - pas-windows-5bc09f68978a5084cc89
            - pas-windows-dmz-25743293b686e73bf957
            - p-healthwatch2-d636ef07d4d55c5df71a
            - p-healthwatch2-pas-exporter-724b029cf82039f20a46
            - p-isolation-segment-dmz-ecd77da9eacb4abb3870
            - p_spring-cloud-services-f98adbaa348e0164996a
          signing: true
          transitional: true
          certificate_authority: true
          valid_until: 2022-11-27T09:11:44Z


The above output shows that two versions of the /opsmgr/bosh_dns/tls_ca certificate are now active, but only the bosh-health deployment uses the newer version. If you execute maestro topology --name /opsmgr/bosh_dns/tls_ca after the health check is complete, the bosh-health deployment no longer exists and the output will be the following:

topology:
    - name: /opsmgr/bosh_dns/tls_ca
      certificate_id: 51e5a00a-8678-433f-8691-a43e2829765f
      signed_by: /opsmgr/bosh_dns/tls_ca
      versions:
        - version_id: 17d262db-a3b5-4e16-9dcc-5cb83fded06d
          certificate_authority: true <-------------------------- notice how the new version is not active anymore
          generated: true
          valid_until: 2027-10-10T17:21:39Z
        - version_id: 88cccf62-e915-4424-b49e-bb7a5eb2b055
          active: true
          deployment_names:
            - appMetrics-a0a1ece9204ce16d4b76
            - cf-0aeaa5a9fd7974c9262d
            - metric-store-d1053e87f58c4c4ca7f6
            - pas-windows-5bc09f68978a5084cc89
            - pas-windows-dmz-25743293b686e73bf957
            - p-healthwatch2-d636ef07d4d55c5df71a
            - p-healthwatch2-pas-exporter-724b029cf82039f20a46
            - p-isolation-segment-dmz-ecd77da9eacb4abb3870
            - p_spring-cloud-services-f98adbaa348e0164996a
          signing: true
          transitional: true
          certificate_authority: true
          valid_until: 2022-11-27T09:11:44Z
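
The difference between the two topology outputs can be expressed as a small check. The following Python sketch assumes the maestro topology output has already been parsed into a dictionary (with the explanatory arrows removed, since they are annotations and not part of the output). The function names are hypothetical, and the deployment lists in the sample data are abbreviated for readability.

```python
def active_signing_versions(topology, name):
    """Return (version_id, deployment_names) for each active signing version."""
    result = []
    for cert in topology.get("topology", []):
        if cert.get("name") != name:
            continue
        for version in cert.get("versions", []):
            if version.get("active") and version.get("signing"):
                result.append((version["version_id"], version.get("deployment_names", [])))
    return result


def held_only_by_health_check(topology, name, health_deployment="bosh-health"):
    """True when several signing versions are active and one of them is
    used solely by the health-check deployment."""
    versions = active_signing_versions(topology, name)
    return len(versions) > 1 and any(deps == [health_deployment] for _, deps in versions)


CA_NAME = "/opsmgr/bosh_dns/tls_ca"
OLD_VERSION = {
    "version_id": "88cccf62-e915-4424-b49e-bb7a5eb2b055",
    "active": True,
    # Abbreviated: the real output lists every deployment still on the old version.
    "deployment_names": ["cf-0aeaa5a9fd7974c9262d", "p-healthwatch2-d636ef07d4d55c5df71a"],
    "signing": True,
    "transitional": True,
    "certificate_authority": True,
}

# Topology while the health check is in progress.
during = {"topology": [{"name": CA_NAME, "versions": [
    {"version_id": "17d262db-a3b5-4e16-9dcc-5cb83fded06d", "active": True,
     "deployment_names": ["bosh-health"], "signing": True,
     "certificate_authority": True, "generated": True},
    OLD_VERSION,
]}]}

# Topology after the health check has completed.
after = {"topology": [{"name": CA_NAME, "versions": [
    {"version_id": "17d262db-a3b5-4e16-9dcc-5cb83fded06d",
     "certificate_authority": True, "generated": True},
    OLD_VERSION,
]}]}

print(held_only_by_health_check(during, CA_NAME))
print(held_only_by_health_check(after, CA_NAME))
```

The check distinguishes the transient state from other causes of the same violation: it only reports True when one of the active signing versions is used by the bosh-health deployment alone.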


Conclusion

This is an edge case, since the health check takes about 30 seconds to complete within each 10-minute window. Once Healthwatch has completed the health check and deleted the bosh-health deployment, the newest version of the certificate is deactivated and BOSH resumes trusting the older version, which is then the only active version. With only one active version, the safety violation is no longer triggered, and you should be able to call the Ops Manager API /api/v0/certificate_authorities/active/regenerate endpoint to proceed with the certificate rotation.
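
Because the health check occupies only about 30 seconds of every 10 minutes (roughly 5% of the time), waiting briefly and retrying the call is usually enough. The following Python sketch shows one way to do that. It is an illustration under stated assumptions, not an official procedure: call_regenerate is a hypothetical callable that you supply to perform the authenticated request to the regenerate endpoint and return the parsed JSON response, and the 60-second wait is an assumed value chosen to exceed the duration of the health check.

```python
import time

TRANSIENT_VIOLATION = "there is more than one signing version of a certificate authority"


def regenerate_with_retry(call_regenerate, attempts=5, wait_seconds=60, sleep=time.sleep):
    """Call call_regenerate() until the response no longer contains the
    transient safety violation, or until the attempts are used up.

    Returns the last response either way, so a violation that persists
    (and is therefore not caused by the health check) remains visible.
    """
    response = {}
    for attempt in range(1, attempts + 1):
        response = call_regenerate()
        violations = [v.get("violation") for v in response.get("safety_violations", [])]
        if TRANSIENT_VIOLATION not in violations:
            return response
        if attempt < attempts:
            sleep(wait_seconds)  # assumed wait, longer than the ~30 s health check
    return response


# Usage with a stub in place of the real API call: the first response
# carries the violation, the second one does not.
_stub_responses = iter([
    {"safety_violations": [{"violation": TRANSIENT_VIOLATION,
                            "certificate_names": ["/opsmgr/bosh_dns/tls_ca"]}],
     "errors": ["failed to regenerate leaf certificates"]},
    {"safety_violations": [], "errors": []},
])
result = regenerate_with_retry(lambda: next(_stub_responses), sleep=lambda seconds: None)
print(result["errors"])
```

If the violation is still present after all attempts, the cause is likely not the health check, and the official documentation for this safety violation applies.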

After the rotation is complete, Apply Changes should propagate the newer version of the certificate to all the deployments that use it, and the older version of the certificate will be deactivated.