When rotating non-configurable leaf certificates as described in the documentation, you may encounter a "failed to regenerate leaf certificates" error after calling the Ops Manager API /api/v0/certificate_authorities/active/regenerate endpoint. The response may look like this:
{
  "certificates": {
    "regenerated": [],
    "excluded": [],
    "regenerate_failed": []
  },
  "safety_violations": [
    {
      "violation": "there is more than one signing version of a certificate authority",
      "certificate_names": [
        "/opsmgr/bosh_dns/tls_ca"
      ]
    }
  ],
  "errors": [
    "failed to regenerate leaf certificates"
  ]
}
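To tell this safety violation apart from other failures when scripting the rotation, the response body can be inspected programmatically. The following is a minimal sketch that assumes the response has the shape shown above. The function name signing_version_violations is hypothetical and is not part of any Ops Manager tooling.

```python
import json

# Violation text exactly as it appears in the response shown above.
VIOLATION = "there is more than one signing version of a certificate authority"


def signing_version_violations(response_text):
    """Return the CA names flagged with the multiple-signing-version violation."""
    body = json.loads(response_text)
    names = []
    for item in body.get("safety_violations", []):
        if item.get("violation") == VIOLATION:
            names.extend(item.get("certificate_names", []))
    return names


# Sample input copied from the response above.
sample = """
{
  "certificates": {"regenerated": [], "excluded": [], "regenerate_failed": []},
  "safety_violations": [
    {
      "violation": "there is more than one signing version of a certificate authority",
      "certificate_names": ["/opsmgr/bosh_dns/tls_ca"]
    }
  ],
  "errors": ["failed to regenerate leaf certificates"]
}
"""

if __name__ == "__main__":
    print(signing_version_violations(sample))
```

An empty list means the failure has a different cause and the rest of this article does not apply.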
The output can be misleading. It reports the "there is more than one signing version of a certificate authority" safety violation, which is well documented in our official documentation, but here the violation may be only a transient side effect of the Healthwatch BOSH health check.
The bosh-health-check app, which is part of the Healthwatch product, runs a BOSH health check every 10 minutes. During each run it creates a temporary deployment named bosh-health, performs the check, and deletes the deployment afterwards.
The health check itself takes about 30 seconds. When BOSH creates the bosh-health deployment, the deployment uses the newest version of the /opsmgr/bosh_dns/tls_ca certificate for the duration of the check. If you run maestro topology --name /opsmgr/bosh_dns/tls_ca while the health check is in progress, the output looks like this:
topology:
- name: /opsmgr/bosh_dns/tls_ca
  certificate_id: 51e5a00a-8678-433f-8691-a43e2829765f
  signed_by: /opsmgr/bosh_dns/tls_ca
  versions:
  - version_id: 17d262db-a3b5-4e16-9dcc-5cb83fded06d
    active: true                <-- notice how the new version is now active
    deployment_names:
    - bosh-health               <-- only one deployment using the new version
    signing: true
    certificate_authority: true
    generated: true
    valid_until: 2027-10-10T17:21:39Z
  - version_id: 88cccf62-e915-4424-b49e-bb7a5eb2b055
    active: true                <-- notice how the old version is also active
    deployment_names:           <-- all other deployments are still using the old version
    - appMetrics-a0a1ece9204ce16d4b76
    - cf-0aeaa5a9fd7974c9262d
    - metric-store-d1053e87f58c4c4ca7f6
    - pas-windows-5bc09f68978a5084cc89
    - pas-windows-dmz-25743293b686e73bf957
    - p-healthwatch2-d636ef07d4d55c5df71a
    - p-healthwatch2-pas-exporter-724b029cf82039f20a46
    - p-isolation-segment-dmz-ecd77da9eacb4abb3870
    - p_spring-cloud-services-f98adbaa348e0164996a
    signing: true
    transitional: true
    certificate_authority: true
    valid_until: 2022-11-27T09:11:44Z
The output above shows two active versions of the /opsmgr/bosh_dns/tls_ca certificate, but only the bosh-health deployment uses the newer one. If you run maestro topology --name /opsmgr/bosh_dns/tls_ca after the health check has completed, the bosh-health deployment no longer exists and the output looks like this:
topology:
- name: /opsmgr/bosh_dns/tls_ca
  certificate_id: 51e5a00a-8678-433f-8691-a43e2829765f
  signed_by: /opsmgr/bosh_dns/tls_ca
  versions:
  - version_id: 17d262db-a3b5-4e16-9dcc-5cb83fded06d
    certificate_authority: true <-- notice how the new version is not active anymore
    generated: true
    valid_until: 2027-10-10T17:21:39Z
  - version_id: 88cccf62-e915-4424-b49e-bb7a5eb2b055
    active: true
    deployment_names:
    - appMetrics-a0a1ece9204ce16d4b76
    - cf-0aeaa5a9fd7974c9262d
    - metric-store-d1053e87f58c4c4ca7f6
    - pas-windows-5bc09f68978a5084cc89
    - pas-windows-dmz-25743293b686e73bf957
    - p-healthwatch2-d636ef07d4d55c5df71a
    - p-healthwatch2-pas-exporter-724b029cf82039f20a46
    - p-isolation-segment-dmz-ecd77da9eacb4abb3870
    - p_spring-cloud-services-f98adbaa348e0164996a
    signing: true
    transitional: true
    certificate_authority: true
    valid_until: 2022-11-27T09:11:44Z
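The difference between the two topology outputs can be checked mechanically. The sketch below assumes the maestro output has already been loaded into a Python dictionary with a YAML parser of your choice, with the annotation arrows removed. The function names are hypothetical, and the sample data is abbreviated from the outputs above.

```python
def active_versions(topology, ca_name):
    """Return the versions of ca_name that are marked active: true."""
    for cert in topology.get("topology", []):
        if cert.get("name") == ca_name:
            return [v for v in cert.get("versions", []) if v.get("active")]
    return []


def only_health_check_on_new_version(topology, ca_name, health_deployment="bosh-health"):
    """True when two versions are active and one is used only by the health-check deployment."""
    versions = active_versions(topology, ca_name)
    if len(versions) != 2:
        return False
    return any(v.get("deployment_names") == [health_deployment] for v in versions)


CA = "/opsmgr/bosh_dns/tls_ca"

# Abbreviated topology while the health check is in progress.
during_check = {"topology": [{"name": CA, "versions": [
    {"version_id": "17d262db-a3b5-4e16-9dcc-5cb83fded06d", "active": True,
     "deployment_names": ["bosh-health"], "signing": True},
    {"version_id": "88cccf62-e915-4424-b49e-bb7a5eb2b055", "active": True,
     "deployment_names": ["cf-0aeaa5a9fd7974c9262d"], "signing": True},
]}]}

# Abbreviated topology after the health check has completed.
after_check = {"topology": [{"name": CA, "versions": [
    {"version_id": "17d262db-a3b5-4e16-9dcc-5cb83fded06d"},
    {"version_id": "88cccf62-e915-4424-b49e-bb7a5eb2b055", "active": True,
     "deployment_names": ["cf-0aeaa5a9fd7974c9262d"], "signing": True},
]}]}

if __name__ == "__main__":
    print(only_health_check_on_new_version(during_check, CA))
    print(only_health_check_on_new_version(after_check, CA))
```

If two versions are active and the extra one is used by deployments other than bosh-health, the situation is not the transient case described here and the documented safety-violation guidance applies instead.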
This is an edge case, because the health check takes only about 30 seconds out of every 10-minute window. Once Healthwatch has completed the health check and deleted the deployment, the newest version of the certificate is deactivated and BOSH goes back to trusting only the older version, which is now the only active version. With a single active version, the safety violation is no longer triggered, and you should be able to call the Ops Manager API /api/v0/certificate_authorities/active/regenerate endpoint again to proceed with the certificate rotation.
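Because the overlap lasts only about 30 seconds, an automated rotation can simply wait for it to pass before retrying. The following is a rough sketch of such a wait loop. The count_active argument is a hypothetical callable that you supply, for example one that runs maestro topology and counts the versions marked active. The timeout and interval defaults are illustrative assumptions, not recommended values.

```python
import time


def wait_for_single_active_version(count_active, timeout_s=120, interval_s=10, sleep=time.sleep):
    """Poll count_active() until at most one version is active, or give up after timeout_s."""
    waited = 0
    while True:
        if count_active() <= 1:
            return True   # safe to call the regenerate endpoint again
        if waited >= timeout_s:
            return False  # still more than one active version; investigate further
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Simulated readings: two active versions twice, then one.
    readings = iter([2, 2, 1])
    print(wait_for_single_active_version(lambda: next(readings), sleep=lambda s: None))
```

A False result after a timeout well beyond 30 seconds suggests that the second active version is not caused by the health check.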
After the rotation is complete, Apply Changes should propagate the newer version of the certificate to all deployments that use it, and the older version of the certificate will be deactivated.