Support recommends a dedicated Certificate Authority (CA) for services (/services/tls_ca). Service brokers use this CA, which is stored in BOSH CredHub, to sign leaf certificates (certs) for TLS-enabled service instances.
When this CA and its leaf certs need to be rotated, understanding how service-bound applications behave throughout the process helps shape a successful rotation strategy.
This article describes how applications behave at each stage of rotating these certs.
Currently, there are a few different recommended rotation procedures for this CA and its leaf certs. The specific rotation method to use for your environment depends on factors such as the currently active deployed product versions. See Rotating the Services TLS CA and Its Leaf Certificates for more details.
If you are on Redis for VMware Tanzu v2.2 or v2.3, there is a known bug to be aware of when rotating this CA and its leaf certs. See Redis On Demand Services do not rotate /services/tls_ca leaf certs resulting in app downtime for more information.
When an on-demand TLS enabled service instance is created or when updating an existing non-TLS on-demand service instance to be TLS enabled, the following takes place:
1. BOSH generates a leaf cert signed by the /services/tls_ca CA.
2. BOSH then stores this new leaf cert in BOSH CredHub with the format /p-bosh/service-instance_<SERVICE-GUID>/unique-name-per-service-type. Then BOSH supplies the private key and certificates to the VMs in that service instance.
Note: BOSH supplies the leaf cert to a service instance VM during the "updating instance" step. At that time, BOSH makes sure the leaf cert is signed by the CA that currently sits in /services/tls_ca.
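As a small illustration of the naming format above, the following sketch filters a list of CredHub credential names down to those stored for one service instance. The GUIDs and the cert name `example_server_cert` are hypothetical placeholders; only the `/p-bosh/service-instance_<SERVICE-GUID>/` prefix comes from the format described above.

```javascript
// Sketch: pick out the CredHub entries that belong to one service instance.
// Only the path prefix follows the format quoted above; names are illustrative.

// Build the CredHub path prefix for a service instance GUID.
function leafCertPrefix(serviceGuid) {
  return `/p-bosh/service-instance_${serviceGuid}/`;
}

// Keep only the credential names stored under that prefix.
function credentialsForInstance(credentialNames, serviceGuid) {
  const prefix = leafCertPrefix(serviceGuid);
  return credentialNames.filter((name) => name.startsWith(prefix));
}

// Hypothetical credential names, for example copied from a CredHub listing.
const credentialNames = [
  "/services/tls_ca",
  "/p-bosh/service-instance_1111-aaaa/example_server_cert",
  "/p-bosh/service-instance_2222-bbbb/example_server_cert",
];

console.log(credentialsForInstance(credentialNames, "1111-aaaa"));
```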
The new CA is not the current active signing /services/tls_ca. This is important to highlight because when you Apply Changes to distribute the new CA to the environment to be trusted, the old CA is still the current active signing /services/tls_ca. Remember that when a TLS-enabled service instance (SI) VM is created, BOSH ensures it has a leaf cert signed by the current active signing /services/tls_ca.
Distributing this CA to the environment involves an Apply Changes.
App-A sits on Diego-Cell-1, which has not been updated yet.
Diego-Cell-2 gets updated first, which means Diego-Cell-2 will have both the old and new CA upon updating.
Diego-Cell-1 is next to update. To avoid application downtime, Diego issues a request to relocate the running containers on Diego-Cell-1 before updating it. Diego then starts an instance of App-A on Diego-Cell-2. Because Diego-Cell-2 has both the old and new CAs, the new instance also has a copy of both. Once the new instance is healthy, Diego kills the container on Diego-Cell-1, and the rolling update continues.
Warning: Application instance downtime can temporarily occur during this step if Diego does not have the resources to replace the running containers on the Diego cell being updated. This is a general concept, not specific to this certificate rotation process, but it is worth mentioning here for visibility. When a Diego cell updates, Diego requests relocation of the running instances on that cell and sets a timeout on them (typically 10 minutes). If no other suitable Diego cell has the capacity to accept the replacement instances, then once the timeout expires on an initial instance, that instance cannot be recreated until the resources are available to Diego. As a guideline, ensure there is enough free capacity to keep all application instances running with X Diego cells missing, where X is the max-in-flight value.
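The capacity guideline can be approximated with a rough check like the one below. This is a simplified sketch, not how Diego schedules work: it considers memory only, ignores disk and placement constraints, and assumes the largest cells are the ones offline. The cell objects and the function name `canAbsorbCellLoss` are hypothetical.

```javascript
// Rough capacity check: can the remaining cells hold every running instance
// while `maxInFlight` cells are offline? Memory-only model, for illustration.
function canAbsorbCellLoss(cells, maxInFlight) {
  if (maxInFlight >= cells.length) {
    return false; // nothing would be left to run the workload
  }
  // Memory currently consumed by all application instances.
  const totalUsedMb = cells.reduce((sum, cell) => sum + cell.usedMb, 0);
  // Worst case: the cells with the most capacity are the ones being updated.
  const byCapacity = [...cells].sort((a, b) => b.capacityMb - a.capacityMb);
  const remainingMb = byCapacity
    .slice(maxInFlight)
    .reduce((sum, cell) => sum + cell.capacityMb, 0);
  return remainingMb >= totalUsedMb;
}

// Hypothetical foundation with three identical Diego cells.
const diegoCells = [
  { name: "Diego-Cell-1", capacityMb: 32768, usedMb: 20000 },
  { name: "Diego-Cell-2", capacityMb: 32768, usedMb: 20000 },
  { name: "Diego-Cell-3", capacityMb: 32768, usedMb: 20000 },
];

console.log(canAbsorbCellLoss(diegoCells, 1)); // true: 65536 MB left for 60000 MB used
console.log(canAbsorbCellLoss(diegoCells, 2)); // false: 32768 MB left for 60000 MB used
```

A passing result here is only a necessary condition. Individual instances still have to fit on individual cells, so leave additional headroom.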
Once the new CA is trusted in the environment, the next phase of the rotation procedure includes setting the new CA as the current active signing /services/tls_ca. After doing this, any TLS enabled service instance VM will be updated with a new leaf cert signed by that new CA only when the VM goes through the "updating instance" BOSH task for that VM.
At an Operations level, this is safe because both the old and new CAs are trusted in the environment, so the VMs update properly.
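The reasoning behind distributing the new CA before signing with it can be shown with a toy model. The sketch below reduces certificate validation to "is the issuer in the trust store", which is an assumption made purely for illustration. Real TLS validation also checks signatures, validity dates, and hostnames.

```javascript
// Toy model: a leaf cert is accepted when its issuing CA is in the trust store.
function isTrusted(leafCert, trustStore) {
  return trustStore.includes(leafCert.issuer);
}

const OLD_CA = "old /services/tls_ca";
const NEW_CA = "new /services/tls_ca";

// Trust stores at different points in the rotation.
const trustBeforeDistribution = [OLD_CA];
const trustAfterDistribution = [OLD_CA, NEW_CA];

// Leaf certs signed before and after the new CA becomes the signing CA.
const leafFromOldCa = { issuer: OLD_CA };
const leafFromNewCa = { issuer: NEW_CA };

// After distribution, both generations of leaf cert are accepted.
console.log(isTrusted(leafFromOldCa, trustAfterDistribution)); // true
console.log(isTrusted(leafFromNewCa, trustAfterDistribution)); // true

// A client that never received the new CA rejects the new leaf cert.
console.log(isTrusted(leafFromNewCa, trustBeforeDistribution)); // false
```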
At a Developer level, there is a slight risk of application downtime in a small set of specific scenarios. These risks are described in the section below.
Applications can potentially experience downtime after a service instance upgrade. Before diving into the scenarios where this can happen, let's refresh our understanding of how TLS works between a client and a TLS-enabled server. The general workflow is this:
Let us relate these four steps to a Tanzu Application Service (TAS) for VMs deployed application instance bound to a TLS enabled service instance.
The above is true for VMware for RabbitMQ, VMware Tanzu GemFire, and Redis for VMware. However, MySQL is slightly different.
To illustrate, consider how web traffic works. If a client makes a request to a web server on port 80, it will receive no certificate. If a client makes a request to a web server on port 443, it receives a certificate.
Similarly, VMware for RabbitMQ, VMware Tanzu GemFire, and VMware for Redis listen on different ports for TLS and non-TLS connection requests. When a client makes a request to one of these services on its TLS listening port, the server sends its TLS certificate (the leaf cert signed by /services/tls_ca) back to the client.
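To see which CA signed the leaf cert that a service presents on its TLS port, a Node client can inspect the peer certificate. The following is a minimal sketch using Node's built-in `tls` module. The helper names are hypothetical, and the host, port, and CA bundle must be supplied by you. Node does not read the container trust store by default, so the bundle is passed in explicitly.

```javascript
const tls = require("tls");

// Reduce the object returned by socket.getPeerCertificate() to a few fields.
function summarizePeerCertificate(cert) {
  return {
    subject: cert.subject && cert.subject.CN,
    issuer: cert.issuer && cert.issuer.CN,
    validTo: cert.valid_to,
  };
}

// Connect to a TLS port and report who issued the server's leaf cert.
// rejectUnauthorized is disabled only so the cert can be inspected even when
// the supplied CA bundle does not contain the signing CA yet. Do not disable
// it for real application traffic.
function inspectServerCertificate(host, port, caBundlePem) {
  return new Promise((resolve, reject) => {
    const options = { host, port, ca: caBundlePem, rejectUnauthorized: false };
    const socket = tls.connect(options, () => {
      const summary = summarizePeerCertificate(socket.getPeerCertificate());
      // True when the presented chain validates against caBundlePem.
      summary.trustedByBundle = socket.authorized;
      socket.end();
      resolve(summary);
    });
    socket.on("error", reject);
  });
}

// Example call (not executed here; host, port, and bundle are placeholders):
// inspectServerCertificate(host, tlsPort, fs.readFileSync(bundlePath))
//   .then(console.log);
```

If `trustedByBundle` is false after a service instance update, the bundle the application uses most likely does not contain the new CA.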
MySQL is a little different. It listens for TLS and non-TLS connection requests on the same port.
If a client wants to establish a TLS connection with MySQL, it must proactively request TLS and provide one or more CA certs with the request to the server. The CA cert provided must be the CA that signed the server’s certificate. For TAS for VMs, this is the /services/tls_ca.
Let's examine the VCAP_SERVICES environment variable for an application bound to MySQL.
Notice that this provides two ways to communicate with the MySQL instance:
1. The jdbcUrl located at p.mysql.credentials.jdbcUrl.
2. A combination of host + port + user + password + database-name + ca_certificate, all located under p.mysql.credentials.
Java and Spring applications only need the jdbcUrl for service connectivity. Notice that the jdbcUrl connection string supplies /etc/ssl/certs/ca-certificates.crt with the request. This file contains all trusted certs in the container, appended together. Therefore, Java and Spring applications do not need to be unbound and rebound to the service instance, because they do not rely on CA certificate information derived from the VCAP_SERVICES environment variable.
Applications written in other languages (Go, Python, Node, Ruby) cannot use the jdbcUrl because it is Java specific. It is possible to make a custom buildpack and use JDBC wrappers in other languages, but this is very uncommon. Most likely, these applications use the information listed in option 2. This is why the CA cert is provided in VCAP_SERVICES: an application can reference it when making a TLS connection request to a TLS-enabled MySQL service instance. This means:
When a MySQL service instance is updated with a new leaf certificate, all applications that do not use the jdbcUrl (essentially all non-Java/Spring applications) must be unbound from and rebound to the service instance, then restaged, to receive the new CA as part of the VCAP_SERVICES environment variable. Until this happens, the applications cannot establish a secure connection to the service and will most likely keep crashing until those steps are performed.
Note: This assumes that the applications get the CA from VCAP_SERVICES. It is entirely possible to reference the contents of /etc/ssl/certs/ca-certificates.crt instead, if the language or framework permits it. MySQL accepts a list of CA certs; it does not have to be only one. Here is a Node application implementation for comparison:
/* This utilizes the CA in VCAP_SERVICES - implementing applications will need rebinding and restaging. */
mysql_creds["ca_certificate"] = vcap_services["p.mysql"][0]["credentials"]["tls"]["cert"]["ca"];
/* This would utilize the list of trusted certificates in the trust store - implementing applications will not need rebinding or restaging. */
mysql_creds["ca_certificate"] = fs.readFileSync("/etc/ssl/certs/ca-certificates.crt");
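Putting the two approaches together, an application could prefer the container trust store and fall back to the CA in VCAP_SERVICES only when the bundle cannot be read. The sketch below is one possible arrangement, not the article's prescribed method. The credential field names `hostname`, `port`, `username`, `password`, and `name` are assumptions about the binding. The `tls.cert.ca` path is taken from the snippet above. The `ssl: { ca }` shape matches what the `mysql` and `mysql2` npm packages accept.

```javascript
const fs = require("fs");

const TRUST_BUNDLE_PATH = "/etc/ssl/certs/ca-certificates.crt";

// Build connection options for a MySQL client from VCAP_SERVICES.
// `readBundle` is injectable so the function can be exercised without the file.
function buildMysqlOptions(
  vcapServices,
  readBundle = () => fs.readFileSync(TRUST_BUNDLE_PATH, "utf8")
) {
  const creds = vcapServices["p.mysql"][0].credentials;
  let ca;
  try {
    // Preferred: the full trust bundle, which picks up a rotated CA
    // without rebinding or restaging.
    ca = readBundle();
  } catch (err) {
    // Fallback: the CA captured at bind time. After a rotation this
    // requires unbind, rebind, and restage.
    ca = creds.tls.cert.ca;
  }
  return {
    host: creds.hostname, // assumed field name
    port: creds.port,
    user: creds.username, // assumed field name
    password: creds.password,
    database: creds.name, // assumed field name
    ssl: { ca },
  };
}

// Hypothetical binding, standing in for JSON.parse(process.env.VCAP_SERVICES).
const sampleVcapServices = {
  "p.mysql": [{
    credentials: {
      hostname: "mysql.example.internal", port: 3306,
      username: "app_user", password: "placeholder", name: "service_instance_db",
      tls: { cert: { ca: "-----BEGIN CERTIFICATE-----\n(placeholder)\n-----END CERTIFICATE-----" } },
    },
  }],
};

console.log(buildMysqlOptions(sampleVcapServices).host);
```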
Additionally, there are a few other things to consider regarding applications losing connectivity after a service instance update. The following applies to all application frameworks and all service types.
If the application doesn’t use the default trust store, make sure that the specified trust store contains the new /services/tls_ca CA so that the application can trust the connection.
If an application has no reconnect logic, or its reconnect logic mishandles connection errors, the application may simply need to be restarted to establish a fresh connection.
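For applications that lack reconnect logic, a generic retry wrapper with capped exponential backoff is one way to avoid a manual restart. The sketch below is not tied to any particular service client. `connectWithRetry`, `backoffDelayMs`, and their default values are hypothetical, and `connect` stands for whatever function opens a new connection in your application.

```javascript
// Delay before the next attempt: exponential growth, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `connect` until it succeeds or maxAttempts is reached.
// `connect` must open a brand-new connection each time so that the TLS
// handshake is repeated against the server's current leaf cert.
async function connectWithRetry(connect, maxAttempts = 5, sleep = defaultSleep) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}

// Delays used between the five default attempts.
console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

Retrying only helps when the failure is a stale connection. If the client does not trust the new CA, every attempt fails until the trust store or binding is updated.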
For a successful strategy to update service instances with minimal to no application downtime, consider the following:
Most applications use the default trust store. However, if an application specifies its own trust store, be sure that the new CA is present in the specified trust store.
Note: This applies only to MySQL-bound applications, not to the other service types. If the application is not Java or Spring based and uses the CA cert provided in VCAP_SERVICES, the application must be unbound, rebound, and restaged in order to receive the new CA cert. Until this happens, the application will be unable to communicate with the service.
If any other service-bound applications crash after the service instance updates, try restarting them. The application may simply lack working logic to reconnect after a connection error.
Ensure that Diego has enough free capacity to keep all application instances running with X number of Diego cells missing, where X is the max-in-flight value.
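For the custom trust store item in the list above, a quick way to confirm that a PEM bundle contains the new CA is to compare certificate blocks textually. The sketch below assumes both the bundle and the CA are PEM encoded. The function names are hypothetical, and the certificate bodies in the example are placeholders, not real certificates.

```javascript
// Split a PEM bundle into its individual certificate blocks.
function splitPemBundle(bundle) {
  const blocks = bundle.match(
    /-----BEGIN CERTIFICATE-----[\s\S]*?-----END CERTIFICATE-----/g
  );
  return blocks || [];
}

// Strip all whitespace so line wrapping differences do not matter.
function normalizePem(pem) {
  return pem.replace(/\s+/g, "");
}

// True when the bundle contains a block identical to the given certificate.
function bundleContainsCert(bundle, certPem) {
  const wanted = normalizePem(certPem);
  return splitPemBundle(bundle).some((block) => normalizePem(block) === wanted);
}

// Placeholder PEM blocks standing in for the old and new CA.
const oldCaPem = "-----BEGIN CERTIFICATE-----\nT0xEQ0E=\n-----END CERTIFICATE-----\n";
const newCaPem = "-----BEGIN CERTIFICATE-----\nTkVXQ0E=\n-----END CERTIFICATE-----\n";

const bundleBeforeUpdate = oldCaPem;
const bundleAfterUpdate = oldCaPem + newCaPem;

console.log(bundleContainsCert(bundleBeforeUpdate, newCaPem)); // false
console.log(bundleContainsCert(bundleAfterUpdate, newCaPem)); // true
```

A textual match is enough for a sanity check. Stores in other formats, such as a Java keystore, need that format's own tooling.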