Understanding application behavior while rotating /services/tls_ca leaf certs

Article ID: 293418

Products

Operations Manager

Issue/Introduction

Support recommends using a dedicated Certificate Authority (CA) for services, stored in BOSH CredHub at /services/tls_ca. Service brokers use this CA to sign leaf certificates (certs) for TLS-enabled service instances.

When the need arises to rotate this CA and its leaf certs, understanding how service-bound applications behave throughout the entire process helps shape a successful rotation strategy.

This article describes how applications behave during each phase of rotating these certs, and which applications are at risk of losing service connectivity.


Important

Currently, there are a few different recommended rotation procedures for this CA and its leaf certs. The rotation method to use for your environment depends on factors such as the product versions currently deployed. See Rotating the Services TLS CA and Its Leaf Certificates for more details.

If you are on Redis for VMware Tanzu v2.2 or v2.3, there is a known bug to be aware of when rotating this CA and its leaf certs. See Redis On Demand Services do not rotate /services/tls_ca leaf certs resulting in app downtime for more information.


High level overview of /services/tls_ca

When an on-demand TLS-enabled service instance is created, or when an existing non-TLS on-demand service instance is updated to be TLS-enabled, the following takes place:

1. BOSH generates a leaf cert signed by the /services/tls_ca CA. 

2. BOSH then stores this new leaf cert in BOSH CredHub with the format /p-bosh/service-instance_<SERVICE-GUID>/unique-name-per-service-type. Then BOSH supplies the private key and certificates to the VMs in that service instance.

Note: BOSH supplies the leaf cert to a service instance VM during the "updating instance" step. At that time, BOSH ensures that the leaf cert is signed by the CA currently stored in /services/tls_ca.

Resolution

All rotation procedures share two common phases in their steps:

Phase 1 - The introduction of the new CA to the environment so its future leaf certs can be trusted.

Phase 2 - Setting the new CA as the current active signing /services/tls_ca and rotating the leaf certs.


Phase 1 Explained

During Phase 1, the new CA is not yet the active signing CA at /services/tls_ca. This is important to highlight because when you Apply Changes to distribute the new CA to the environment as a trusted CA, the old CA is still the active signing /services/tls_ca. Remember, when a TLS-enabled service instance VM is created, BOSH ensures it has a leaf cert signed by the current active signing /services/tls_ca.

Distributing the new CA to the environment requires an Apply Changes.

During this first Apply Changes, the original /services/tls_ca is still the signing CA. This is intentional: had /services/tls_ca already been changed, the upgrade-all-service-instances errand would have rotated the leaf certs during this Apply Changes. Up to this point, no changes to the leaf certs have occurred, so applications remain unaffected with regard to service connectivity. Additionally, the old CA has not been removed from the trusted certificates list in the BOSH, TAS, and Isolation Segment tiles. In fact, after the Apply Changes completes, all applications have access to both the old and new CAs because the Diego Cells have been rolled.

Diego, as part of container creation, populates the directory /etc/ssl/certs with the trusted certificate authorities known to the foundation. This is the application's default trust store, and you can view it by opening an SSH session into a running application instance with $ cf ssh <APPNAME>. The /etc/ssl/certs/ca-certificates.crt file contains all of these certificates appended together.

The following workflow shows how an application gets both CAs in its trust store:
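As an illustration, the following Node.js sketch splits a PEM bundle such as /etc/ssl/certs/ca-certificates.crt into its individual certificates and checks whether a given CA is among them. The function names (splitPemBundle, bundleContainsCa, countTrustedCerts) are hypothetical helpers introduced for this example and are not part of any platform tooling. The comparison is a plain text match on the PEM body, which is an assumption that suffices for a quick check.

```javascript
const fs = require("fs");

// Path of the container trust store bundle, as described in this article.
const TRUST_STORE_BUNDLE = "/etc/ssl/certs/ca-certificates.crt";

// Split a PEM bundle into its individual certificate blocks.
function splitPemBundle(bundleText) {
  const pattern = /-----BEGIN CERTIFICATE-----[\s\S]*?-----END CERTIFICATE-----/g;
  return bundleText.match(pattern) || [];
}

// Hypothetical helper: true if the given CA (PEM text) appears in the bundle.
// Whitespace is ignored so that line-ending differences do not matter.
function bundleContainsCa(bundleText, caPem) {
  const normalize = (pem) => pem.replace(/\s+/g, "");
  const wanted = normalize(caPem);
  return splitPemBundle(bundleText).some((cert) => normalize(cert) === wanted);
}

// Hypothetical helper: count the certificates in the bundle on disk.
function countTrustedCerts(bundlePath = TRUST_STORE_BUNDLE) {
  return splitPemBundle(fs.readFileSync(bundlePath, "utf8")).length;
}
```

Run inside a container after Phase 1, a check like bundleContainsCa(bundle, newCaPem) would be expected to return true for both the old and the new CA.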
 
  1. App-A sits on Diego-Cell-1, which has not updated yet.

  2. Diego-Cell-2 gets updated first, which means Diego-Cell-2 will have both the old and new CA upon updating.

  3. Diego-Cell-1 is next to update. To avoid application downtime, Diego issues a request to relocate the running containers on Diego-Cell-1 before updating it. Diego then spins up an instance of App-A on Diego-Cell-2. Because Diego-Cell-2 has both the old and new CAs, the new instance also has a copy of both when it is created. Once the new instance is healthy, Diego kills the container on Diego-Cell-1 and the process continues as part of the rolling update.

Warning: Application instance downtime can temporarily occur during this step if Diego does not have the resources to replace the running containers on the Diego cell being updated. This is a general concept, not specific to this certificate rotation process, but it is worth mentioning here for visibility. When a Diego cell updates, a request is issued to relocate the instances running on that cell and a timeout is set on them (typically 10 minutes). If no other suitable Diego cell has the capacity to accept the replacement instances, then once the timeout expires on the initial instance, that instance cannot be recreated until resources become available to Diego. As a guideline, ensure there is enough free capacity to keep all application instances running with X Diego cells missing, where X is the max-in-flight value.
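The capacity guideline in the warning above can be approximated with a short calculation. The sketch below is a simplified model, not a platform API: it assumes each Diego cell is described by a hypothetical { capacityMb, usedMb } object, considers memory only, and ignores how individual instances pack onto cells. It conservatively assumes the cells taken out of service are the largest ones.

```javascript
// Simplified headroom check for a rolling Diego cell update.
// cells: array of { capacityMb, usedMb } (hypothetical shape for this example).
// maxInFlight: number of cells that may be updating at the same time.
function hasHeadroomForRollingUpdate(cells, maxInFlight) {
  if (maxInFlight >= cells.length) {
    return false; // no cell would be left to host the relocated instances
  }
  // Total memory currently consumed by application instances.
  const totalUsedMb = cells.reduce((sum, cell) => sum + cell.usedMb, 0);
  // Worst case: the largest cells are the ones being updated.
  const capacities = cells.map((cell) => cell.capacityMb).sort((a, b) => b - a);
  const remainingCapacityMb = capacities
    .slice(maxInFlight)
    .reduce((sum, capacity) => sum + capacity, 0);
  return totalUsedMb <= remainingCapacityMb;
}
```

For example, three cells of 100 MB each with 60 MB used pass the check for a max-in-flight of 1 (180 MB used, 200 MB remaining) but fail for a max-in-flight of 2 (100 MB remaining).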
 

Phase 2 Explained

Once the new CA is trusted in the environment, the next phase of the rotation procedure includes setting the new CA as the current active signing /services/tls_ca. After doing this, any TLS enabled service instance VM will be updated with a new leaf cert signed by that new CA only when the VM goes through the "updating instance" BOSH task for that VM.

At an Operations level, this is safe because both the old and new CAs are trusted in the environment, so the machines update properly.

At a Developer level, there is a slight risk of application downtime in a small set of specific scenarios. For more information on these risks, read the section below.


Application Caveats During Leaf Cert Rotation

Applications can potentially experience downtime after a service instance upgrade. Before diving into the scenarios where this can happen, let's refresh our understanding of how TLS works between a client and a TLS-enabled server. The general workflow is this:
 

  1. The client requests a connection to the server (specifically, to the TLS listening port).
  2. The server sends back a certificate to the client.
  3. The client verifies if the certificate is signed by a trusted CA.
  4. If the client can trust the CA, a TLS connection can take place.

Let us relate these four steps to a Tanzu Application Service (TAS) for VMs deployed application instance bound to a TLS enabled service instance.
 

  1. An application instance (client) requests a connection to a TLS enabled service instance (server). 
  2. The TLS enabled service instance returns its TLS certificate. This is the leaf cert that was generated from the CA /services/tls_ca.
  3. The application instance then compares this certificate to its trust store to verify if it was signed by a known CA. Remember, Diego places trusted certificates in the default trust store located within the container at /etc/ssl/certs.
  4. If the container trust store has the CA that the server’s certificate was signed by, a TLS connection can proceed. If not, there will be a handshake error.

The above is true for VMware for RabbitMQ, VMware Tanzu GemFire, and Redis for VMware. However, MySQL is slightly different.

To illustrate, consider how web traffic works. If a client makes a request to a web server on port 80, it will receive no certificate. If a client makes a request to a web server on port 443, it receives a certificate. 

Similarly, VMware for RabbitMQ, VMware Tanzu GemFire, and Redis for VMware listen on different ports for TLS and non-TLS connection requests. When a client makes a request to one of these services on its TLS listening port, the server sends its TLS certificate, the leaf cert signed by /services/tls_ca, back to the client.

MySQL is a little different. It listens for TLS and non-TLS connection requests on the same port.

If a client wants to establish a TLS connection with MySQL, it must explicitly request TLS and provide one or more CA certs with the request. The CA cert provided must be the CA that signed the server’s certificate. For TAS for VMs, this is the /services/tls_ca.
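From the client side, the handshake against a dedicated TLS port can be sketched with Node's built-in tls module. This is a minimal illustration and not how any particular service client library is implemented. buildTlsOptions and checkTlsHandshake are hypothetical names, and the host and port must come from the application's own service binding.

```javascript
const fs = require("fs");
const tls = require("tls");

// Default container trust store, as described in this article.
const DEFAULT_TRUST_STORE = "/etc/ssl/certs/ca-certificates.crt";

// Build the options for a verified TLS connection.
// caPem holds the CA certificates the client is willing to trust.
function buildTlsOptions(host, port, caPem) {
  return {
    host,
    port,
    ca: caPem,
    rejectUnauthorized: true, // fail the handshake if no trusted CA signed the cert
  };
}

// Attempt a handshake; resolves true on success and rejects on any error,
// including a server certificate signed by a CA missing from the trust store.
function checkTlsHandshake(host, port, trustStorePath = DEFAULT_TRUST_STORE) {
  const options = buildTlsOptions(host, port, fs.readFileSync(trustStorePath));
  return new Promise((resolve, reject) => {
    const socket = tls.connect(options, () => {
      socket.end();
      resolve(true);
    });
    socket.once("error", reject);
  });
}
```

If the leaf cert returned by the service was signed by a CA that is absent from the file at trustStorePath, the promise rejects, which corresponds to the handshake error in step 4.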

Consider the VCAP_SERVICES environment variable for an application bound to MySQL. It provides two ways to connect to the MySQL instance:
 

  1. The jdbcUrl located at p.mysql.credentials.jdbcUrl.

  2. A combination of host + port + user + password + database-name + ca_certificate, all located under p.mysql.credentials.
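The two options can be sketched as follows. The sample VCAP_SERVICES value is hypothetical and contains only placeholders. The jdbcUrl key and the tls.cert.ca path appear in this article, while the remaining key names (hostname, port, username, password, name) are assumptions about the binding layout and should be verified against the actual binding of your application.

```javascript
// Hypothetical VCAP_SERVICES value. Every value below is a placeholder.
const sampleVcapServices = JSON.stringify({
  "p.mysql": [
    {
      credentials: {
        hostname: "mysql.example.internal",
        port: 3306,
        username: "placeholder-user",
        password: "placeholder-password",
        name: "placeholder_db",
        jdbcUrl: "jdbc:mysql://<placeholder>",
        tls: { cert: { ca: "<PEM of the CA at bind time>" } },
      },
    },
  ],
});

// Extract both connection options from a VCAP_SERVICES JSON string.
// Returns null when no p.mysql binding is present.
function getMysqlBinding(vcapServicesJson) {
  const services = JSON.parse(vcapServicesJson || "{}");
  const instances = services["p.mysql"] || [];
  if (instances.length === 0) {
    return null;
  }
  const creds = instances[0].credentials;
  return {
    jdbcUrl: creds.jdbcUrl, // option 1
    host: creds.hostname, // option 2 starts here
    port: creds.port,
    user: creds.username,
    password: creds.password,
    database: creds.name,
    caCertificate: creds.tls && creds.tls.cert ? creds.tls.cert.ca : undefined,
  };
}
```

In a running application, the input would be process.env.VCAP_SERVICES. Note that caCertificate reflects the CA as it was when the application was bound, which is why it can become stale after a rotation.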

Java and Spring applications only need to use the jdbcUrl for service connectivity. Notice that the jdbcUrl connection string supplies /etc/ssl/certs/ca-certificates.crt with the request. This file includes all trusted certs in the container appended together. Therefore, Java and Spring applications do not need to be unbound and rebound to the service instance, because they do not rely on CA certificate information derived from the VCAP_SERVICES environment variable.

Applications written in other languages (Go, Python, Node.js, Ruby) cannot use the jdbcUrl directly because it is Java-specific. It is possible to make a custom buildpack and use JDBC wrappers in other languages, though this is very uncommon. Most likely, these applications use the information listed in option 2. This is why the CA cert is provided in VCAP_SERVICES: an application can reference it when making a TLS connection request to a TLS-enabled MySQL service instance. This means:
 

  • When a MySQL service instance is updated with a new leaf certificate, all applications that do not use the jdbcUrl (essentially all non-Java/Spring apps) must be unbound and rebound to the service instance, then restaged, to receive the new CA as part of the VCAP_SERVICES environment variable. Until this happens, the applications cannot establish a secure connection to the service and will most likely keep crashing until those steps (unbind, rebind, restage) are performed.

Note: This assumes that the applications get the CA from VCAP_SERVICES. It is entirely possible to reference the contents of /etc/ssl/certs/ca-certificates.crt instead, if the language or framework permits it. MySQL accepts a list of CA certs; it does not have to be only one. Here is a Node application implementation for comparison:

const fs = require("fs");

/* VCAP_SERVICES is set by the platform for bound applications. */
const vcap_services = JSON.parse(process.env.VCAP_SERVICES);
const mysql_creds = {};

/* Use ONE of the two assignments below. */

/* This uses the CA in VCAP_SERVICES. Applications that do this need rebinding and restaging. */
mysql_creds["ca_certificate"] = vcap_services["p.mysql"][0]["credentials"]["tls"]["cert"]["ca"];

/* This uses the list of trusted certificates in the trust store. Applications that do this do not need rebinding or restaging. */
mysql_creds["ca_certificate"] = fs.readFileSync("/etc/ssl/certs/ca-certificates.crt");
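Because a list of CA certs is accepted, an application could also supply both sources at once. The following is a sketch under that assumption. buildCaList is a hypothetical helper, and how the resulting list is handed to the MySQL client depends on the driver in use (Node's own TLS layer accepts an array of PEM strings for its ca option).

```javascript
const fs = require("fs");

// Container trust store bundle, as described in this article.
const TRUST_STORE_PATH = "/etc/ssl/certs/ca-certificates.crt";

// Build a list of CA certificates from the container trust store and,
// when present, the CA taken from VCAP_SERVICES.
function buildCaList(vcapCa, trustStorePath = TRUST_STORE_PATH) {
  const caList = [];
  // The trust store is refreshed whenever the container is recreated,
  // so it picks up a new CA without rebinding.
  if (fs.existsSync(trustStorePath)) {
    caList.push(fs.readFileSync(trustStorePath, "utf8"));
  }
  // The CA from VCAP_SERVICES only changes after unbind, rebind, and restage.
  if (vcapCa) {
    caList.push(vcapCa);
  }
  return caList;
}
```

With this approach, the application keeps working as long as at least one of the two sources contains the CA that signed the current leaf cert.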


Additionally, there are a few other things to consider with regard to applications losing connectivity after a service instance update. The following applies to all application frameworks and all service types.
 

  • If the application doesn’t use the default trust store, make sure that the specified trust store contains the new /services/tls_ca CA so that the application can trust the connection.

  • If an application doesn’t have any reconnect logic in place or has a bug in the reconnect logic upon any connection errors, then the application may just need to be restarted in order to establish a fresh new connection. 
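As an illustration of reconnect logic, the sketch below retries a connection attempt with exponential backoff. connectWithRetry and its options are hypothetical names for this example. The connect argument stands for whatever function your client library uses to open a fresh connection, so that each attempt performs a new TLS handshake against the current leaf cert.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call connect() until it succeeds or the retries are exhausted.
// connect must open a NEW connection on every call and return a promise.
async function connectWithRetry(connect, { retries = 5, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        // Exponential backoff: baseDelayMs, 2x, 4x, ...
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

Retrying only helps when the failure is transient or tied to a stale connection. If the trust store in use does not contain the new CA, every attempt fails until the trust store is corrected.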


Conclusion

For a successful strategy to update service instances with minimal to no application downtime, consider the following:
 

  • Most applications use the default trust store. If an application specifies its own trust store instead, be sure that the new CA is present in that trust store.

Note: The following applies only to MySQL-bound applications, not to the other service types. If the application is not Java or Spring based and uses the CA cert provided in VCAP_SERVICES, it must be unbound, rebound, and restaged in order to receive the new CA cert. Until this happens, the application will be unable to communicate with the service.

  • If any other service-bound applications have crashed after the service instance updates, try restarting them, because the application may lack logic to reconnect after a connection error.

  • Ensure that Diego has enough free capacity to keep all application instances running with X number of Diego cells missing, where X is the max-in-flight value.