1. CF Async Operation Failures
When a CAPI async operation is requested, the Cloud Controller serializes that request into a job object and places it in the database. The serialized job is then picked up by a Cloud Controller Worker, deserialized, and executed.
An operational optimization in CAPI made slight changes to the structure of this job object. As a result, a job created and serialized by a TAS v2.7 Cloud Controller cannot be deserialized by a TAS v2.6 Cloud Controller Worker.
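To illustrate why this breaks, here is a minimal sketch in Python rather than CAPI's actual Ruby/Delayed::Job code; the class and field names are hypothetical. A worker can only rebuild a serialized job if its own code base still defines the class the job refers to:

import pickle

# Hypothetical v2.7 job class; the real CAPI job classes live under the
# VCAP::CloudController module in Ruby and are not shown here.
class DeleteAppJobV27:
    def __init__(self, app_guid):
        self.app_guid = app_guid

# "Cloud Controller" side: serialize the job and keep the bytes, standing
# in for the row written to the jobs table in the database.
stored_job = pickle.dumps(DeleteAppJobV27("example-app-guid"))

# "v2.6 Worker" side: its code base does not define the new class, which
# we simulate by deleting the definition before deserializing.
del DeleteAppJobV27

try:
    pickle.loads(stored_job)  # the worker attempts to rebuild the job
except AttributeError as err:
    # Analogous to the Delayed::DeserializationError shown later in this
    # section: "undefined class/module VCAP::CloudController..."
    print(f"deserialization failed: {err}")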
During the Apply Changes, the Cloud Controller instance group updates before the Cloud Controller Worker instance group.
The impact can be visualized below:
1. Cloud Controller updates from v2.6 to v2.7
2. An async CAPI operation request results in a job object being serialized by the updated Cloud Controller and placed in the database
3. A v2.6 Cloud Controller Worker picks up the job object but fails to deserialize it, because the worker has not yet been updated with the new job definition
4. Cloud Controller Workers continue to retry the job every 5 minutes until either they are all updated or the job expires after 24 hours (see the sketch after this list)
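The retry window can be sketched as a toy model. The 5-minute interval and 24-hour expiry come from the behaviour described above; the function and parameter names are invented for illustration and are not part of CAPI:

from datetime import datetime, timedelta

RETRY_INTERVAL = timedelta(minutes=5)   # workers retry a failed job every 5 minutes
JOB_EXPIRY = timedelta(hours=24)        # after which the job is abandoned

def job_outcome(job_created_at: datetime, worker_updated_at: datetime) -> str:
    """What happens to one stuck job, given when an updated (v2.7)
    Cloud Controller Worker first becomes available to pick it up."""
    attempt_time = job_created_at
    deadline = job_created_at + JOB_EXPIRY
    while attempt_time < deadline:
        if attempt_time >= worker_updated_at:
            return f"executed at {attempt_time} by an updated worker"
        attempt_time += RETRY_INTERVAL   # a v2.6 worker failed to deserialize it
    return "expired after 24 hours without ever being executed"

print(job_outcome(job_created_at=datetime(2020, 8, 18, 0, 45),
                  worker_updated_at=datetime(2020, 8, 18, 2, 0)))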
Potential Duration of Impact
From when the first Cloud Controller updates from v2.6 to v2.7 until the last Cloud Controller Worker updates from v2.6 to v2.7.
When a Cloud Controller Worker fails to deserialize a job, the error message will look similar to the following:
FAILED permanently with Delayed::DeserializationError: Job failed to load: undefined class/module VCAP::CloudController
2. Application Staging Failures
Prior to TAS v2.7, the file_server job (located in the Diego Brain instance group) only listened on port 8080. A security enhancement for Diego introduced an additional TLS port (8447) for the file_server to listen on, configured with properties similar to the following:
https_server_enabled: true
https_listen_addr: 0.0.0.0:8447
https_url: https://file-server.service.cf.internal:8447
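As an illustrative check only (assuming the internal hostname and ports from the properties above, run from a host with access to the platform's internal network), a simple TCP probe shows which listeners a Diego Brain currently has up during the upgrade window:

import socket

FILE_SERVER_HOST = "file-server.service.cf.internal"  # resolvable only inside the deployment
PORTS = {8080: "original HTTP listener", 8447: "new TLS listener (v2.7+)"}

for port, description in PORTS.items():
    try:
        with socket.create_connection((FILE_SERVER_HOST, port), timeout=3):
            print(f"port {port} ({description}): reachable")
    except OSError as err:
        # A Diego Brain still on v2.6 will refuse connections on 8447.
        print(f"port {port} ({description}): NOT reachable ({err})")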
The Cloud Controller instance group contains a property pointing to the file_server endpoint in the Diego Brain. Once a Cloud Controller updates to v2.7, that Cloud Controller uses the new https file_server endpoint. This impacts the staging process because an un-upgraded Diego Brain is not yet listening on that port, so the file_server cannot be reached.
The error message will look similar to the following:
{"timestamp":"2020-08-18T00:45:03.379028646Z","level":"info","source":"bbs","message":"bbs.request.reject-task.reject-task.reject-task","data":{"guid":"f482284d-bdbe-4b91-8769-143f0667df12","rejection-reason":"failed to download cached artifacts","session":"3009.1.1"}}
Staging can continue to fail until all of the Diego Brains have been updated. During this window, staging is only sporadically successful: a request succeeds if it hits an un-upgraded Cloud Controller or if it reaches a Diego Brain that has already been updated. The staging impact is fully resolved once all Diego Brains have been updated.
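Why staging is only sporadically successful follows from which Cloud Controller and which Diego Brain a given request happens to hit. A toy Python model (the instance counts below are made up) makes the combination explicit: staging fails only when an updated Cloud Controller reaches a not-yet-updated Diego Brain:

import random

upgraded_ccs, total_ccs = 1, 2        # Cloud Controllers already on v2.7 (use the https URL)
upgraded_brains, total_brains = 0, 3  # Diego Brains already on v2.7 (listening on 8447)

def staging_succeeds() -> bool:
    cc_is_v27 = random.randrange(total_ccs) < upgraded_ccs
    brain_is_v27 = random.randrange(total_brains) < upgraded_brains
    # Failure requires a v2.7 Cloud Controller talking to a v2.6 Diego Brain.
    return (not cc_is_v27) or brain_is_v27

trials = [staging_succeeds() for _ in range(10_000)]
print(f"estimated staging success rate during this window: {sum(trials) / len(trials):.0%}")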
The impact can be visualized below:
1. Cloud Controller updates from v2.6 to v2.7
2. A developer pushes an app and the CAPI request (cf push) hits a Cloud Controller that has been updated
3. That Cloud Controller needs to communicate with the file_server job as part of the staging process, but its request is refused by an un-upgraded Diego Brain, which is not yet listening on the new https port
4. The cf push fails
Potential Duration of Impact
From when the first Cloud Controller updates from v2.6 to v2.7 until the last Diego Brain updates from v2.6 to v2.7.