1. CF Async Operation Failures
When a CAPI async operation is requested, the Cloud Controller serializes that request into a job object and places it in the database. The serialized job is then picked up by a Cloud Controller Worker, deserialized, and executed.
An operational optimization in CAPI made slight changes to the structure of this job object. As a result, a job created and serialized by a TAS v2.7 Cloud Controller cannot be deserialized by a TAS v2.6 Cloud Controller Worker.
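To illustrate why this breaks, here is a minimal sketch in Python rather than CAPI's actual Ruby/Delayed::Job code; the class and field names are hypothetical. A worker can only rebuild a serialized job if its own code base still defines the class the job refers to:

import pickle

# Hypothetical v2.7 job class; the real CAPI job classes live under the
# VCAP::CloudController module in Ruby and are not shown here.
class DeleteAppJobV27:
    def __init__(self, app_guid):
        self.app_guid = app_guid

# "Cloud Controller" side: serialize the job and keep the bytes, standing
# in for the row written to the jobs table in the database.
stored_job = pickle.dumps(DeleteAppJobV27("example-app-guid"))

# "v2.6 Worker" side: its code base does not define the new class, which
# we simulate by deleting the definition before deserializing.
del DeleteAppJobV27

try:
    pickle.loads(stored_job)  # the worker attempts to rebuild the job
except AttributeError as err:
    # Analogous to the Delayed::DeserializationError shown later in this
    # section: "undefined class/module VCAP::CloudController..."
    print(f"deserialization failed: {err}")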
During the Apply Changes, the Cloud Controller instance group updates before the Cloud Controller Worker instance group.
The impact can be visualized below:
1. Cloud Controller updates from v2.6 to v2.7
2. An async CAPI operation request results in a job object being serialized by the updated Cloud Controller and placed in the database
3. A v2.6 Cloud Controller Worker picks up the job object but fails to deserialize it, because the worker has not yet been updated with the new job definition
4. Cloud Controller Workers continue to retry the job every 5 minutes until either they are all updated or the job expires after 24 hours (see the sketch after this list)
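The retry window can be sketched as a toy model. The 5-minute interval and 24-hour expiry come from the behaviour described above; the function and parameter names are invented for illustration and are not part of CAPI:

from datetime import datetime, timedelta

RETRY_INTERVAL = timedelta(minutes=5)   # workers retry a failed job every 5 minutes
JOB_EXPIRY = timedelta(hours=24)        # after which the job is abandoned

def job_outcome(job_created_at: datetime, worker_updated_at: datetime) -> str:
    """What happens to one stuck job, given when an updated (v2.7)
    Cloud Controller Worker first becomes available to pick it up."""
    attempt_time = job_created_at
    deadline = job_created_at + JOB_EXPIRY
    while attempt_time < deadline:
        if attempt_time >= worker_updated_at:
            return f"executed at {attempt_time} by an updated worker"
        attempt_time += RETRY_INTERVAL   # a v2.6 worker failed to deserialize it
    return "expired after 24 hours without ever being executed"

print(job_outcome(job_created_at=datetime(2020, 8, 18, 0, 45),
                  worker_updated_at=datetime(2020, 8, 18, 2, 0)))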
Potential Duration of Impact
From when the first Cloud Controller updates from v2.6 to v2.7 until the last Cloud Controller Worker updates from v2.6 to v2.7.
When a Cloud Controller Worker fails to deserialize a job, the error message will look similar to the following:
FAILED permanently with Delayed::DeserializationError: Job failed to load: undefined class/module VCAP::CloudController
2. Application Staging Failures
Prior to TAS v2.7, the file_server job (located in the Diego Brain instance group) only listened on port 8080. A security enhancement for Diego introduced an additional TLS port (8447) for the file_server to listen on, configured with properties similar to the following:
https_server_enabled: true
https_listen_addr: 0.0.0.0:8447
https_url: https://file-server.service.cf.internal:8447
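As an illustrative check only (assuming the internal hostname and ports from the properties above, run from a host with access to the platform's internal network), a simple TCP probe shows which listeners a Diego Brain currently has up during the upgrade window:

import socket

FILE_SERVER_HOST = "file-server.service.cf.internal"  # resolvable only inside the deployment
PORTS = {8080: "original HTTP listener", 8447: "new TLS listener (v2.7+)"}

for port, description in PORTS.items():
    try:
        with socket.create_connection((FILE_SERVER_HOST, port), timeout=3):
            print(f"port {port} ({description}): reachable")
    except OSError as err:
        # A Diego Brain still on v2.6 will refuse connections on 8447.
        print(f"port {port} ({description}): NOT reachable ({err})")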
The Cloud Controller instance group contains a property pointing to the file_server endpoint in the Diego Brain. Once a Cloud Controller updates to v2.7, that Cloud Controller uses the new https file_server endpoint. This impacts the staging process because an un-upgraded Diego Brain is not yet listening on that port, so the file_server cannot be reached.
The error message will look similar to the following:
{"timestamp":"2020-08-18T00:45:03.379028646Z","level":"info","source":"bbs","message":"bbs.request.reject-task.reject-task.reject-task","data":{"guid":"f482284d-bdbe-4b91-8769-143f0667df12","rejection-reason":"failed to download cached artifacts","session":"3009.1.1"}}
Staging can continue to fail until all of the Diego Brains have been updated. During this window, staging is only sporadically successful: a request succeeds if it hits an un-upgraded Cloud Controller or if it reaches a Diego Brain that has already been updated. The staging impact is fully resolved once all Diego Brains have been updated.
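Why staging is only sporadically successful follows from which Cloud Controller and which Diego Brain a given request happens to hit. A toy Python model (the instance counts below are made up) makes the combination explicit: staging fails only when an updated Cloud Controller reaches a not-yet-updated Diego Brain:

import random

upgraded_ccs, total_ccs = 1, 2        # Cloud Controllers already on v2.7 (use the https URL)
upgraded_brains, total_brains = 0, 3  # Diego Brains already on v2.7 (listening on 8447)

def staging_succeeds() -> bool:
    cc_is_v27 = random.randrange(total_ccs) < upgraded_ccs
    brain_is_v27 = random.randrange(total_brains) < upgraded_brains
    # Failure requires a v2.7 Cloud Controller talking to a v2.6 Diego Brain.
    return (not cc_is_v27) or brain_is_v27

trials = [staging_succeeds() for _ in range(10_000)]
print(f"estimated staging success rate during this window: {sum(trials) / len(trials):.0%}")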
The impact can be visualized below:
1. Cloud Controller updates from v2.6 to v2.7
2. A developer pushes an app and the CAPI request (cf push) hits a Cloud Controller that has been updated
3. That Cloud Controller needs to communicate with the file_server job as part of the staging process, but its request is refused by an un-upgraded Diego Brain, which is not yet listening on the new https port
4. The cf push fails
Potential Duration of Impact
From when the first Cloud Controller updates from v2.6 to v2.7 until the last Diego Brain updates from v2.6 to v2.7.