Rolling Deployments Hang in ACTIVE State and App Autoscaler Throws 422 Error

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

During an application deployment using the rolling strategy (cf push --strategy rolling), the deployment hangs in the ACTIVE state and eventually times out (e.g., hitting the 5-minute CF CLI timeout limits).

When this occurs, two distinct errors appear in the foundation logs:

1. Cloud Controller / Autoscaler Error:
The App Autoscaler service fails to bind or update the application, throwing a 422 Unprocessable Entity error:

Server error, status code: 422, error code: 10008, message: The request is semantically invalid: Cannot update app while a deployment is active.

2. Platform Worker Error (cc_deployment_updater.log):
Platform administrators will see the following fatal database error repeating continuously in /var/vcap/sys/log/cloud_controller_clock/cc_deployment_updater.log on the clock_global VM:

"error":"Sequel::Error", "error_message":"Process Guid: [GUID] This dataset does not support window functions"
"backtrace":".../gems/sequel-5.100.0/lib/sequel/dataset/sql.rb:793:in window_sql_append"

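Both symptoms can be confirmed from a shell. The following is a minimal sketch, not an official diagnostic tool: the function names are hypothetical, the log path and error string are taken from this article, and the deployment query assumes the app_guids and status_values filters of the CF v3 /v3/deployments endpoint. Run the log check on the clock_global VM and the deployment check from a workstation with the cf CLI logged in and targeted at the app's space.

```shell
# Hypothetical helper: count log lines that contain the Sequel window-function error.
# Usage: count_window_errors [path-to-log]   (run on the clock_global VM)
count_window_errors() {
  log_file=${1:-/var/vcap/sys/log/cloud_controller_clock/cc_deployment_updater.log}
  # -F matches the text literally; -c prints the number of matching lines.
  grep -cF 'This dataset does not support window functions' "$log_file"
}

# Hypothetical helper: print the ACTIVE deployments of one app as JSON.
# Usage: active_deployments_for_app <app-name>   (needs a logged-in cf CLI)
active_deployments_for_app() {
  app_name=$1
  app_guid=$(cf app "$app_name" --guid) || return 1
  cf curl "/v3/deployments?app_guids=${app_guid}&status_values=ACTIVE"
}
```

A non-zero count from count_window_errors, together with a deployment that stays in the ACTIVE list for longer than a normal rollout, matches the behavior described in this article.
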
Environment

Tanzu Platform for Cloud Foundry (TPCF) / Elastic Application Runtime (EAR) v10.2.x, v10.3.x, and v10.4.x

Cause

This issue occurs due to an incompatibility between the App Autoscaler service and a constraint introduced in Cloud Controller API (CAPI) version 10.2.x.

In CAPI 10.2.x, a new safeguard was added that intentionally prevents external scaling operations while a rolling deployment is active. If a scale call is made during this time, CAPI rejects it with a 422 Unprocessable Entity ("deployment in flight") response code.

However, the App Autoscaler currently has no awareness of rolling deployment states and lacks specific handling for this 422 response. When CAPI rejects the scale request, the autoscaler's error handler maps the 422 code to a generic Failure error. Because this generic failure does not suppress retries, the autoscaler enters an infinite loop of repeated scale attempts.

This continuous barrage of rejected scaling requests overloads and stalls CAPI's rolling deployment state machine. Consequently, the platform is unable to finalize the deployment, leaving it hanging in the ACTIVE state until it eventually times out.
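
The missing handling can be illustrated with a small sketch. This is not the App Autoscaler's code; it is a hypothetical classifier, written in shell only to match the commands in this article, that shows the distinction the autoscaler currently lacks: a 422 carrying the "deployment is active" message is deferred instead of being retried. The function name and the ok, defer, retry, and fail labels are assumptions made for illustration.

```shell
# Hypothetical sketch: decide how a scaling client should treat a CAPI response.
# Arguments: <http-status> <response-body>
# Prints one of: ok, defer, retry, fail
classify_scale_response() {
  status=$1
  body=$2
  case "$status" in
    2??) echo ok ;;
    422)
      case "$body" in
        # CAPI refuses to scale while a rolling deployment is in flight:
        # wait for the deployment to finish instead of retrying.
        *"Cannot update app while a deployment is active"*) echo defer ;;
        *) echo fail ;;
      esac
      ;;
    # Server-side errors are the only case treated as retryable here.
    5??) echo retry ;;
    *) echo fail ;;
  esac
}
```

With the 422 message shown earlier in this article, the sketch prints defer, so a client built this way would skip the scaling action until the deployment finishes.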

Resolution

Temporary Workaround (Clear the Deadlock):

If you are unable to upgrade immediately, the Platform Administration team must clear the deadlocked cc_deployment_updater process to unfreeze deployments.

Option 1 (Preferred): Recreate the clock_global VM
Perform a BOSH recreate of the clock_global VM (or the control VM in small-footprint architectures) to completely rebuild the server state and establish fresh database connections:

bosh -d <cf-deployment-name> recreate clock_global/0

Option 2: Restart the Process Manually
SSH into the clock_global VM via BOSH, switch to the root user, and restart the hung cc_deployment_updater job with monit:

monit restart cc_deployment_updater

Once the process is healthy again, it works through the backlog of stuck deployments. Autoscaler bindings and rolling deployments then succeed, but because this is only a workaround, the deadlock can be triggered again.
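
After either option, the state of the job can be checked with monit. The following is a minimal sketch with hypothetical function names. It assumes it is run as root on the clock_global VM and that the monit summary line for a healthy process contains the word "running".

```shell
# Hypothetical helper: print the monit summary line for cc_deployment_updater.
# Must run as root on the clock_global VM, where monit is installed.
deployment_updater_status() {
  monit summary | grep -F "cc_deployment_updater"
}

# Hypothetical helper: succeed only if that line contains the word "running".
deployment_updater_is_running() {
  deployment_updater_status | grep -qw "running"
}
```

For example, deployment_updater_is_running && echo "cc_deployment_updater is up" prints the message only when monit reports the process as running. The log file named earlier in this article can then be watched to confirm that the window-function error has stopped repeating.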

Permanent Fix (Upgrade TPCF):
A permanent fix for the autoscaler retry loop and the resulting deployment deadlock will be included in patch versions released after 10.2.11, 10.3.7, and 10.4.1.

Additional Information

During a rolling deployment, the Autoscaler events show the error "Unable to scale due to Cloud Controller error", which should not happen. The Autoscaler should be aware of active deployments, rolling or canary, and avoid scaling actions while a deployment is active.
Broadcom Reference KB: The generic symptom of deployments hanging and eventually dropping into a canceling state due to clock_global overloads is documented in CF app deployment hangs in Active CANCELING status (Article 396187).