Control VMs become unresponsive or services flap on EART deployed VMs

Article ID: 434314


Updated On:

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

When using the Elastic Application Runtime (EART) tile in a Small Footprint or standard deployment, you may observe the following symptoms:

  • Control VMs show a status of unresponsive agent, down, or failing when checked via the BOSH CLI (see the example check after this list).
  • Monit services such as bbs, tps-watcher, locket, and auctioneer frequently flap between running and failing states.
    • Checking the logs for these services, you will see context timeouts or connection errors to the database.
  • Log examples from the Control VMs:

    • /var/vcap/sys/log/locket/locket.stdout.log:

      {"timestamp":"2026-01-25T02:45:52.327639900Z","level":"error","source":"locket","message":"locket.lock.failed-locking-lock","data":{"error":"context canceled","key":"routing_api_lock","owner":"########-####-####-####-67f034e993eb","request-uuid":"########-####-####-####-444f55c7c6cf","session":"######"}}

    • /var/vcap/sys/log/tps/watcher.stdout.log:

      {"timestamp":"2026-01-25T02:47:06.263184188Z","level":"error","source":"tps-watcher","message":"tps-watcher.locket-lock.lost-lock","data":{"duration":1110226274,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"tps_watcher","owner":"tps-watcher-########-####-####-####-67f034e993eb","type":"lock","type_code":1},"request-uuid":"########-####-####-####-f0c9f127297a","session":"1","ttl_in_seconds":15}}

    • /var/vcap/sys/log/auctioneer/auctioneer.stdout.log:

      {"timestamp":"2026-01-25T02:51:23.112453971Z","level":"error","source":"auctioneer","message":"auctioneer.locket-lock.lost-lock","data":{"duration":1014657756,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"auctioneer","owner":"########-####-####-####-67f034e993eb","type":"lock","type_code":1},"request-uuid":"########-####-####-####-013e10ea10a4","session":"2","ttl_in_seconds":15}}

    • /var/vcap/sys/log/bbs/bbs.stdout.log:

      {"timestamp":"2026-01-25T02:15:22.292830197Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock","data":{"duration":2004939430,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"bbs","owner":"########-####-####-####-67f034e993eb","type":"lock","type_code":1},"request-uuid":"########-####-####-####-2a33af8c13f0","session":"5","ttl_in_seconds":15}}

  • Recreating the VMs using bosh recreate does not permanently resolve the issue.
  • MySQL VM slow query logs show high Lock_time values (greater than 15 seconds) or InnoDB_rec_lock_wait values exceeding 15–20 seconds for the locket schema.
    • These can be found on the Database VM in the /var/vcap/sys/log/pxc-mysql/mysql_slow_query log.
  • In cases of significant resource limitation, you might see all BOSH-managed VMs report an unresponsive agent status. This indicates CPU or Memory starvation in the vSphere Resource Pool the VMs are deployed under.
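
A quick way to confirm which instances and processes are affected is the BOSH CLI itself. The commands below are a minimal sketch; the deployment name is a placeholder and the exact output columns vary by BOSH CLI version:

  # List VM health and vitals, including unresponsive agent or failing states
  bosh -d <deployment-name> vms --vitals

  # List per-process (monit) state to spot flapping jobs such as bbs, locket, and auctioneer
  bosh -d <deployment-name> instances --ps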

Environment

This problem is independent of version. The symptoms are caused by environmental factors.

Cause

The instability is caused by MySQL row lock contention within the locket database schema.

Key processes like the BBS use Locket to maintain leadership in an active-standby High Availability (HA) model. In EART environments, resource constraints (such as CPU and Memory limits at the vSphere Resource Pool level, or storage latency) can cause MySQL queries to take longer than 15 seconds to complete (a way to observe this contention directly on the database is sketched after the list below).

When these queries exceed the timeout threshold:

  1. The active process believes it has lost its lock.
  2. Standby processes attempt to promote themselves as the new leader.
  3. This creates a cascading effect of lock expiry, cleanup, and increased contention, leading to "flapping" services and unresponsive VMs.
  4. In CPU or Memory limited environments, this failure might impact the bosh-agent on the managed VMs, leading to unresponsive agent status in worst case scenarios.
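
To observe the contention directly on the MySQL VM, current row lock waits can be inspected from the database itself. This is a minimal sketch; it assumes the mysql client and admin credentials are available on the Database VM and that the sys schema is present (MySQL/Percona 5.7 or later):

  # Transactions currently waiting on row locks, with wait age in seconds
  mysql -e "SELECT wait_started, wait_age_secs, locked_table, blocking_pid FROM sys.innodb_lock_waits;"

  # Long-running transactions that may be holding locks in the locket schema
  mysql -e "SELECT trx_id, trx_started, trx_rows_locked, trx_query FROM information_schema.innodb_trx ORDER BY trx_started;"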

Resolution

To resolve the service instability, you must remove the resource bottlenecks impacting the MySQL VMs.

1. Identify and Remove Resource Pool Limits

Investigate the vSphere Resource Pools (RP) where the Tanzu Platform VMs are deployed. Ensure that there are no "Limit" configurations set for CPU or Memory that might be throttling the VMs during high-load operations like tile updates.
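
If the govc CLI is available, resource pool limits can also be checked from a workstation. This is a hedged sketch; the inventory path and connection environment variables are placeholders for your environment:

  # Requires GOVC_URL / GOVC_USERNAME / GOVC_PASSWORD to be set for your vCenter
  # Shows CPU and Memory reservation, limit, and shares for the resource pool
  govc pool.info /<datacenter>/host/<cluster>/Resources/<resource-pool-name>

A CPU or Memory Limit of -1 means unlimited; any other value caps what the VMs in the pool can consume during bursts.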

2. Increase Resource Allocation

If the environment is under-provisioned, increase the CPU and Memory allocation for the MySQL VM. Ensure the underlying hardware can sustain the required IOPS and CPU cycles without significant latency.
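
Before and after resizing, it can help to confirm how loaded the database VM actually is. A minimal sketch, assuming the MySQL instance group is named mysql (adjust to match your bosh instances output):

  # CPU, memory, and disk vitals for the database instance group
  bosh -d <deployment-name> vms --vitals | grep mysql

  # Live CPU pressure on the VM itself; a high "st" (steal) column indicates hypervisor-level contention
  bosh -d <deployment-name> ssh mysql/0 -c "vmstat 5 3"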

3. Recreate Unresponsive VMs

This step is only required if resource starvation has left VMs with an unresponsive agent status.

Once the resource limits are removed, if the VMs do not automatically return to a healthy status, it may be necessary to recreate them to re-establish bosh-agent connectivity to the BOSH Director. Use the BOSH CLI, per deployment, to recreate any VMs that are reporting as unresponsive:

  1. Gather the current deployment manifest:

     bosh -d <deployment-name> manifest > /tmp/<deployment-name>_manifest.yml

  2. Redeploy with the --fix flag:

     bosh -d <deployment-name> deploy /tmp/<deployment-name>_manifest.yml --fix
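
Once the deploy with the --fix flag completes, you can confirm that the previously unresponsive instances are healthy again (same placeholder deployment name as above):

  # All VMs should report a running state with no unresponsive agent entries
  bosh -d <deployment-name> vms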

4. Verification

Monitor the mysql_slow_query logs and ensure that Query_time and Lock_time for the locket schema have returned to normal levels (typically sub-second).
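
A rough way to spot any remaining slow locket queries is to search the slow query log for its timing header lines. This is a hedged sketch; the exact log file name and header format depend on the Percona/PXC version in use:

  # On the Database VM: show the most recent Query_time / Lock_time header lines
  grep -E "^# Query_time" /var/vcap/sys/log/pxc-mysql/mysql_slow_query* | tail -20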

Additional Information

Note: similar issues can occur in MySQL clustered environments.