Stale route in Gorouter due to Diego Cell drain timeout

Article ID: 426429

Products

VMware Tanzu Application Service

Issue/Introduction

If an application instance cannot be replaced while a Diego cell is draining, its route is never unregistered from the Gorouter.
The Gorouter removes the stale route only later, either through route integrity pruning or when the Gorouter is restarted and rebuilds its routing table.
During the drain, replacement instances cannot be started because no capacity is available.
Once the drain timeout is reached, the app instances are stopped and shut down, but no route unregistration message appears to be emitted for the Gorouter to consume.

Environment

This has been observed in TAS Elastic Application Runtime 6.0.x environments, but may be seen in 10.x environments as well.

Cause

Any delay or latency in blobstore access, or resource contention on Diego cells, can prevent app instances from starting on a new cell within the default 10-minute timeout for Diego drain operations, and so cause a Diego cell drain timeout. When this happens, app instances still running on the draining Diego cell are forcefully killed. Additionally, the Rep service on the Diego cell is killed before it can send a route unregister event to route-emitter and the Gorouter.

This forced termination of app instances leads to reactive pruning on the Gorouter. When the Gorouter handles a client request for one of the killed instances, it observes a connection failure, immediately purges that route from memory, and sends the request to a healthy instance. This might lead to isolated 502 error responses to the client during a Diego cell drain timeout.
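The reactive pruning behavior described above can be illustrated with a toy model. This is a minimal sketch, not Gorouter code: the `RoutingTable` class, the `make_connect` helper, the route name, and the addresses are all hypothetical, and the sketch always tries the first endpoint, whereas a real router balances load across endpoints.

```python
class RoutingTable:
    """Toy in-memory routing table: route -> list of backend addresses."""

    def __init__(self):
        self.endpoints = {}

    def register(self, route, address):
        self.endpoints.setdefault(route, []).append(address)

    def handle_request(self, route, connect, max_attempts=3):
        """Try backends in order; prune any backend whose connection fails."""
        attempts = 0
        while attempts < max_attempts and self.endpoints.get(route):
            address = self.endpoints[route][0]
            attempts += 1
            try:
                return connect(address)
            except ConnectionError:
                # Reactive pruning: drop the dead endpoint immediately,
                # then retry the request against the next endpoint.
                self.endpoints[route].remove(address)
        # No reachable backend is left (or attempts are exhausted).
        return "502 Bad Gateway"


def make_connect(dead_addresses):
    """Hypothetical connector that fails for addresses of killed instances."""
    def connect(address):
        if address in dead_addresses:
            raise ConnectionError(address)
        return f"200 OK from {address}"
    return connect


table = RoutingTable()
table.register("app.example.com", "10.0.1.5:61001")  # instance on the drained cell
table.register("app.example.com", "10.0.2.7:61002")  # healthy instance
connect = make_connect({"10.0.1.5:61001"})

response = table.handle_request("app.example.com", connect)
print(response)
print(table.endpoints)
```

In this model the stale endpoint is removed on the first failed connection and the request is retried against the healthy instance. If the stale endpoint were the only one registered, the request would end in a 502 response.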

This behavior can leave stale routes in the Gorouter, which should be pruned automatically at regular intervals. Using TLS for application consistency in TAS Elastic Application Runtime will help prevent misrouting in these scenarios.
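Interval-based pruning of stale routes can be sketched in the same spirit. This is a simplified model under stated assumptions, not the Gorouter implementation: the `RouteRegistry` class is hypothetical, and the 120-second stale threshold and 20-second re-registration interval are illustrative values that may not match the configuration of a given environment.

```python
class RouteRegistry:
    """Toy registry that forgets endpoints which stop re-registering."""

    def __init__(self, stale_threshold):
        self.stale_threshold = stale_threshold  # seconds (assumed value)
        self.last_seen = {}  # (route, address) -> time of last registration

    def register(self, route, address, now):
        self.last_seen[(route, address)] = now

    def prune_stale(self, now):
        """Remove and return endpoints not refreshed within the threshold."""
        stale = [key for key, seen in self.last_seen.items()
                 if now - seen > self.stale_threshold]
        for key in stale:
            del self.last_seen[key]
        return stale


registry = RouteRegistry(stale_threshold=120)

# The instance on the drained cell registers once, then its Rep is killed
# and no unregister message is ever sent.
registry.register("app.example.com", "10.0.1.5:61001", now=0)

# The healthy instance keeps re-registering every 20 seconds.
for t in range(0, 121, 20):
    registry.register("app.example.com", "10.0.2.7:61002", now=t)

pruned = registry.prune_stale(now=130)
print(pruned)
print(sorted(registry.last_seen))
```

Until the prune runs, the dead endpoint remains in the table, which is the window in which a stale route can receive traffic.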

Resolution

Changes to how the Rep process is handled during Diego cell drain timeouts will be applied in a future release of TAS/Elastic Application Runtime. With these changes, Rep will clean up stale routes before it is terminated, which should help eliminate stale routes when Diego cell drain operations reach the timeout for any reason.

Please subscribe to this KB for updates on release versions for this fix.

Workaround if not on a fixed version. Try these steps in order:

1. Increase Diego cell resources to ensure app instances can fail over onto secondary Diego cells during drain operations.
2. Resolve any Blobstore latency that might prevent new app instances from being deployed.
3. If the first two options are not possible, increase the Diego drain connection timeout (App graceful shutdown period) to more than 10 minutes.
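Whether the remaining cells have enough headroom for a drain can be estimated with a rough calculation like the one below. This is only a sketch under assumptions: the `can_absorb_drain` function, the cell names, and the memory figures are hypothetical, and the check considers memory alone. It uses a greedy placement (largest instance onto the cell with the most free memory), so it can report `False` in cases where a tighter packing exists, and it does not model how Diego actually places instances.

```python
def can_absorb_drain(cells, draining):
    """Estimate whether instances on `draining` fit in other cells' free memory."""
    needed = sorted(cells[draining]["instances_mb"], reverse=True)
    free = [cell["capacity_mb"] - sum(cell["instances_mb"])
            for name, cell in cells.items() if name != draining]
    for size in needed:
        free.sort(reverse=True)  # cell with the most free memory first
        if not free or free[0] < size:
            return False
        free[0] -= size
    return True


# Hypothetical foundation with three 16 GB cells; cell_a is being drained.
cells = {
    "cell_a": {"capacity_mb": 16384, "instances_mb": [4096, 4096, 2048]},
    "cell_b": {"capacity_mb": 16384, "instances_mb": [8192, 4096]},
    "cell_c": {"capacity_mb": 16384, "instances_mb": [8192, 2048, 2048]},
}
before = can_absorb_drain(cells, "cell_a")

# Adding an empty cell provides the missing headroom.
cells["cell_d"] = {"capacity_mb": 16384, "instances_mb": []}
after = can_absorb_drain(cells, "cell_a")

print(before, after)
```

With these example numbers the two remaining cells have 4096 MB free each, which is not enough for all three instances from the draining cell, so the drain would be at risk of timing out until capacity is added.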
