This has been observed in TAS Elastic Application Runtime 6.0.x environments, but may also be seen in 10.x environments.
Any delay or latency in blobstore access, or resource contention on Diego cells, that prevents app instances from starting on a new cell within the default 10-minute timeout for Diego drain operations can cause a Diego cell drain timeout. When this happens, app instances still running on the draining Diego cell are forcefully killed. Additionally, the Rep service on the Diego cell is killed before it can send a route unregister event to route-emitter and the Gorouter.
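The drain-deadline behavior described above can be sketched roughly as follows. This is an illustration only, with hypothetical function and parameter names; it is not Diego's actual Rep implementation:

```python
# Illustrative sketch of a drain deadline. Instances that cannot be
# evacuated (restarted on another cell) before the deadline expires are
# left running and will be force-killed.

def drain(instances, try_evacuate, timeout_s, clock):
    """Return the instances still running when the drain deadline hits."""
    deadline = clock() + timeout_s
    remaining = list(instances)
    while remaining and clock() < deadline:
        if try_evacuate(remaining[0]):
            remaining.pop(0)  # instance successfully moved to another cell
        # otherwise: keep retrying until the deadline passes
    # Anything left here is force-killed, and no route unregister is sent.
    return remaining
```

In this sketch, slow evacuation (for example, due to blobstore latency) simply means `try_evacuate` keeps failing until the clock passes the deadline, at which point the remaining instances are returned for forced termination.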
This forced termination of app instances triggers reactive pruning on the Gorouter: when the Gorouter observes a connection failure while handling a client request to one of these instances, it immediately purges that route from memory and sends the request to a healthy instance. This can result in isolated 502 error responses to clients during a Diego cell drain timeout.
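Reactive pruning can be sketched as follows. This is a simplified illustration of the idea, not the actual Gorouter code:

```python
# Hedged sketch of "reactive pruning": on a connection failure, the
# router drops the failing endpoint from its in-memory route pool and
# retries the next endpoint for the same route.

def handle_request(route_pool, send):
    """Try each endpoint; prune on connection failure; 502 if all fail."""
    for endpoint in list(route_pool):
        try:
            return send(endpoint)
        except ConnectionError:
            route_pool.remove(endpoint)  # reactive prune of the stale endpoint
    # No endpoint answered: the client sees a 502.
    return 502
```

The isolated 502s described above correspond to the case where a request happens to hit a killed instance before any healthy endpoint responds.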
This behavior can leave stale routes in the Gorouter until they are pruned automatically at the regular pruning interval. Enabling TLS connections between the Gorouter and application instances (route integrity) in TAS Elastic Application Runtime will help prevent misrouting in these scenarios.
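The way TLS prevents misrouting through a stale route can be sketched as follows. This is an illustrative model, with hypothetical names, of verifying the backend's identity before forwarding, not the actual Gorouter implementation:

```python
# Hedged sketch: before forwarding, verify that the identity presented
# by the backend over TLS matches the app instance the route expects.
# A mismatch means the cell IP/port was reused by a different app, so
# the stale mapping is pruned instead of misrouting the request.

def forward(route_pool, expected_ids, handshake, send):
    for endpoint in list(route_pool):
        if handshake(endpoint) != expected_ids[endpoint]:
            route_pool.remove(endpoint)  # stale mapping: do not misroute
            continue
        return send(endpoint)
    return 502
```

Without the identity check, a request to a stale endpoint could silently reach whatever new instance now listens on that address and port.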
Changes to Rep process handling during Diego cell drain timeout operations will be included in a future release of TAS/Elastic Application Runtime. With these changes, Rep will clean up stale routes before it terminates, which should eliminate stale routes when Diego cell drain operations reach the timeout for any reason.
Please subscribe to this KB article for updates on the release versions that will include this fix.