CVE-2023-20882 - routing-release 0.262.0 prunes all routes to an application, resulting in 503 errors



Article ID: 298140



Products

VMware Tanzu Application Service for VMs

Issue/Introduction

A TAS environment experiences many 503 errors with x_cf_routererror:"no_endpoints", even though all of the apps appear to be up and functional, due to CVE-2023-20882. There is an entry in the route table for the desired route, but there are no healthy endpoints available.

This is caused by a change introduced in routing-release 0.262.0 to enable Gorouter to retry more types of idempotent requests to failed backends.

Under specific circumstances, when Gorouter queries a backend app and the client connection is closed prematurely, the router erroneously prunes the route to the app:
  • Gorouter becomes unable to talk to an app backend, fails, and prunes the backend from the routing pool.
  • Gorouter retries this up to 2 more times on different backends, suffering the same problem.
  • Gorouter returns a 499 to the client.
  • In total, 3 backends are pruned for a single request. If there are no remaining backends, subsequent requests return a 503 until route-emitter refreshes the routing data for Gorouter.
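As a rough, supplementary gauge of client-visible impact, you can count 499 responses in the Gorouter access log. This is only a sketch: the log path and the sample lines below are synthetic and illustrative, not from a real environment; point the grep at the access.log in your own log bundle.

```shell
# Count 499 responses as a rough measure of how many client requests
# were cut short. The sample lines below are SYNTHETIC, for illustration;
# replace /tmp/access.log with your real Gorouter access log.
cat > /tmp/access.log <<'EOF'
app.example.com - [2023-05-04T19:38:42.8Z] "GET /health HTTP/1.1" 499 0 0
app.example.com - [2023-05-04T19:38:43.1Z] "GET /health HTTP/1.1" 200 0 12
EOF
grep -c '" 499 ' /tmp/access.log
```

A 499 alone does not prove this bug occurred; use the log checks below to confirm the pruning behavior.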

Affected TAS versions are 2.11.37, 2.13.19, and 3.0.9.
Affected IST versions are 2.11.31, 2.13.16, and 3.0.9.
 

How to detect if your app has experienced this bug

The following command can be run on the Gorouter log files to check for possible occurrences. Look for entries where data.error has a value of "context canceled", followed by a prune-failed-endpoint error.
find . -name "gorouter.stdout.log*" | while read -r line; do grep backend-endpoint-failed "$line" | jq -r '. | select(.data.error | contains("context canceled")) | .data.vcap_request_id'; done
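If jq is not available on the machine holding the logs, a grep-only approximation of the same check counts backend-endpoint-failed lines whose error mentions "context canceled", per file. The log lines below are synthetic samples for illustration only; run the find against your actual log bundle directory.

```shell
# grep-only approximation of the jq check above: per log file, count
# backend-endpoint-failed entries whose error mentions "context canceled".
# The sample log below is SYNTHETIC, for illustration only.
mkdir -p /tmp/gorouter-logs
cat > /tmp/gorouter-logs/gorouter.stdout.log <<'EOF'
{"message":"backend-endpoint-failed","data":{"error":"incomplete request (context canceled)","vcap_request_id":"27116dd3-f047-4a35-7873-e9ef7e1d3f71"}}
{"message":"backend-endpoint-failed","data":{"error":"dial tcp 10.0.0.5:8080: i/o timeout","vcap_request_id":"0f0e0d0c"}}
EOF
find /tmp/gorouter-logs -name "gorouter.stdout.log*" | while read -r f; do
  echo "$f: $(grep backend-endpoint-failed "$f" | grep -c 'context canceled')"
done
```

This is cruder than the jq filter (it matches the whole line rather than the data.error field), so treat any hits as candidates to confirm with the steps below.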

Here is an example from a Gorouter log bundle collected from Ops Manager. First, we find the vcap_request_id of a failed request that meets the criteria, using the above command:
find . -name "gorouter.stdout.log*" | while read -r line; do grep backend-endpoint-failed "$line" | jq -r '. | select(.data.error | contains("context canceled")) | .data.vcap_request_id'; done | head -1

27116dd3-f047-4a35-7873-e9ef7e1d3f71
Then we find the log line that contains the application ID:
find . -name "gorouter.stdout.log*" | xargs egrep -Hn 27116dd3-f047-4a35-7873-e9ef7e1d3f71

./router.XXXXXXXX-XXXX-XXXX-XXXX-543579d74ed0.2023-05-05-18-05-52/gorouter/gorouter.stdout.log:192:{"log_level":3,"timestamp":"2023-05-04T19:38:42.838473790Z","message":"backend-endpoint-failed","source":"vcap.gorouter","data":{"route-endpoint":{"ApplicationId":"d45e4b57-3420-40b3-b13d-9ef0562d58c5",REDACTED,"RouteServiceUrl":""},"error":"incomplete request (context canceled)","attempt":1,"vcap_request_id":"27116dd3-f047-4a35-7873-e9ef7e1d3f71","retriable":true,"num-endpoints":1,"got-connection":false,"wrote-headers":false,"conn-reused":false,"dns-lookup-time":0,"dial-time":0,"tls-handshake-time":0}}

Finally, verify the endpoint was pruned as a result of this fault:
egrep -A5 -Hn 27116dd3-f047-4a35-7873-e9ef7e1d3f71 ./router.XXXXXXXX-XXXX-XXXX-XXXX-543579d74ed0.2023-05-05-18-05-52/gorouter/gorouter.stdout.log | egrep "prune-failed-endpoint|d45e4b57-3420-40b3-b13d-9ef0562d58c5" | egrep prune-failed-endpoint

./router.XXXXXXXX-XXXX-XXXX-XXXX-543579d74ed0.2023-05-05-18-05-52/gorouter/gorouter.stdout.log-193-{"log_level":3,"timestamp":"2023-05-04T19:38:42.838565797Z","message":"prune-failed-endpoint","source":"vcap.gorouter.registry","data":{"route-endpoint":{"ApplicationId":"d45e4b57-3420-40b3-b13d-9ef0562d58c5",REDACTED,"process_instance_id":"2ea1596c-a745-4fdc-53a4-d885","process_type":"web","source_id":"d45e4b57-3420-40b3-b13d-9ef0562d58c5",REDACTED,"RouteServiceUrl":""}}}
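To gauge how widespread the pruning was, you can also tally prune-failed-endpoint events per ApplicationId across a log file. This is a sketch against the same JSON log format shown above; the two sample lines are synthetic, and in practice you would grep your real gorouter.stdout.log files.

```shell
# Count prune-failed-endpoint events per ApplicationId. Assumes the same
# JSON log format shown above; the sample lines are SYNTHETIC.
cat > /tmp/gorouter.stdout.log <<'EOF'
{"message":"prune-failed-endpoint","data":{"route-endpoint":{"ApplicationId":"d45e4b57-3420-40b3-b13d-9ef0562d58c5"}}}
{"message":"prune-failed-endpoint","data":{"route-endpoint":{"ApplicationId":"d45e4b57-3420-40b3-b13d-9ef0562d58c5"}}}
EOF
grep prune-failed-endpoint /tmp/gorouter.stdout.log \
  | grep -o '"ApplicationId":"[^"]*"' | sort | uniq -c
```

Apps with counts at or above their instance count are the ones most likely to have returned 503s until route-emitter repopulated the routing table.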






Environment

Product Version: 3.0

Resolution

Customers hitting this bug can minimize its impact by increasing the number of application instances, so that a single bad request cannot prune every backend. However, depending on the frequency of these errors, scaling the instance count may not help.

To eliminate the bug, you must move off of routing-release 0.262.0.

Option 1) Wait for the new routing-release 0.266.0 to be incorporated into the latest TAS releases. These new releases are expected in the mid-May 2023 time frame.

Option 2) Manually update the routing-release, either to 0.261.0 or 0.266.0, using the procedure in this article: https://knowledge.broadcom.com/external/article?articleNumber=293785