This article will cover the concept of "Misrouting" within the PAS/TAS for VMs platform. First, we'll briefly cover what this situation is, the symptoms you'll see, and their causes.
After this, we'll cover a few things that you can check as an operator to avoid, detect, and possibly fix this situation. If you have suspicions that you are experiencing misrouting or stale routes, please follow the checklist section of this KB.
Our documentation has a good explanation of what Misrouting means in the context of PCF/TAS for VMs. From our Preventing Misrouting documentation:
As TAS for VMs manages and balances apps, the internal IP address and ports for app instances change. To keep the Gorouter’s routing tables current, a Route-Emitter on each Diego cell sends a periodic update to all Gorouters through NATS to remind them of the location of all app instances. The default frequency of these updates is 20 seconds. The Gorouter tracks a time-to-live (TTL) for each route to back end mapping. This TTL defaults to 120 seconds and is reset when the Gorouter receives an updated registration message.
Network partitions or NATS failures can cause the Gorouter’s routing table to fall out of sync, as TAS for VMs continues to re-create containers across hosts to keep apps running. This can lead to routing of requests to incorrect destinations.
unregister-route message, causing it to no longer be fresh and therefore "Stale". The "Stale Route" will continue to send traffic for that route to the wrong backend, causing unintended HTTP responses.
If you believe you are facing Misrouting issues in your foundation, please open a case with Broadcom Support and provide the following in the case attachments:
Checklist:
1. Verify that the foundation does not have SSL Certificate Validation Disabled:
Per the Configure Networking Docs (See Step 15):
If you are not using SSL encryption or if you are using self-signed certificates, select the Disable SSL certificate verification for this environment checkbox. Selecting this checkbox also disables SSL verification for route services and disables mutual TLS app identity verification.
Foundations will need SSL Certificate Validation enabled if the Operator desires their routes to remain as fresh as possible, so this option should be unchecked. This option exists for historical reasons only and should no longer be used. In the past, this was a way to work around the problems associated with using self-signed certificates. This feature is not meant to be used with Secure & Production foundations.
In recent versions of Ops Manager, this is better handled by checking the "Include OpsManager Root CA in Trusted Certs" box under Ops Manager -> Bosh -> Security, which will ensure that certs generated through Ops Manager are trusted. When using a non-Ops Manager certificate authority, you may simply include the root and optionally intermediate CA certificates directly into Ops Manager -> Bosh -> Security -> Trusted Certificates. This will have the same effect.
If this checkbox cannot be unchecked on the foundation, then the foundation will need to rely on TTL to prune routes. TTL (Time to Live) gives each route-to-backend mapping a lifetime during which it remains active until it is pruned at the end of its TTL period. While this does guarantee that routes are routinely pruned, they are not pruned immediately on a backend failure as they would be with TLS, which means a route may keep sending traffic to the wrong backend until it is pruned. Because of this, and other TTL shortcomings, TTL is unlikely to be a viable option for Business-Critical foundations. To read more on TTL, you can visit here.
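To illustrate why TTL-only pruning leaves a window for misrouting, here is a minimal Python sketch of a TTL-based route table. It is not Gorouter's implementation; the `RouteTable` class, the hostname, and the backend address are hypothetical, and only the 120-second default TTL comes from the documentation quoted earlier.

```python
from dataclasses import dataclass, field

DEFAULT_TTL_SECONDS = 120  # route TTL quoted in the documentation above


@dataclass
class RouteTable:
    """Toy route table that prunes backends purely by TTL (illustrative only)."""
    ttl: float = DEFAULT_TTL_SECONDS
    # (route, backend address) -> time of the last registration message
    last_seen: dict = field(default_factory=dict)

    def register(self, route: str, backend: str, now: float) -> None:
        # Every registration message resets the TTL for that mapping.
        self.last_seen[(route, backend)] = now

    def prune(self, now: float) -> None:
        # Drop mappings whose TTL expired without a fresh registration.
        self.last_seen = {
            key: seen for key, seen in self.last_seen.items()
            if now - seen < self.ttl
        }

    def backends(self, route: str, now: float) -> list:
        self.prune(now)
        return sorted(b for (r, b) in self.last_seen if r == route)


table = RouteTable()
table.register("app.example.com", "10.0.0.5:61001", now=0)
# Suppose the container is destroyed at t=10 and no further registrations arrive.
print(table.backends("app.example.com", now=60))   # mapping is still routable
print(table.backends("app.example.com", now=120))  # TTL has expired, mapping is pruned
```

In this sketch the destroyed backend stays in the table from t=10 until the TTL expires, which is the period during which traffic could reach the wrong destination.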
2. Verify the foundation is using TLS for Application Identity:
With TLS enabled for Application Instance Identity, the foundation gains the following benefits:
When the foundation is using this Consistency Mode, it gains those major benefits. Enabling this feature is the best way to help combat Misrouting. More info on this in our documentation.
3. Monitor that the Cloud Controller(s) and the Diego System are in Sync:
Monitoring this KPI to indicate if the cf-apps domain is up-to-date is the best way to do this. It’s a firehose metric that can be picked up by many different products and tools. Basically, if this metric ≠ 1, then the domain is out of sync and misrouting, among many other problems, may occur.
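As a rough illustration of the "metric ≠ 1" rule, the following Python sketch filters a series of metric samples for out-of-sync readings. The function name and the `(timestamp, value)` sample format are assumptions made for this example; how the samples are collected from the firehose depends on your monitoring tool.

```python
def out_of_sync_samples(samples):
    """Return the (timestamp, value) samples where the freshness metric is not 1."""
    return [(ts, value) for ts, value in samples if value != 1]


# Hypothetical samples of the cf-apps domain freshness metric.
samples = [("12:00:00", 1), ("12:00:30", 1), ("12:01:00", 0), ("12:01:30", 1)]
for ts, value in out_of_sync_samples(samples):
    print(f"{ts}: cf-apps domain metric = {value}, Cloud Controller and Diego may be out of sync")
```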
4. Check for known major Gorouter Bugs:
There is only one known incident of this, but it is very well documented. The Knowledge Base Article has all the details and workarounds, but here’s the gist:
If an environment is running the following versions of TAS, then it is affected and should be upgraded ASAP:
If an environment is running PCF/PAS/TAS for VMs 2.7.* or greater, then it is safe from this bug.
In the Gorouter’s logs, located at /var/vcap/sys/log/gorouter/gorouter.stdout.log, if you start to see instances of prune-failed-endpoint, that may be an indication of Misrouting. Example:
{"log_level":3,"timestamp":"2020-06-30T03:51:59.729421504Z","message":"prune-failed-endpoint","source":"vcap.gorouter.registry","data":{"route-endpoint":{"ApplicationId":"###-###-####-####-########","Addr":"172.##.##.##:#####","Tags":{"component":"route-emitter"},"RouteServiceUrl":""}}}
If this is the case, verify that the Gorouters and other platform VMs are healthy via their Key Performance Indicators (KPIs) as a starting point.
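To get a quick view of how often and for which backends this message appears, a small script can tally the entries. The sketch below is one possible approach in Python, assuming each log line is a JSON object shaped like the example above; the function name and the sample values are hypothetical.

```python
import json
from collections import Counter


def count_pruned_endpoints(lines):
    """Count prune-failed-endpoint log entries per backend address."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict) or entry.get("message") != "prune-failed-endpoint":
            continue
        endpoint = entry.get("data", {}).get("route-endpoint", {})
        counts[endpoint.get("Addr", "unknown")] += 1
    return counts


# Hypothetical log lines; in practice, pass an open file handle for gorouter.stdout.log.
sample = [
    '{"log_level":3,"message":"prune-failed-endpoint","source":"vcap.gorouter.registry",'
    '"data":{"route-endpoint":{"ApplicationId":"app-guid-1","Addr":"10.0.0.5:61001"}}}',
    '{"log_level":1,"message":"some-other-event","data":{}}',
]
print(count_pruned_endpoints(sample))
```

A backend address that shows up repeatedly is a reasonable place to start when checking the health of the corresponding Diego cell.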
2. Grab the Routing Tables from multiple Gorouters and compare them to verify routes are consistent between them:
route-emitter is sending a bad route and would point to a Diego issue.
3. To detect if there are stale routes on Gorouter, please follow these steps:
If the above curl/jq script returns [], there are no stale routes.
If it returns json as seen in the example below, and a single address is associated with multiple private_instance_id, there are stale routes.
[
  [
    { "address": "10.###.###.###:####", "private_instance_id": "########-####-####-####-####" },
    { "address": "10.###.###.###:#####", "private_instance_id": "########-####-####-####-####" }
  ],
  [
    { "address": "10.###.###.###:#####", "private_instance_id": "########-####-####-####-####" },
    { "address": "10.###.###.###:#####", "private_instance_id": "########-####-####-####-####" }
  ]
  ...
]
To temporarily work around the issue, running sudo monit restart gorouter on each Gorouter VM can clean up the stale routes on that Gorouter.
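The stale-route rule described above (one address associated with multiple private_instance_id values) can also be expressed in a few lines of Python. This is a sketch and not the curl/jq script referenced by this KB. It assumes the routing table has already been retrieved and parsed into a dictionary that maps each route to a list of endpoint objects carrying "address" and "private_instance_id" fields; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def find_stale_candidates(routing_table):
    """Group endpoints by address and keep addresses that map to more than
    one distinct private_instance_id, using the output shape shown above."""
    ids_by_address = defaultdict(set)
    for endpoints in routing_table.values():
        for endpoint in endpoints:
            address = endpoint.get("address")
            instance_id = endpoint.get("private_instance_id")
            if address and instance_id:
                ids_by_address[address].add(instance_id)
    return [
        [{"address": address, "private_instance_id": i} for i in sorted(ids)]
        for address, ids in sorted(ids_by_address.items())
        if len(ids) > 1
    ]


# Hypothetical routing table: two routes share one address with different instance IDs.
routing_table = {
    "app-a.example.com": [{"address": "10.0.0.1:61001", "private_instance_id": "id-1"}],
    "app-b.example.com": [{"address": "10.0.0.1:61001", "private_instance_id": "id-2"}],
}
print(find_stale_candidates(routing_table))
```

An empty list means no address is shared by more than one instance ID in the supplied table. Running the check against the table from each Gorouter separately helps show whether the problem is limited to a single router.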
4. Check out our documentation on Diagnosing Stale Routes in Gorouter:
This documentation is newer, but it’s a great resource. It covers quite a bit, including the following: