[APP/PROC/WEB/0] OUT time="2019-05-07T19:56:20Z" level=error msg="[Cloud Controller - app summary] CloudController - route not found"
The Gorouter has stale routes in the route table which are pointing to the incorrect or old applications. This causes an abundance of 404 and/or 502 errors among other issues.
A "stale route" points app traffic to a Diego Cell and port where the application is no longer running. The Gorouter missed the message that should have unregistered the route, so the route is still serving traffic. These stale routes generate the "404 Not Found" errors. Attempting to remove the stale routes in the "Routes" section of Apps Manager or on the command line (CLI) with 'cf delete-route' fails. If you curl the routes, the bad route still appears.
To determine if you have stale routes:
1. Get your app’s instance address, `cf ssh` into app container, run `ip a`. If you have multiple app instances you will need to run `cf ssh <app-name> -i <n>`, with n being the app instance number.
2. `bosh ssh` into any Diego Cell and run `cfdot` with the address in this command:
$ cfdot actual-lrp-groups | jq '.instance | select(.instance_address =="10.255.13.90") | {"process_guid": .process_guid, "index": .index, "instance_guid": .instance_guid, "cell_id": .cell_id, "address": .address, "instance_address": .instance_address, "ports": .ports, "state": .state}'
3. Grab a routing table from any router: https://github.com/cloudfoundry/gorouter#the-routing-table
bosh ssh router/<guid>
sudo -i
/var/vcap/jobs/gorouter/bin/retrieve-local-routes
4. If the address and host_tls_proxy_port (listed under container_port: 8080) in your cfdot output do not match what is in the routing table, you have a stale route.
/var/vcap/jobs/gorouter/bin/retrieve-local-routes | jq . | grep -A 10 "appname"
"appname.apps-dev.gp2.company.net": [
  {
    "address": "192.0.2.79:61062",
    "tls": true,
    "ttl": 120,
    "tags": {
      "component": "route-emitter"
    },
    "private_instance_id": "13b52272-7081-46f8-589f-f33d",
    "server_cert_domain_san": "13b52272-7081-46f8-589f-f33d"
  }
],
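The comparison in step 4 can be scripted once both outputs are saved as JSON. The following is a minimal sketch, not a supported tool. It assumes the cfdot records carry the address and ports fields selected in step 2, that each ports entry contains container_port and host_tls_proxy_port, and that the routing table has the shape shown above. The function names and the sample LRP values are hypothetical.

```python
def expected_endpoints(lrps, container_port=8080):
    """Build the set of "ip:port" endpoints Diego reports for the app.

    Assumes each LRP record has "address" (the Diego Cell IP) and a "ports"
    list whose entries contain "container_port" and "host_tls_proxy_port".
    """
    endpoints = set()
    for lrp in lrps:
        for port in lrp.get("ports", []):
            if port.get("container_port") == container_port:
                endpoints.add(f'{lrp["address"]}:{port["host_tls_proxy_port"]}')
    return endpoints


def find_stale_routes(routing_table, hostname, lrps):
    """Return routing-table addresses for hostname that Diego does not report."""
    expected = expected_endpoints(lrps)
    return [
        entry["address"]
        for entry in routing_table.get(hostname, [])
        if entry["address"] not in expected
    ]


# Routing-table entry taken from the sample output above.
routing_table = {
    "appname.apps-dev.gp2.company.net": [
        {"address": "192.0.2.79:61062", "tls": True, "ttl": 120},
    ]
}

# Hypothetical cfdot record: the instance now runs on a different cell and port.
lrps = [
    {
        "address": "192.0.2.80",
        "instance_address": "10.255.13.90",
        "ports": [{"container_port": 8080, "host_tls_proxy_port": 61010}],
    }
]

print(find_stale_routes(routing_table, "appname.apps-dev.gp2.company.net", lrps))
```

Any address printed by the script exists in the Gorouter routing table but is not backed by a running instance, which is the definition of a stale route used in this article.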
The setting which causes this behavior is "Router uses TLS to verify application identity (Default)". This known issue causes this feature to malfunction.
This occurs because the Gorouter misses an unregister message from NATS. Unregister messages are sent only once, and core NATS messaging provides at-most-once delivery with no redelivery, so messages can be lost. When the Gorouter misses this message, the platform considers the route orphaned, but the route is still serving traffic and is therefore "stale".
Routes are only pruned on TLS negotiation failures when TLS from the Gorouter to application instances is enabled. Refer to the following documentation for more information: https://docs.cloudfoundry.org/concepts/http-routing.html#tls-to-back-end.
See Preventing Misrouting for additional information on this intended functionality.
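To illustrate pruning on TLS negotiation failure, the sketch below models the idea rather than the Gorouter's actual implementation. It assumes that each routing-table entry records the identity the backend is expected to present (the server_cert_domain_san field in the sample routing table), and it replaces the real TLS handshake with a hypothetical lookup function, presented_san_for.

```python
def route_request(endpoints, presented_san_for):
    """Pick the first backend whose presented identity matches the route entry.

    endpoints: list of dicts with "address" and "server_cert_domain_san".
    presented_san_for: hypothetical stand-in for the TLS handshake; it maps an
    address to the identity the backend at that address actually presents.
    Entries that fail the check are pruned from endpoints.
    """
    pruned = []
    for entry in list(endpoints):
        if presented_san_for(entry["address"]) == entry["server_cert_domain_san"]:
            return entry["address"], pruned
        # Identity mismatch: another app instance now owns this address.
        endpoints.remove(entry)
        pruned.append(entry["address"])
    return None, pruned
```

In this model, a stale entry is removed the first time a request reaches it and the identity check fails, and the request is then tried against the next entry.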
As mentioned in the Summary, this issue is now permanently fixed in the following TAS releases:
IMPORTANT NOTE: If you previously upgraded TAS for VMs to one of the "Temporary Fix" releases, uncheck the checkbox located at "Ops Manager > PAS Tile > Settings > Application Containers > Prune Routes on TTL Expiry for TLS Backends" before upgrading to the TAS for VMs versions that contain the permanent fix. If you do not, you remain at risk of some other outage causing routes to be incorrectly marked as expired. See the "Temporary Fix" section for more details on this risk.
Prior to the permanent fix, a temporary fix became available in the TAS for VMs 2.3.14, 2.4.10, 2.5.6 and 2.6.1 releases.
A checkbox "PAS > Settings > Application Containers > Prune Routes on TTL Expiry for TLS Backends" was added to offer similar AppID functionality while the permanent fix was being worked on.
This fix enforces a Time To Live (TTL) on all routes. If the Gorouter misses an unregistration message and fails to unregister a route, eventually the TTL will expire for the route and the Gorouter will prune the old route. This is not an ideal permanent fix because of the risk of TTLs. For example, if NATS stops sending registration messages, then the TTL on all routes will expire and the Gorouter will prune all routes. The Gorouter does check for NATS healthiness before pruning as a precaution, but there is still some risk with using TTLs. However, this low risk is worth it compared to the security vulnerability of misrouting.
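The TTL behavior described above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the routing-release code: the RouteTable class and its method names are hypothetical, the 120-second default mirrors the ttl value in the sample routing table, and NATS health is passed in as a plain boolean.

```python
class RouteTable:
    """Toy route table that prunes endpoints whose registration has expired."""

    def __init__(self, ttl_seconds=120):
        self.ttl = ttl_seconds
        self.last_seen = {}  # endpoint -> time of the last register message

    def register(self, endpoint, now):
        # Each periodic register message refreshes the endpoint's TTL.
        self.last_seen[endpoint] = now

    def unregister(self, endpoint):
        # Normal path: the unregister message arrives and the route is removed.
        self.last_seen.pop(endpoint, None)

    def prune(self, now, nats_healthy):
        # Precaution: if NATS looks unhealthy, missing registrations do not
        # prove the routes are gone, so nothing is pruned.
        if not nats_healthy:
            return []
        expired = [e for e, seen in self.last_seen.items() if now - seen > self.ttl]
        for endpoint in expired:
            del self.last_seen[endpoint]
        return expired


table = RouteTable()
table.register("192.0.2.79:61062", now=0)   # unregister message is later lost
table.register("192.0.2.80:61010", now=0)
table.register("192.0.2.80:61010", now=100)  # live instance keeps re-registering
print(table.prune(now=130, nats_healthy=True))
```

The sketch also shows the risk: if registration messages stop arriving while the health check still reports NATS as healthy, every entry eventually exceeds the TTL and is pruned.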
To apply the Temporary fix, you would need to upgrade to any of the following TAS for VMs releases, check the added checkbox, then Apply Changes:
Unfortunately, at some point, the temporary TTL fix was removed from the underlying routing-release, but the checkbox was not removed from the tile.
In the following versions, the tile looks like you can configure the temporary TTL fix, but in reality, nothing is being configured:
Temporary Workaround (if you're tight on time)
As a temporary workaround (if an immediate upgrade is not possible), please restart the Gorouter(s) to remove the stale routes.
This lets the Gorouter clear the route table and repopulate it with the correct routes sent over by CAPI. To gracefully drain and restart each node while maintaining router availability, use the command 'bosh restart router'. Restarting the Gorouter(s) any other way may cause traffic to drop.