Tanzu Application Service for VMs and Misrouting

Article ID: 297513

Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This article will cover the concept of "Misrouting" within the PAS/TAS for VMs platform. First, we'll briefly cover what this situation is, the symptoms you'll see, and their causes.

After this, we'll cover a few things that you can check as an operator to avoid, detect, and possibly fix this situation. If you have suspicions that you are experiencing misrouting or stale routes, please follow the checklist section of this KB.

Our documentation explains what Misrouting means in the context of PCF/TAS for VMs. From the Preventing Misrouting documentation:

As TAS for VMs manages and balances apps, the internal IP address and ports for app instances change. To keep the Gorouter’s routing tables current, a Route-Emitter on each Diego cell sends a periodic update to all Gorouters through NATS to remind them of the location of all app instances. The default frequency of these updates is 20 seconds. The Gorouter tracks a time-to-live (TTL) for each route to back end mapping. This TTL defaults to 120 seconds and is reset when the Gorouter receives an updated registration message.

Network partitions or NATS failures can cause the Gorouter’s routing table to fall out of sync, as TAS for VMs continues to re-create containers across hosts to keep apps running. This can lead to routing of requests to incorrect destinations.
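The TTL behaviour described above can be pictured with a small model. This is only a sketch of the idea, not the Gorouter's implementation: the class name `RoutingTable` and its methods are hypothetical, and only the 120-second TTL and 20-second update interval come from the quoted documentation. Time is passed in explicitly so the behaviour is easy to follow.

```python
ROUTE_TTL_SECONDS = 120     # default TTL quoted above
EMIT_INTERVAL_SECONDS = 20  # default Route-Emitter update interval quoted above


class RoutingTable:
    """Toy model of a route -> backend table with a TTL per mapping."""

    def __init__(self, ttl=ROUTE_TTL_SECONDS):
        self.ttl = ttl
        self._expires = {}  # (route, backend) -> expiry time in seconds

    def register(self, route, backend, now):
        # Every registration message (re)starts the TTL for this mapping.
        self._expires[(route, backend)] = now + self.ttl

    def unregister(self, route, backend):
        # If this message is lost, the mapping lingers until prune() removes it.
        self._expires.pop((route, backend), None)

    def prune(self, now):
        # Drop every mapping whose TTL has elapsed and return what was removed.
        expired = [key for key, expiry in self._expires.items() if expiry <= now]
        for key in expired:
            del self._expires[key]
        return expired

    def backends(self, route):
        return sorted(b for (r, b) in self._expires if r == route)
```

In this model, a mapping registered at t=0 and refreshed at t=20 survives `prune(now=130)` and is removed by `prune(now=140)`. If the backend moved at t=21 and its unregister message was lost, the table would keep pointing at the old address for roughly two more minutes.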



Symptoms:
When a request is routed to an incorrect destination, this is known as Misrouting. Its most common symptom is a “Stale Route”: a route whose unregister-route message never reached the Gorouter, so the entry is no longer fresh and is therefore “Stale”. The Gorouter continues to send traffic for a Stale Route to the wrong backend, which produces unintended HTTP responses.

Resolution

If you believe you are facing Misrouting issues in your foundation, please open a case with Broadcom Support and provide the following in the case attachments:
 

  1. TAS Runtime Version
  2. Gorouter Logs from your TAS foundation for all Gorouter VMs
  3. The Routing Table for all Gorouter VMs (instructions)
  4. Whether SSL Certificate Validation is enabled (Yes / No)
  5. Historical graph of the CC & Diego Sync KPI



Checklist:

Preventing Misrouting in PCF/TAS4VMs

 1. Verify that the foundation does not have SSL Certificate Validation Disabled:

Per the Configure Networking Docs (See Step 15):

If you are not using SSL encryption or if you are using self-signed certificates, select the Disable SSL certificate verification for this environment checkbox. Selecting this checkbox also disables SSL verification for route services and disables mutual TLS app identity verification.

Foundations need SSL Certificate Validation enabled if the Operator wants routes to remain as fresh as possible, so this option should be unchecked. The option exists for historical reasons only and should no longer be used. In the past, it was a way to work around the problems associated with using self-signed certificates. It is not meant to be used with Secure & Production foundations.

In recent versions of Ops Manager, this is better handled by checking the "Include OpsManager Root CA in Trusted Certs" box under Ops Manager -> Bosh -> Security, which will ensure that certs generated through Ops Manager are trusted. When using a non-Ops Manager certificate authority, you may simply include the root and optionally intermediate CA certificates directly into Ops Manager -> Bosh -> Security -> Trusted Certificates. This will have the same effect.

If SSL Certificate Validation cannot be enabled on the foundation (that is, the checkbox must stay selected), then the foundation will need to use TTL to prune routes. TTL (Time to Live) gives each route-to-backend mapping a lifetime during which it stays active; the mapping is pruned at the end of its TTL period. While this does guarantee that routes are routinely pruned, they are not pruned immediately on a backend failure as they would be with TLS, so a route may send traffic to the wrong backend for a period of time until it is pruned. Because of this and other TTL shortcomings, TTL is unlikely to be a viable option for Business-Critical foundations. To read more on TTL, you can visit here.

2. Verify the foundation is using TLS for Application Identity:

With TLS enabled for Application Instance Identity, the foundation gains the following benefits:
 

  • Improved availability for apps by keeping routes in the Gorouter’s routing table when TTL expires
  • Increased guarantees against misrouting by validating the identity of back ends before forwarding requests
  • Increased security by encrypting data in flight from the Gorouter to back ends

Enabling this Consistency Mode is the best way to help combat Misrouting. More information is available in our documentation.
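The second benefit, validating a backend's identity before forwarding, can be illustrated with a simplified model. This sketch performs no real TLS and is not how the Gorouter is implemented. The function `choose_backend` and the `presented_identity` callable are hypothetical stand-ins for "the instance identity the container at this address proves during the handshake".

```python
def choose_backend(endpoints, presented_identity):
    """Return (endpoint, pruned): the first endpoint whose identity matches.

    endpoints: list of dicts with "address" and "private_instance_id",
        as recorded in the routing table. Mismatching entries are removed.
    presented_identity: hypothetical callable, address -> instance id proven
        by whatever answers at that address (None if nothing answers).
    """
    pruned = []
    for endpoint in list(endpoints):  # iterate over a copy while removing
        actual = presented_identity(endpoint["address"])
        if actual == endpoint["private_instance_id"]:
            return endpoint, pruned
        # Identity mismatch: the address now belongs to something else, so
        # drop the entry right away instead of forwarding the request to it.
        endpoints.remove(endpoint)
        pruned.append(endpoint)
    return None, pruned
```

The point of the model is the ordering: the identity check happens before the request is sent, so a stale entry is removed on first contact rather than when its TTL runs out.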

3. Monitor that the Cloud Controller(s) and the Diego System are in Sync:

The best way to do this is to monitor this KPI, which indicates whether the cf-apps domain is up-to-date. It’s a firehose metric that can be picked up by many different products and tools. If this metric ≠ 1, the domain is out of sync and misrouting, among many other problems, may occur.
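As a rough illustration of the "≠ 1" rule, the helper below scans a series of metric samples that your monitoring tool has already collected. The function names and the `(timestamp, value)` sample shape are assumptions made for this sketch. Treating "no samples" as out of sync is also an assumption, not something stated above.

```python
def out_of_sync_samples(samples):
    """Return the (timestamp, value) samples where the sync metric was not 1."""
    return [(ts, value) for ts, value in samples if value != 1]


def currently_in_sync(samples):
    """True only when the most recent sample equals 1.

    Assumption: an empty series is reported as out of sync, since a missing
    metric cannot confirm that the domain is up-to-date.
    """
    if not samples:
        return False
    _, latest = samples[-1]
    return latest == 1
```

Running `out_of_sync_samples` over the historical data is one way to produce the timeline that Support asks for in the case attachments.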

4. Check for known major Gorouter Bugs:

Only one such bug is known, and it is very well documented. The Knowledge Base Article has all the details and workarounds, but here’s the gist:

If an environment is running the following versions of TAS, then it is affected and should be upgraded ASAP:

  • 2.3.0 - 2.3.18
  • 2.4.0 - 2.4.13
  • 2.5.0 - 2.5.9
  • 2.6.0 - 2.6.4

If an environment is running PCF/PAS/TAS for VMs 2.7.* or greater, then it is safe from this bug.
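The version ranges above are easy to check mechanically. The following is a minimal sketch with hypothetical helper names. It only answers whether a version falls inside one of the four listed ranges, and it assumes version strings of the form `major.minor.patch` with an optional `-suffix`.

```python
# Inclusive (low, high) ranges copied from the list above.
AFFECTED_RANGES = [
    ((2, 3, 0), (2, 3, 18)),
    ((2, 4, 0), (2, 4, 13)),
    ((2, 5, 0), (2, 5, 9)),
    ((2, 6, 0), (2, 6, 4)),
]


def parse_version(text):
    # Keep the numeric core, e.g. "2.4.7-build.3" -> (2, 4, 7).
    core = text.split("-", 1)[0].split("+", 1)[0]
    return tuple(int(part) for part in core.split(".")[:3])


def is_affected(version_text):
    version = parse_version(version_text)
    return any(low <= version <= high for low, high in AFFECTED_RANGES)
```

For example, `is_affected("2.4.13")` is true, while `is_affected("2.4.14")` and `is_affected("2.7.0")` are false.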


Diagnosing Misrouting in PCF/TAS4VMs 

1. Identify potential issues by checking the Gorouter logs for failed route pruning messages:

In the Gorouter’s logs, located at /var/vcap/sys/log/gorouter/gorouter.stdout.log, if you start to see instances of prune-failed-endpoint, that may be an indication of Misrouting. Example:

{"log_level":3,"timestamp":"2020-06-30T03:51:59.729421504Z","message":"prune-failed-endpoint","source":"vcap.gorouter.registry","data":{"route-endpoint":{"ApplicationId":"###-###-####-####-########","Addr":"172.##.##.##:#####","Tags":{"component":"route-emitter"},"RouteServiceUrl":""}}}

If this is the case, verify that the Gorouters and other platform VMs are healthy via their Key Performance Indicators (KPIs) as a starting point.
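To get a quick count of these messages per backend, the log can be scanned with a few lines of Python. This is a sketch that relies only on the field names visible in the example entry above (`message`, `data`, `route-endpoint`, `Addr`). The function name is hypothetical, and lines that are not JSON are skipped.

```python
import json
from collections import Counter


def count_prune_failures(lines):
    """Count prune-failed-endpoint log entries per backend address."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("message") != "prune-failed-endpoint":
            continue
        endpoint = entry.get("data", {}).get("route-endpoint", {})
        counts[endpoint.get("Addr", "unknown")] += 1
    return counts
```

On a Gorouter VM this could be called as `count_prune_failures(open("/var/vcap/sys/log/gorouter/gorouter.stdout.log"))`. An address that shows up repeatedly is a good starting point when comparing routing tables.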

2. Grab the Routing Tables from multiple Gorouters and compare them to verify routes are consistent between them:

  • If multiple Gorouters have the same bad route (meaning the route is pointing to the wrong backend), it’s possible that the Diego Cell’s route-emitter is sending a bad route, which would point to a Diego issue.
  • If just one Gorouter has the bad route, it’s more likely to be an issue with just that router, and not the system as a whole.
  • If you believe that you are experiencing a routing issue, please capture the routing tables from all your Gorouter VMs and include that information with your Support Ticket.
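Comparing the captured tables by hand is tedious, so here is a minimal sketch of one way to automate it. It assumes each table has been saved as the JSON returned by the Gorouter's `/routes` endpoint (route name mapped to a list of endpoint objects with `address` and `private_instance_id`). The function names are hypothetical.

```python
def backend_set(table):
    """Flatten a routing table into a set of (route, address, instance id)."""
    return {
        (route, ep["address"], ep.get("private_instance_id") or "")
        for route, endpoints in table.items()
        for ep in endpoints
    }


def compare_tables(tables):
    """tables: dict of Gorouter name -> parsed routing table.

    Returns, per Gorouter, the entries that are not present on every Gorouter.
    """
    sets = {name: backend_set(table) for name, table in tables.items()}
    if not sets:
        return {}
    common = set.intersection(*sets.values())
    return {
        name: sorted(entries - common)
        for name, entries in sets.items()
        if entries - common
    }
```

An entry reported for a single Gorouter matches the "just one Gorouter" case above. A bad entry shared by every Gorouter will not appear in this diff and has to be found another way. Tables captured at slightly different moments can also differ legitimately, so treat the output as a list of candidates rather than proof.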

3. To detect whether there are stale routes on a Gorouter, follow these steps:

  • `bosh ssh` into any Gorouter VM
  • Find the Gorouter status user and password with `head /var/vcap/jobs/gorouter/config/gorouter.yml`
  • Run `curl -s http://router_status:<PASSWORD>@localhost:8080/routes | jq '[to_entries[].value[] | {address: .address, private_instance_id: .private_instance_id}] | unique | group_by(.address) | map(select(length>1))'`

If the above curl/jq script returns [], there are no stale routes.

If it returns JSON like the example below, in which a single address is associated with multiple private_instance_id values, there are stale routes.

[
  [
    {
      "address": "10.###.###.###:####",
      "private_instance_id": "########-####-####-####-####"
    },
    {
      "address": "10.###.###.###:#####",
      "private_instance_id": "########-####-####-####-####"
    }
  ],
  [
    {
      "address": "10.###.###.###:#####",
      "private_instance_id": "########-####-####-####-####"
    },
    {
      "address": "10.###.###.###:#####",
      "private_instance_id": "########-####-####-####-####"
    }
  ]
  ...
]
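For readers who prefer to run the same check on a saved routing table, the jq filter above can be approximated in Python. This is a sketch with a hypothetical function name. It expects the parsed JSON from the `/routes` endpoint and reports every address that is associated with more than one private_instance_id.

```python
from collections import defaultdict


def find_stale_candidates(routes):
    """routes: parsed /routes JSON, route name -> list of endpoint dicts.

    Returns {address: [instance ids]} for addresses with more than one id.
    """
    ids_by_address = defaultdict(set)
    for endpoints in routes.values():
        for ep in endpoints:
            ids_by_address[ep["address"]].add(ep.get("private_instance_id"))
    return {
        address: sorted(ids, key=str)
        for address, ids in ids_by_address.items()
        if len(ids) > 1
    }
```

An empty dictionary corresponds to the `[]` result of the jq script. Any key in the result is an address worth investigating.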
 

To temporarily work around the issue, running `sudo monit restart gorouter` on all Gorouter VMs can clean up stale routes on the respective Gorouter.
 

4. Check out our documentation on Diagnosing Stale Routes in Gorouter:

This documentation is newer, but it is a great resource. It covers quite a bit, including the following:

  • Causes of Stale Routes
  • How to Locate Stale Routes
  • How to fix Stale Routes