The issue is described in TAS and Isolation Segment release notes:
Note: This version of TAS for VMs contains a known issue with Gorouter error handling for backend app requests. Failures that previously returned HTTP Status Codes 496, 499, 503, 525, or 526 may instead return 502. Additionally, stale routes may fail to be pruned properly, which could result in apps unexpectedly returning HTTP Status Code 502.
The most likely impact on production systems is an increase in 502s in the logs. Most of these 502s are other error codes now being reported as 502, which is an annoyance rather than an outage. There is an edge case, however, in which the number of stale ('bad') routes grows because they are not pruned. If many routes are left unpruned, end users may see 502s: Gorouter retries other backends according to its configuration settings, and once it runs out of retries, it returns the 502 to the user. The mitigation for unpruned routes is a rolling restart of the Gorouters.
This is caused by the Golang upgrade from v1.19 to v1.20. As a result, Gorouter now returns only HTTP 502 when backend apps have certificate problems. Previously, it returned different status codes, set the X-CF-Router-Error header with useful information about the type of certificate problem encountered, and cleaned up stale routes accordingly.
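To see what a given app request actually returns, the response status and the X-CF-Router-Error header can be inspected directly. The following is a minimal sketch, not an official diagnostic: the function name router_error_summary is hypothetical, and it only parses a raw header dump read from standard input.

```shell
# Hypothetical helper: reads raw HTTP response headers on stdin and prints
# the status code plus the X-CF-Router-Error value ("none" if the header is absent).
router_error_summary() {
  awk '
    BEGIN { status = "unknown"; err = "none" }
    # Status line, e.g. "HTTP/1.1 502 Bad Gateway"
    /^HTTP\// { status = $2 }
    # Header names are case-insensitive, so compare in lower case
    tolower($1) == "x-cf-router-error:" { err = $2 }
    END {
      # Strip carriage returns left over from CRLF line endings
      gsub(/\r/, "", status); gsub(/\r/, "", err)
      print "status=" status " router_error=" err
    }
  '
}
```

One way to feed it is curl -sk -o /dev/null -D - https://APP-URL | router_error_summary, where APP-URL is a placeholder for a route served by the affected Gorouter. On an impacted version, a 502 without the header detail that was previously present would be consistent with the behavior described above.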
Impacted TAS and Isolation Segment releases as below:
Engineering has released routing-release v0.266.0 to address the issue, but it will take time for the next TAS and Isolation Segment minor versions to include the fixed routing-release.
Workaround until a fix release is available
1. SSH into the Ops Manager VM. For more information, refer to Logging Into Ops Manager VMs with SSH.
2. Download the patched routing 0.266.0 release to the Ops Manager VM:
sudo -u tempest-web wget -P /var/tempest/releases/ https://github.com/cloudfoundry/routing-release/releases/download/v0.266.0/routing-0.266.0.tgz
3. Find the file paths of the YAML files that define all the versions of the TAS tile in your library. You want the .yml file paths returned by the following command:
(for TAS)
sudo grep -l "^name: cf" /var/tempest/workspaces/default/metadata/*
(for Isolation Segment)
sudo grep -l "^name: p-isolation-segment" /var/tempest/workspaces/default/metadata/*
4. Confirm the tile version you’re using by running the following command on each full file path. If the previous command returned more than one file, run it on each to identify the version that you currently have deployed; that is the file you’ll edit in the next steps.
sudo head FULL-FILE-PATH
5. Make a backup of this YAML file in your home directory. You can restore this backup over the file you’re about to edit if you need to revert the workaround later.
sudo cp FULL-FILE-PATH ~ubuntu/
6. Edit the YAML file with sudo and your editor of choice (such as emacs, vi, or nano), and make the following changes to the routing release entry:
Before the change:
- name: routing
  version: 0.259.0
  file: routing-0.259.0-ubuntu-xenial-621.448.tgz
  exported_from:
  - os: ubuntu-xenial
    version: '621.448'
After the change:
- name: routing
  version: 0.266.0
  file: routing-0.266.0.tgz
7. Apply Changes to the modified tile.
Note: The route_registrar job depends on routing-release, so the deployment will trigger an update of all instances that run the route_registrar job. Diego cells won’t be updated.
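Before applying changes in step 7, the results of steps 2 and 6 can be sanity-checked from the Ops Manager VM. The following is a minimal sketch under stated assumptions, not part of the official workaround: the function name check_routing_patch is hypothetical, and it assumes the routing entry in the metadata file contains a line of the form "file: routing-...", as in the before and after snippets in step 6.

```shell
# Hypothetical pre-flight check. Arguments: the edited metadata file and
# the releases directory (/var/tempest/releases in step 2).
check_routing_patch() {
  metadata_file=$1
  releases_dir=$2
  tarball=routing-0.266.0.tgz

  # The release tarball downloaded in step 2 must be present and non-empty.
  if [ ! -s "$releases_dir/$tarball" ]; then
    echo "missing or empty: $releases_dir/$tarball"
    return 1
  fi

  # The edited routing entry must point at the new tarball.
  if ! grep -qF "file: $tarball" "$metadata_file"; then
    echo "metadata does not reference $tarball"
    return 1
  fi

  # No other routing tarball (for example the old compiled release)
  # should still be referenced.
  if grep -F "file: routing-" "$metadata_file" | grep -vqF "file: $tarball"; then
    echo "metadata still references another routing tarball"
    return 1
  fi

  echo "ok"
}
```

A call might look like check_routing_patch FULL-FILE-PATH /var/tempest/releases. Because the metadata files may need elevated permissions to read, one option is to define and run the function from a root shell. Comparing the backup from step 5 with the edited file using diff is another way to confirm that only the routing entry changed.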