Understanding CDN health checks of Gorouter during recreates

Article ID: 298220

Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

When a CDN checks the health of the Gorouters inside TAS, the polling frequency of the health endpoint determines whether the CDN can detect that a router is down for a restart, recreate, or similar operation. If the CDN polls too infrequently, it may not identify the outage at all.

A content delivery network (CDN) is a geographically distributed group of servers that work together to deliver internet content quickly. CDNs such as Akamai, Cloudflare, and others speed up the transfer of assets needed to load content, including HTML pages, JavaScript files, stylesheets, images, and videos. These CDNs make frequent health checks against endpoints to determine whether a route can accept traffic. Depending on the frequency of these health checks, a CDN may not be aware that a router or path is in an unhealthy state.


Symptoms:
When a Gorouter is upgraded, restarted, or recreated, the health endpoint on that router goes offline for 20 seconds during the recreate. During this period it does not pass health checks. If your CDN performs these health checks but polls too infrequently, it will likely not detect the unhealthy router. In this situation, you will observe on your load balancer that traffic directed at the Gorouter is still forwarded, where it ultimately fails with a message like the one below, or a similar error and HTTP response code:
connection reset by peer

Environment

Product Version: 2.10

Cause

The connection reset by peer error is expected behavior while the router is offline. While the router is restarting, it returns Service Unavailable responses to load balancer health checks for 20 seconds while the routing table is preloaded. This value can be modified, but it defaults to 20 seconds. If your CDN polls at intervals longer than 20 seconds, it will likely not recognize that a route is down or unavailable. As a result, the CDN may not reroute valid traffic to a healthy endpoint.
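The timing problem can be illustrated with a small calculation. The sketch below is not part of TAS or any CDN product. It assumes a simplified model in which probes fire at a fixed interval, the unavailable window starts at time 0 and lasts 20 seconds, and the first probe lands at some offset after the window opens. The function names are hypothetical.

```python
import math


def probes_in_window(interval_s: int, window_s: int, first_probe_offset_s: int) -> int:
    """Count probes that land inside [0, window_s).

    Probes fire at first_probe_offset_s, first_probe_offset_s + interval_s, ...
    This is a simplified model (assumption): no jitter, no probe timeouts.
    """
    if interval_s <= 0 or window_s <= 0:
        raise ValueError("interval_s and window_s must be positive")
    if not 0 <= first_probe_offset_s < interval_s:
        raise ValueError("offset must satisfy 0 <= offset < interval_s")
    if first_probe_offset_s >= window_s:
        # The first probe already arrives after the router has recovered.
        return 0
    # Probes at offset + k * interval < window, for k = 0, 1, 2, ...
    return math.ceil((window_s - first_probe_offset_s) / interval_s)


def worst_case_probes(interval_s: int, window_s: int) -> int:
    """Fewest probes that can land in the window, over whole-second offsets."""
    return min(
        probes_in_window(interval_s, window_s, offset)
        for offset in range(interval_s)
    )


if __name__ == "__main__":
    window = 20  # default unavailable window from this article, in seconds
    for interval in (30, 20, 15, 10, 5):
        print(f"interval={interval}s worst-case probes in window: "
              f"{worst_case_probes(interval, window)}")
```

Under this model, a 30-second interval can miss the window entirely, which matches the behavior described above.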

Resolution

The solution is to poll the Gorouter's health endpoint (http://GOROUTER_IP:8080/health) more frequently, so that the polling interval better matches the time the router returns Service Unavailable. Because most health checks require two failures before marking an endpoint as down, the interval should be short enough to allow two failed checks within that window, but not so short that health check traffic saturates the network.
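As a rough guide, the longest interval that still guarantees the required number of failed checks can be estimated from the window length. The sketch below assumes the 20-second default window, a two-failure threshold, and evenly spaced probes with no jitter or probe timeout. The function names are hypothetical, and the result should be treated as an upper bound rather than a recommended setting.

```python
import math


def max_poll_interval(unavailable_window_s: float, failures_required: int) -> float:
    """Longest polling interval (seconds) that still guarantees
    failures_required probes land inside the unavailable window.

    Assumes evenly spaced probes (a simplification, not CDN-specific).
    """
    if unavailable_window_s <= 0:
        raise ValueError("unavailable_window_s must be positive")
    if failures_required < 1:
        raise ValueError("failures_required must be at least 1")
    return unavailable_window_s / failures_required


def detects_outage(poll_interval_s: float, unavailable_window_s: float,
                   failures_required: int) -> bool:
    """True if the worst-case number of probes in the window meets the threshold."""
    guaranteed = math.floor(unavailable_window_s / poll_interval_s)
    return guaranteed >= failures_required


if __name__ == "__main__":
    window = 20    # default window from this article, in seconds
    threshold = 2  # failures needed to mark an endpoint down (typical, per this article)
    print("max interval:", max_poll_interval(window, threshold), "seconds")
    for interval in (30, 15, 10, 5):
        print(f"interval={interval}s detects outage:",
              detects_outage(interval, window, threshold))
```

In practice, choose an interval somewhat below this bound to leave room for probe timeouts and scheduling jitter, while keeping it long enough that health check traffic stays negligible.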