Gorouter Process Crash Due to Race Condition

Article ID: 298398

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This Knowledge Base (KB) article describes a newly discovered bug that can cause the gorouter process to crash unexpectedly.

The gorouter is a critical component that routes incoming HTTP requests to the appropriate destination within the Tanzu Application Service (TAS) environment. If a gorouter crashes unexpectedly, clients sending HTTP requests into the platform can see intermittent connection resets.

The resets are intermittent because of the platform's architecture. Gorouters typically sit behind a load balancer, which runs periodic health checks to confirm that each gorouter is a suitable backend for proxying requests. If a gorouter crashes, the load balancer may not detect the crash until the next health check, so it keeps routing client requests to the crashed gorouter and those requests are reset. When only one gorouter is affected, only requests routed to it are reset; requests routed to unaffected gorouters are uninterrupted.

A change introduced in Go 1.20 has been identified as a contributing factor to the error condition that gorouter is experiencing.

The following Tanzu products contain a routing release that is compiled with Go 1.20 or later:

Tanzu Application Service for VMs 
  • 2.11.36+
  • 2.13.18+
  • 3.0.8+
  • 4.0.0+
Tanzu Isolation Segment releases
  • 2.11.30+
  • 2.13.15+
  • 3.0.8+
  • 4.0.0+

Log entries like the following in gorouter.stderr.log can help determine whether the race condition bug is responsible for a gorouter crash:
 
fatal error: concurrent map writes

goroutine 3061391056 [running]:
net/textproto.MIMEHeader.Set(...)
	/var/vcap/data/packages/golang-1.20-linux/52b7dfbcdc9e152c3a88836448f1dfe69abfc8d3/src/net/textproto/header.go:22
net/http.Header.Set(...)
	/var/vcap/data/packages/golang-1.20-linux/52b7dfbcdc9e152c3a88836448f1dfe69abfc8d3/src/net/http/header.go:40
code.cloudfoundry.org/gorouter/proxy/round_tripper.(*ErrorHandler).HandleError(0xc0096bfb00?, {0xd5f2d8, 0xc002927040}, {0xd544a0, 0xc00007c0e0})
	/var/vcap/data/compile/gorouter/src/code.cloudfoundry.org/gorouter/proxy/round_tripper/error_handler.go:53 +0x185
code.cloudfoundry.org/gorouter/proxy/round_tripper.(*roundTripper).RoundTrip(0xc0005303c0, 0xc01c402300)
	/var/vcap/data/compile/gorouter/src/code.cloudfoundry.org/gorouter/proxy/round_tripper/proxy_round_tripper.go:297 +0x3166
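
As a rough aid for checking a log file against the trace above, the following Go sketch looks for the fatal error line followed later by the ErrorHandler.HandleError frame. The two marker strings are copied from the trace; the function name containsRaceSignature and the default file path are illustrative assumptions, and a match only suggests, rather than proves, that this bug caused the crash.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Marker strings copied from the stack trace in this article.
const (
	fatalMarker = "fatal error: concurrent map writes"
	frameMarker = "round_tripper.(*ErrorHandler).HandleError"
)

// containsRaceSignature (hypothetical helper) reports whether the log text
// contains the fatal error line and, somewhere after it, the gorouter
// ErrorHandler frame.
func containsRaceSignature(log string) bool {
	i := strings.Index(log, fatalMarker)
	if i < 0 {
		return false
	}
	return strings.Contains(log[i:], frameMarker)
}

func main() {
	// Assumption: the log file is in the working directory unless a path
	// is passed as the first argument.
	path := "gorouter.stderr.log"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read log:", err)
		os.Exit(1)
	}
	if containsRaceSignature(string(data)) {
		fmt.Println("signature found: crash resembles the race condition in this article")
	} else {
		fmt.Println("signature not found")
	}
}
```

The sketch reads the whole file into memory, which keeps it short; for very large logs a line-by-line scan would be a better fit.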


Environment

Product Version: 2.13

Resolution

To mitigate the impact of this issue, consider reducing the health check polling interval between the load balancer and the gorouters. A shorter interval lets the load balancer detect a crashed gorouter sooner and remove it from the set of backends promptly.
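
To illustrate why a shorter interval helps, the sketch below estimates a rough upper bound on how long a crashed gorouter could keep receiving traffic. It assumes a load balancer that marks a backend unhealthy after a fixed number of consecutive failed checks and that the gorouter stays down for the whole window; the healthCheck type, its field names, and the example values are hypothetical and do not correspond to any specific load balancer.

```go
package main

import (
	"fmt"
	"time"
)

// healthCheck holds hypothetical load balancer settings. The field names
// are illustrative only.
type healthCheck struct {
	Interval           time.Duration // time between health checks
	Timeout            time.Duration // maximum time for one check to fail
	UnhealthyThreshold int           // consecutive failures before removal
}

// worstCaseDetection returns a rough upper bound on the detection delay:
// the crash happens just after a successful check, the next
// UnhealthyThreshold checks fail, and the last one takes the full timeout.
func (h healthCheck) worstCaseDetection() time.Duration {
	return time.Duration(h.UnhealthyThreshold)*h.Interval + h.Timeout
}

// Example settings, chosen only for illustration.
var (
	relaxed = healthCheck{Interval: 10 * time.Second, Timeout: 2 * time.Second, UnhealthyThreshold: 3}
	tight   = healthCheck{Interval: 2 * time.Second, Timeout: 1 * time.Second, UnhealthyThreshold: 2}
)

func main() {
	fmt.Println("relaxed settings, worst case:", relaxed.worstCaseDetection())
	fmt.Println("tight settings, worst case:  ", tight.worstCaseDetection())
}
```

Under these assumptions the bound scales with the interval and the failure threshold, so lowering either one shortens the period during which clients can see resets.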

The gorouter team has created a patch for this issue, which is bundled in routing release v0.288.0.

The following Tanzu products contain the patch:

Tanzu Application Service for VMs 
  • 2.11.52+
  • 2.13.34+
  • 4.0.16+
  • 5.0.6+
Tanzu Isolation Segment releases
  • 2.11.46+
  • 2.13.31+
  • 4.0.16+
  • 5.0.6+
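
The version lists in this article can be turned into a quick check. The Go sketch below classifies a Tanzu Application Service for VMs version using the patch numbers listed above. It rests on assumptions that the article does not state: that "N+" means the same release line at that patch level or higher, and that "4.0.0+" in the affected list is open-ended and covers later major versions. The function name classify and the status labels are illustrative, and Isolation Segment uses different patch numbers that are not encoded here.

```go
package main

import "fmt"

type status string

const (
	patched     status = "patched"
	affected    status = "affected"
	notAffected status = "not listed as affected"
)

// Patch numbers copied from this article's Tanzu Application Service for VMs
// lists, keyed by {major, minor}.
var (
	firstAffected = map[[2]int]int{{2, 11}: 36, {2, 13}: 18, {3, 0}: 8, {4, 0}: 0}
	firstPatched  = map[[2]int]int{{2, 11}: 52, {2, 13}: 34, {4, 0}: 16, {5, 0}: 6}
)

// classify (hypothetical helper) maps a "major.minor.patch" version string
// to a status based on the lists above.
func classify(v string) (status, error) {
	var major, minor, patch int
	if _, err := fmt.Sscanf(v, "%d.%d.%d", &major, &minor, &patch); err != nil {
		return "", fmt.Errorf("parse %q: %w", v, err)
	}
	line := [2]int{major, minor}
	if p, ok := firstPatched[line]; ok && patch >= p {
		return patched, nil
	}
	if a, ok := firstAffected[line]; ok && patch >= a {
		return affected, nil
	}
	// Assumption: "4.0.0+" is open-ended, so later lines without a listed
	// patch level are treated as affected.
	if major >= 4 {
		return affected, nil
	}
	return notAffected, nil
}

func main() {
	for _, v := range []string{"2.11.35", "2.13.20", "3.0.9", "4.0.16", "5.0.3"} {
		s, err := classify(v)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: %s\n", v, s)
	}
}
```

Note that the 3.0 line appears in the affected list but not in the patched list, so the sketch reports every 3.0.8+ version as affected.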