TCP connection gets reset when apps try to reuse an HTTP keep-alive connection

Article ID: 297790

Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

When an app reuses a keep-alive connection and tries to send a request over it, the connection is reset immediately or the request times out, with error messages such as the following:

Post https://login.<SYSTEM_DOMAIN>/<PATH>: read tcp IP:PORT-IP:443: read: connection reset by peer

I/O error on POST request for  "https://api.<SYSTEM_DOMAIN>/<PATH>": api.<SYSTEM_DOMAIN>:443 failed to respond; nested exception is org.apache.http.NoHttpResponseException: api.<SYSTEM_DOMAIN>:443 failed to respond

I/O error on GET request for "https://autoscale.<SYSTEM_DOMAIN>/<PATH>": Read timed out; nested exception is java.net.SocketTimeoutException: Read timed out

Environment


Cause

When an app running on Elastic Application Runtime (EAR) accesses another endpoint on the foundation, the network route looks like this:

App (in container) -> Diego Cell -> Gateway (e.g. NAT gateway) -> Load Balancer (LB) -> Gorouter -> Destination

If the app has established an HTTP keep-alive connection to the destination, the LB / NAT gateway (or another network node such as a firewall) may drop the connection after it has been idle for a certain time, without sending a TCP RST back to notify the client (the app).

The LB / NAT gateway usually drops connections for one of these reasons:

  1. Physical resource limits. It can only maintain a certain number of concurrent connections, so it typically removes the mappings of older connections from memory.
  2. Customer configuration. It is configured not to allow a connection to stay idle longer than a certain time (e.g. 5 minutes).
  3. IaaS default settings. GCP/Azure LBs have a 4-minute idle timeout by default.

Because the HTTP keep-alive connection is dropped without a TCP RST, the client app still regards it as alive. When the app then tries to send a request over the dropped connection, it either receives a TCP RST immediately or receives nothing at all (Azure LB).
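
The behavior described above can be summarized as a small decision model. This is only an illustrative sketch, not code from any of the components involved: the function name, the `Outcome` labels, and the example timeout values are assumptions chosen to mirror the symptoms listed in this article.

```python
from enum import Enum


class Outcome(Enum):
    OK = "response received"
    RESET = "connection reset by peer"
    TIMEOUT = "read timed out"


def reuse_outcome(idle_s, hop_idle_timeout_s, hop_replies_with_rst):
    """Model what a client sees when it reuses a keep-alive connection.

    idle_s: how long the connection sat idle before being reused.
    hop_idle_timeout_s: idle time after which the LB / NAT gateway
        silently drops the connection (hypothetical value).
    hop_replies_with_rst: True if the hop answers traffic on a dropped
        connection with a TCP RST, False if it stays silent.
    """
    if idle_s < hop_idle_timeout_s:
        # The hop still tracks the connection, so the request goes through.
        return Outcome.OK
    # The hop has already forgotten the connection; the client only
    # finds out when it tries to send.
    return Outcome.RESET if hop_replies_with_rst else Outcome.TIMEOUT


if __name__ == "__main__":
    print(reuse_outcome(60, 240, True).value)
    print(reuse_outcome(300, 240, True).value)
    print(reuse_outcome(300, 240, False).value)
```

In this model the silent case corresponds to the "Read timed out" symptom, and the RST case to "connection reset by peer".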

Resolution

Most LBs / NAT gateways support sending a TCP RST when they drop idle connections, and the feature is enabled by default. We recommend not disabling it.

Azure LB did not support sending a TCP RST when dropping idle connections, and this was not configurable. For a workaround, refer to "Azure Networking Connection Idle for more than Four minutes". According to an update from Azure on September 24, 2018, "Azure Load Balancer TCP resets on idle in preview", the feature to enable TCP RST is in preview.
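
One common client-side way to keep an idle connection from reaching an idle timeout is TCP keepalive. The sketch below is an assumption about what such a workaround can look like, not a reproduction of the linked article. The helper name and the default values are hypothetical, and the `TCP_KEEP*` options are platform-specific, so they are applied only where Python exposes them.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Turn on TCP keepalive so probes are sent while the connection is idle.

    idle_s should be shorter than the idle timeout of the LB / NAT
    gateway (e.g. the 4-minute default mentioned above).
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below exist on Linux; other platforms may lack them
    # or use different names, hence the hasattr guards.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_tcp_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

HTTP client libraries usually manage their own sockets, so applying these options in a real app depends on the hooks that library provides.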

If TCP RST is not supported on one of the network components, we recommend the following configuration:
In Operations Manager, select Elastic Application Runtime > Settings > Network, and set "Frontend Idle Timeout for Gorouter and HAProxy" to a value less than the idle time after which the LB / NAT gateway drops connections. With "Frontend Idle Timeout for Gorouter and HAProxy" reduced from its default of 15 minutes, Gorouter can close idle connections gracefully by sending back a TCP RST before the LB / NAT gateway drops them.
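
As a rough sketch of how to choose the value, the helper below picks a frontend idle timeout that stays below the shortest idle timeout among the network hops. The function name, the 30-second safety margin, and the NAT gateway value in the usage example are assumptions for illustration. Only the 15-minute default and the 4-minute LB timeout come from this article.

```python
DEFAULT_FRONTEND_IDLE_TIMEOUT_S = 900  # 15 minutes, the default named above


def pick_frontend_idle_timeout(hop_idle_timeouts_s, margin_s=30):
    """Return a frontend idle timeout (seconds) below every hop's idle timeout.

    hop_idle_timeouts_s: mapping of hop name -> idle timeout in seconds.
    margin_s: hypothetical safety margin so Gorouter closes the
        connection clearly before any hop drops it.
    """
    if not hop_idle_timeouts_s:
        raise ValueError("at least one hop idle timeout is required")
    shortest = min(hop_idle_timeouts_s.values())
    # Never exceed the default; otherwise stay margin_s below the shortest hop.
    candidate = min(DEFAULT_FRONTEND_IDLE_TIMEOUT_S, shortest - margin_s)
    if candidate <= 0:
        raise ValueError("shortest hop timeout is too small for the chosen margin")
    return candidate


if __name__ == "__main__":
    # 240 s is the 4-minute LB default; 300 s for the NAT gateway is made up.
    hops = {"load_balancer": 240, "nat_gateway": 300}
    print(pick_frontend_idle_timeout(hops))
```

With a 4-minute LB timeout and the assumed margin, this yields 210 seconds, which is the value that would be entered for "Frontend Idle Timeout for Gorouter and HAProxy".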