Some applications may observe timeouts or connection reset errors when sending and receiving data from within a container on high-latency networks. The following example explains the problem in more detail, although other scenarios with similar but slightly different symptoms are likely to exist. We highlight the important symptoms so that you can determine whether this KB matches the issue your team is experiencing.
Example Scenario
In this example, an application sends POST request data to an external resource from a Linux Diego cell running the Ubuntu Xenial stemcell, version 621.x. The application sends a POST request with a 1 MB payload to the external resource, and the request flow looks like this:
Application Container -> Diego Cell -> Firewall -> NAT -> External Resource
As the call flow shows, the POST request has to traverse two network devices before reaching its destination. Here are the symptoms we observe when running a tcpdump:
Column names for reference
Frame, Timestamp, SRC-IP, DST-IP, Length, Sequence, Acknowledgment
Frame 68 is 2554 bytes long and carries data to the external resource. The Diego cell's TCP stack expects to receive an acknowledgment with acknowledgment number 102222 in response:
68 2020-12-15 22:41:40.266698 DiegoCell-IP External-Resource-IP 2554 99734 4501
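The expected acknowledgment number can be checked by hand. The Length column appears to be the on-wire frame size rather than the TCP payload size, so the header overhead has to be subtracted first. The sketch below assumes 66 bytes of headers per data frame, a value inferred from the payload-free acknowledgment in frame 114 (which is 66 bytes long) and not stated in the capture; `expected_ack` and `HEADER_BYTES` are illustrative names.

```python
# Assumption: each data frame carries 66 bytes of headers, inferred from
# the payload-free ACK in frame 114 (for example 14 Ethernet + 20 IPv4 +
# 32 TCP with options). The capture itself does not state this.
HEADER_BYTES = 66


def expected_ack(seq: int, frame_len: int, header_bytes: int = HEADER_BYTES) -> int:
    """Return the acknowledgment number that confirms receipt of a segment."""
    payload_bytes = frame_len - header_bytes
    return seq + payload_bytes


# Frame 68: Length 2554, Sequence 99734
print(expected_ack(99734, 2554))  # 102222
```

Under that assumption, frame 68 carries 2488 bytes of payload, and 99734 + 2488 gives the 102222 seen in the trace.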
About 112 ms later, the acknowledgment is received in frame 114. During that delay, much more data has been sent to the external resource and acknowledged, so other acknowledgments with an acknowledgment number higher than 102222 have already been received:
114 2020-12-15 22:41:40.378925 External-Resource-IP DiegoCell-IP 66 4501 102222
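The delay between the data frame and its acknowledgment can be computed directly from the two trace rows. This is a minimal sketch that assumes rows are whitespace-separated in the column order listed above; the `Frame` class and `parse_frame` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Frame:
    number: int
    timestamp: datetime
    src: str
    dst: str
    length: int
    seq: int
    ack: int


def parse_frame(line: str) -> Frame:
    """Parse one row: Frame, Timestamp (date and time), SRC-IP, DST-IP, Length, Sequence, Acknowledgment."""
    number, date, time, src, dst, length, seq, ack = line.split()
    timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")
    return Frame(int(number), timestamp, src, dst, int(length), int(seq), int(ack))


data = parse_frame(
    "68 2020-12-15 22:41:40.266698 DiegoCell-IP External-Resource-IP 2554 99734 4501"
)
late_ack = parse_frame(
    "114 2020-12-15 22:41:40.378925 External-Resource-IP DiegoCell-IP 66 4501 102222"
)

delay_ms = (late_ack.timestamp - data.timestamp).total_seconds() * 1000
print(f"{delay_ms:.3f} ms")  # 112.227 ms
```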
The Diego cell returns a TCP RESET-REPLY to this packet and continues to send data to the external resource. That is, the Diego cell sends the reset without the acknowledgment flag set and does not abort the TCP connection. For more information on the difference between a RESET-REPLY and a RESET-ABORT, see this blog post. Keep in mind that this reset does not come from the application container's network interface; it comes from the Diego cell's network interface only.
115 2020-12-15 22:41:40.378950 DiegoCell-IP External-Resource-IP 54 102222 0
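A reset sent in reply to an ACK segment takes its sequence number from that segment's acknowledgment field and carries no acknowledgment of its own, which is the relationship between frames 114 and 115. The sketch below uses that relationship as a heuristic for spotting candidate reset-replies in a trace. It is only a heuristic: the columns above do not include TCP flags, so a match should be confirmed against the RST flag in the full capture. The dictionary layout and the function name are assumptions made for illustration.

```python
def is_reset_reply_candidate(inbound: dict, outbound: dict) -> bool:
    """Heuristic: does `outbound` look like a reset sent in reply to `inbound`?

    Flags are not available in the trace columns, so this only checks that
    the reply goes back to the sender, reuses the inbound acknowledgment
    number as its sequence number, and carries acknowledgment number 0.
    """
    return (
        outbound["src"] == inbound["dst"]
        and outbound["dst"] == inbound["src"]
        and outbound["seq"] == inbound["ack"]
        and outbound["ack"] == 0
    )


frame_68 = {"src": "DiegoCell-IP", "dst": "External-Resource-IP", "seq": 99734, "ack": 4501}
frame_114 = {"src": "External-Resource-IP", "dst": "DiegoCell-IP", "seq": 4501, "ack": 102222}
frame_115 = {"src": "DiegoCell-IP", "dst": "External-Resource-IP", "seq": 102222, "ack": 0}

print(is_reset_reply_candidate(frame_114, frame_115))  # True
print(is_reset_reply_candidate(frame_68, frame_114))   # False
```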
The Diego cell resets this acknowledgment because of the way IP conntrack handles inbound TCP packets. A security feature within IP conntrack inspects each packet to ensure that the arriving frame is valid and can be forwarded along the normal chain. In this case, because the ack arrived late and much more data had been transmitted in the meantime, IP conntrack determines that the packet is invalid and issues a reset.
The firewall receives the reset and passes it on through the NAT, which resets the TCP session between the NAT and the external resource. The firewall, however, keeps its connection to the Diego cell open, which makes the application think the external resource is receiving all the data it has sent. The external resource will never reply, and eventually the application times out the request and fails.
Troubleshooting tip
One simple test to see whether you are experiencing this issue is to construct a request with curl that matches the one the application is performing, and execute it while ssh'ed into the application container. Then run the same test while ssh'ed into the Diego cell that the application container is running on. If the test from the Diego cell works but the test from the application container fails, you are likely experiencing this issue.
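To keep the two tests identical, it can help to generate the payload and the curl command line once and reuse them in both places. The following is a minimal sketch, assuming Python 3.8 or later is available on a workstation; the URL, the payload file name, and the 60-second limit are placeholders, and the real request should mirror the application's actual method, headers, and body. Nothing is sent by this code; it only prints the command to run.

```python
import shlex

PAYLOAD_BYTES = 1024 * 1024  # 1 MB, as in the example scenario


def make_payload(path: str, size: int = PAYLOAD_BYTES) -> None:
    """Write a dummy payload file of `size` bytes."""
    with open(path, "wb") as f:
        f.write(b"x" * size)


def build_curl_command(url: str, payload_path: str, max_time_s: int = 60) -> list:
    """Return a curl argument list that POSTs the payload and prints the HTTP status."""
    return [
        "curl",
        "--silent",
        "--show-error",
        "--output", "/dev/null",
        "--write-out", "%{http_code}\\n",  # curl expands the \n itself
        "--max-time", str(max_time_s),     # fail instead of hanging forever
        "--request", "POST",
        "--data-binary", f"@{payload_path}",
        url,
    ]


# Hypothetical endpoint; replace with the external resource the app calls.
command = build_curl_command("http://example.invalid/upload", "payload.bin")
print(shlex.join(command))
```

Call `make_payload("payload.bin")` to create the file, copy it to both locations, and run the printed command once from the application container and once from the Diego cell. If the container run times out while the Diego cell run returns a status code, that matches the symptom described above.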