Users browsing the web via WSS experience random slowness
Users sometimes get page not found errors, and the F12 developer tools output occasionally highlights connection errors
Random new TCP SYN requests into WSS were not being answered because the previous TCP connection on that port was still closing, specifically sitting in the FIN_WAIT_2 state.
WSS goes into the FIN_WAIT_2 state after the WSS Proxy sends the client a TCP FIN, which the client ACKs. The client should in turn send its own TCP FIN to the WSS Proxy, which the proxy then ACKs, moving us from FIN_WAIT_2 to TIME_WAIT, where we wait 30 seconds before freeing up resources (see TCP state machine).
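The close sequence described above can be sketched as a small transition table. This is a simplified illustration of the active-close path of the standard TCP state machine only; the event names (`send_fin`, `recv_ack`, `recv_fin`, `timer_expired`) and the `walk` helper are hypothetical labels chosen for this example, not part of any WSS or ProxySG interface.

```python
# Simplified active-close path of the TCP state machine.
# Event names are illustrative labels, not real API identifiers.
TRANSITIONS = {
    ("ESTABLISHED", "send_fin"): "FIN_WAIT_1",   # proxy sends its FIN
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",    # client ACKs the proxy's FIN
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",     # client sends its own FIN, proxy ACKs it
    ("TIME_WAIT", "timer_expired"): "CLOSED",    # resources freed after the wait
}


def walk(events, state="ESTABLISHED"):
    """Apply events in order; an event with no transition leaves the state unchanged."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state


# Healthy close: the client eventually sends its FIN.
print(walk(["send_fin", "recv_ack", "recv_fin", "timer_expired"]))  # CLOSED

# Problem case: the client's FIN never arrives, so the proxy stays in FIN_WAIT_2.
print(walk(["send_fin", "recv_ack"]))  # FIN_WAIT_2
```

In the second case nothing short of a timeout moves the connection forward, which is why the port stays occupied.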
Due to rogue client(s) or an issue with a NAT firewall, the WSS Proxy did not move out of the FIN_WAIT_2 state for an hour, so TCP ports were hogged even though no data was crossing the session. Here's an example of such a session. Note that the PCAP captured only TCP packets with the SYN or FIN flag set, to follow open and close requests:
- packet 30359 is the FIN from the Proxy towards the client, which does not get a corresponding TCP FIN in the other direction
Looking at the state of the TCP connection table on the Proxy (dumped every 60 seconds), we can see that the connection is stuck in FIN_WAIT_2 state throughout this time, and therefore cannot handle any SYN requests generated with this source TCP port.
$ grep 30202 neil-https-ggblo-dp4-1-tcp-conncs*
neil-https-ggblo-dp4-1-tcp-conncs.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2
neil-https-ggblo-dp4-1-tcp-conncs2.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2
neil-https-ggblo-dp4-1-tcp-conncs3.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2
neil-https-ggblo-dp4-1-tcp-conncs4.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2
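The same check can be done programmatically when there are many dumps to go through. The following is a minimal sketch, assuming the grep output format shown above (`file:proto recv-q send-q local foreign state`) and the 60-second dump interval; the function name `stuck_dumps` and the constant `DUMP_INTERVAL_S` are made up for this example.

```python
DUMP_INTERVAL_S = 60  # connection table is dumped every 60 seconds


def stuck_dumps(grep_lines, client_port, state="FIN_WAIT_2"):
    """Return the dump files in which the given client port is seen in `state`."""
    hits = []
    for line in grep_lines:
        fname, _, entry = line.partition(":")
        fields = entry.split()
        if len(fields) != 6:
            continue  # not a connection-table row in the expected layout
        foreign, conn_state = fields[4], fields[5]
        # Addresses are printed as ip.port, so the port is the last dotted field.
        if foreign.rsplit(".", 1)[-1] == str(client_port) and conn_state == state:
            hits.append(fname)
    return hits


lines = [
    "neil-https-ggblo-dp4-1-tcp-conncs.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2",
    "neil-https-ggblo-dp4-1-tcp-conncs2.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2",
    "neil-https-ggblo-dp4-1-tcp-conncs3.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2",
    "neil-https-ggblo-dp4-1-tcp-conncs4.out:tcp4 0 0 192.168.4.83.8084 10.230.2.245.30202 FIN_WAIT_2",
]

hits = stuck_dumps(lines, 30202)
# Seen in N consecutive dumps means stuck for at least (N - 1) dump intervals.
min_stuck_s = (len(hits) - 1) * DUMP_INTERVAL_S
print(len(hits), "dumps, stuck for at least", min_stuck_s, "seconds")
```

The lower bound assumes the dumps passed in are consecutive; gaps between files would need timestamps to verify.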
Needed to change the timeout on the WSS Proxy to kill connections in FIN_WAIT_2 state faster and work around client-side issues closing TCP connections.
Proxy Forwarding integration with WSS
About 5k users going into WSS via on premise ProxySG, using one egress IP address
Proxy Forwarding best practice followed with the two key parameters set:
- SGOS#(config) tcp-ip inet-lowport 1024
- SGOS#(config) tcp-ip tcp-randomize-port enable
Requests for TCP 8080, 8443 and 8084 being sent to 3 different pods in the data center connected to
TCP 8084 generating the majority of the traffic
Reduced FIN_WAIT_2 timeout parameter on the WSS Proxy. There is no specific configurable timeout but a code change was made to reduce it to 5 minutes.
Pushed change out globally to all WSS sites.
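To illustrate the effect of the shorter timeout, here is a minimal sketch of a reaper that drops connections sitting in FIN_WAIT_2 for longer than 5 minutes. It is not the actual WSS code change, whose implementation is not described here; the connection record layout (`port`, `state`, `since`) and the `reap` function are assumptions made purely for this example.

```python
FIN_WAIT_2_TIMEOUT_S = 5 * 60  # the reduced timeout described above


def reap(connections, now, timeout=FIN_WAIT_2_TIMEOUT_S):
    """Split connections into (kept, reaped) based on time spent in FIN_WAIT_2.

    Each connection is a dict with hypothetical keys:
      port  - client source port
      state - TCP state name
      since - time (seconds) at which the connection entered that state
    """
    kept, reaped = [], []
    for conn in connections:
        expired = conn["state"] == "FIN_WAIT_2" and now - conn["since"] >= timeout
        (reaped if expired else kept).append(conn)
    return kept, reaped


table = [
    {"port": 30202, "state": "FIN_WAIT_2", "since": 0},     # client never sent its FIN
    {"port": 30417, "state": "FIN_WAIT_2", "since": 500},   # only just started closing
    {"port": 31005, "state": "ESTABLISHED", "since": 0},    # active session, never reaped
]

kept, reaped = reap(table, now=600)
print([c["port"] for c in reaped])  # [30202]
print([c["port"] for c in kept])    # [30417, 31005]
```

With a timeout of this kind, a port held by a client that never sends its FIN becomes reusable after minutes rather than staying blocked for the hour observed in the capture.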
On Premise Proxy reporting retransmission errors regularly, which coincides with the time user reports come in as shown below