Datapath issues in the TCP stack have been observed after upgrading to v31.1.1.
The symptoms include:
High SE connection memory usage being reported, even for SEs with no VSs placed on them.
Client requests failing with "Timed out waiting for HTTP request from client".
TCP and HTTP Health monitors failing from SEs.
Connections failing on L4 TCP and L7 VSs.
Application slowness with high data transfer times.
Virtual Service Down due to Health Monitor Down with Error: "Address in use/unavailable Status"
Environment
All Avi deployments upgraded to 31.1.1 are vulnerable to these issues.
Cause
The issues are caused by a compiler change that came with the Ubuntu version upgrade in 31.1.1.
The slowness and high SE memory usage are attributed to a buildup of TCP connections in Time Wait state.
This can be identified from the SE mallocstats, where the M_TCPTW value is very large and does not decrease.
Listing the tcp-flows on the SEs also shows a large number of connections that have not been freed.
Connection failures reported for L4 TCP and L7 VSs are attributed to the SYN cache table not being cleared out.
This can likewise be identified from the mallocstats, where the M_SYNCACHE value is very large.
Because of the bug, every connection creates an entry in the SE SYN cache table, but that entry does not get cleared on timeout.
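The "very large and does not go down" condition described above can be checked mechanically once a few M_TCPTW or M_SYNCACHE values have been noted from successive mallocstats runs. The following is only a minimal sketch: the function name, the sample values, and the threshold are hypothetical, and this article does not define what counts as "very large", so the threshold is an operator's judgment call.

```python
from typing import Sequence


def looks_stuck(samples: Sequence[int], threshold: int) -> bool:
    """Return True when every sample is at or above `threshold` and the
    series never decreases from one sample to the next.

    `samples` are values read by hand from successive mallocstats runs
    (oldest first). `threshold` is an assumed, operator-chosen limit.
    """
    if len(samples) < 2:
        return False  # a single reading cannot show a trend
    all_large = all(value >= threshold for value in samples)
    never_drops = all(
        later >= earlier for earlier, later in zip(samples, samples[1:])
    )
    return all_large and never_drops


# Made-up readings for illustration only, not taken from a real SE.
readings = [510_000, 512_400, 515_900]
print(looks_stuck(readings, threshold=500_000))
```

A healthy SE would normally show the counter falling again as entries time out, so a series that dips between readings returns False here even if individual values are high.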
How to identify if you're running into these issues:
Check whether M_TCPTW or M_SYNCACHE are showing large values.
1) Log in to the CLI.
2) Execute:
[admin:cntlr]:> show serviceengine <se-name> mallocstats
A sample output with high M_TCPTW and M_SYNCACHE respectively:
For failing connections, check the VS logs. In the UI, navigate to the VS and switch to the Logs tab.
You should see logs with the significance "Connection closed abnormally: timed out waiting for HTTP request from client".
To verify whether TCP connections are stuck in Time Wait state, list the tcp-flows on the SE:
1) Log in to the CLI.
2) Execute:
[admin:cntlr]:> show serviceengine <se-name> tcp-flows | grep "None" | wc -l
In the listed tcp-flows, the state "None" signifies that the sockets are in Time Wait state. If there are a large number of such connections, you are running into this issue. Do not run the tcp-flows command very frequently.
Note: You may see only a subset of the above symptoms, not all of them.
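Because the tcp-flows command should not be run often, one option is to capture its output once and do the counting offline. The sketch below mirrors what `grep "None" | wc -l` does: it counts the lines that contain the substring "None". The function name is hypothetical, and the sketch makes no assumption about the column layout of the tcp-flows output.

```python
def count_matching_lines(output: str, needle: str = "None") -> int:
    """Count lines in `output` that contain `needle`.

    Equivalent in spirit to piping text through grep "<needle>" | wc -l:
    a line is counted once, however many times the needle appears in it.
    """
    return sum(1 for line in output.splitlines() if needle in line)
```

Passing the text of a saved capture to `count_matching_lines` gives the same number as the CLI pipeline, and the capture can be re-examined later without querying the SE again.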
Resolution
The bug details are as follows:
AV-226288: Client requests may fail with the error Timed out waiting for HTTP request from client.
AV-232608: High Service Engine memory and intermittent connection failures due to stale TCP connections in Time Wait state.
These issues are fixed in v31.1.1-2p2, which is released and available for download on the support portal.