TCP datapath issues post upgrade to v31.1.1
Article ID: 394524

Products

VMware Avi Load Balancer

Issue/Introduction

  • Datapath issues affecting the TCP stack have been observed after upgrading to v31.1.1.
  • The symptoms include:
    • High SE connection memory usage reported even for SEs with no VSs placed on them.
    • Client requests failing with "Timed out waiting for HTTP request from client".
    • TCP and HTTP Health monitors failing from SEs.
    • Connections failing on L4 TCP and L7 VSs.
    • Application slowness with high data transfer times.
    • Virtual Service marked down because the Health Monitor is down, with the error "Address in use/unavailable Status".

Environment

  • All Avi deployments upgraded to 31.1.1 are vulnerable to these issues. 

Cause

  • The issues are caused by a compiler change that came with the Ubuntu version upgrade in 31.1.1.
  • The slowness and high SE memory usage are attributed to a buildup of TCP connections in Timed Wait (TIME_WAIT) state.
  • This can be identified from the SE mallocstats, where M_TCPTW is very large and does not decrease.
  • Listing the tcp-flows on the SEs also shows a large number of connections that have not been freed.
  • Connection failures reported for L4 TCP and L7 VSs are attributed to the SYN cache table not being cleared.
  • This can likewise be identified from the mallocstats when M_SYNCACHE is very large.
  • Because of the bug, every connection creates an entry in the SE SYN cache table, but that entry is not cleared on timeout.
  • How to identify whether you are running into these issues:
    • Check whether M_TCPTW or M_SYNCACHE show large values.
      1) Log in to the CLI.
      2) Execute:
      [admin:cntlr]:> show serviceengine <se-name> mallocstats
    • A sample output with high M_TCPTW and M_SYNCACHE respectively:

    • For failing connections, check the VS logs. In the UI, navigate to the VS and switch to the Logs tab.
    • You should see logs with the significance "Connection closed abnormally: timed out waiting for HTTP request from client".
    • To verify whether TCP connections are stuck in Timed Wait state, list the tcp-flows on the SE:
      1) Log in to the CLI.
      2) Execute:
      [admin:cntlr]:> show serviceengine <se-name> tcp-flows | grep "None" | wc -l
    • In the listed tcp-flows, the state "None" signifies that the socket is in Timed Wait state. A large number of such connections indicates that you are running into this issue. Do not run the tcp-flows command very frequently.
  • Note: You may see only a subset of the above symptoms rather than all of them.
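
The two checks above can also be run against CLI output saved to a text file, which avoids re-running tcp-flows on the SE. The following is a minimal sketch, not an official tool. The function names are hypothetical. Because this article does not show the mallocstats layout, the parser assumes only that each relevant line contains the type name followed by one or more integer columns, and it does not interpret what those columns mean. No threshold is applied, since the article gives no numeric definition of "very large".

```python
import re

# Malloc types named in this article as indicators of the issue.
WATCHED_TYPES = ("M_TCPTW", "M_SYNCACHE")


def extract_malloc_rows(mallocstats_text):
    """Return {type_name: [int, ...]} for the watched malloc types.

    Assumption: each relevant line holds the type name followed by
    integer columns. The columns are returned as-is, uninterpreted.
    """
    rows = {}
    for line in mallocstats_text.splitlines():
        for name in WATCHED_TYPES:
            # Match the type name as a whole token, not as a substring.
            if re.search(r"\b" + re.escape(name) + r"\b", line):
                after_name = line.split(name, 1)[1]
                rows[name] = [int(n) for n in re.findall(r"\d+", after_name)]
    return rows


def count_none_state_flows(tcp_flows_text):
    """Count lines containing "None", mirroring `grep "None" | wc -l`."""
    return sum(1 for line in tcp_flows_text.splitlines() if "None" in line)
```

To use it, capture the output of each command once, pass the text to the matching function, and compare the results across captures taken some time apart. Values for M_TCPTW or M_SYNCACHE that keep growing, or a "None" count that stays high, match the behaviour described in this section.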


Resolution