The NSX Edge Firewall framework drops subsequent TCP SYN packets due to a 'Failed expected state'.

Article ID: 431285

Products

VMware NSX

Issue/Introduction

  • TCP SYN Packets Dropped with "Failed expected state" due to Half-Closed Connections in NSX Firewall.

  • Packets are being dropped intermittently between the Scheduler App and AppsManager within the VMware TAS environment, resulting in 'Connect timed out' failures.

  • High-level packet flow:

    Scheduler Apps (Pod running on Diego Cell VM) ---> NSX Load Balancer VIP ---> Backend Pool Member (GoRouters).

    The Scheduler Apps are microservices running on Diego Cell VM.

    Traffic originating from the Scheduler applications is routed to the GoRouters, where it is distributed using a round-robin load-balancing algorithm. The round-robin algorithm operates at Layer 7 (HTTP): the GoRouters maintain TCP connections with multiple instances of an application and distribute HTTP requests across these instances.

    Refer: https://docs.cloudfoundry.org/concepts/http-routing.html#round-robin

  • The majority of sessions involving the Scheduler IPs persist in a half-closed state (ESTABLISHED:FIN_WAIT_2 in the output below) because the client does not send a TCP FIN packet to close its side of the connection. Under the current Firewall framework, the timeout for a half-closed TCP connection is 900 seconds (refer to Default Session Timer Values).
    Consequently, these sessions remain in the connection table for an extended period.

    Edge> get firewall <T1_SR_Uplink_Interface> connection | find <Scheduler_App_IP>
    172.##.##.20:48378  -> 172.##.##.25:3306  dir in protocol tcp state ESTABLISHED:ESTABLISHED f-2060 n-0   
    172.##.##.22:40654  -> 172.##.##.24:3306  dir in protocol tcp state ESTABLISHED:ESTABLISHED f-2060 n-0   
    100.##.##.3:25342 (172.##.##.23:34420) -> 172.##.##.13:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 56  
    100.##.##.3:25436 (172.##.##.23:52728) -> 172.##.##.11:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 97  
    100.##.##.3:25529 (172.##.##.23:42440) -> 172.##.##.13:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 97  
    100.##.##.3:25600 (172.##.##.23:60564) -> 172.##.##.11:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 97  
    100.##.##.3:25590 (172.##.##.23:60580) -> 172.##.##.13:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 97  
    100.##.##.3:25507 (172.##.##.23:60594) -> 172.##.##.12:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 97

     

  • The 'f-2060' field in the connection entries identifies the Firewall rule that the connections matched (Rule ID 2060).

    Edge> get firewall <T1_SR_Uplink_Interface> ruleset rules
    <output omitted for brevity>
    Firewall rule count: 1
        Rule ID : 2060
        Rule : inout protocol any from any to any accept

     

  • Packet captures taken at the Scheduler App switchport confirm that the Scheduler VM never sends a TCP FIN packet.

  • A packet capture on the NSX Tier-1 SR uplink interface shows TCP SYN collisions.

    13:55:56.144240 IP 172.##.##.2.43106 > 10.##.##.129.https: Flags [S], seq 3763703824, win 64240, options [mss 1460,sackOK,TS val 2926178674 ecr 0,nop,wscale 7], length 0
    13:55:56.188793 IP 172.##.##.2.43114 > 10.##.##.129.https: Flags [S], seq 110398818, win 64240, options [mss 1460,sackOK,TS val 2926178718 ecr 0,nop,wscale 7], length 0

     

  • A packet capture taken on the switch port of the Scheduler App (client) indicates that the source port is being reused.



  • Multiple samples of the get firewall <T1_SR_Uplink_Interface> interface stats command, collected at different time intervals, show a consistent increase in packet drops. The simultaneous increment of both the 'Input packets dropped' and 'Failed expected state' counters confirms that the packets are indeed being dropped due to 'Failed expected state'.

    Edge> get firewall <T1_SR_Uplink_Interface> interface stats 
    <output omitted for brevity>
    Connections per second                  : 93
    Drop by IPsec policy                    : 0
    Drop by LB                              : 0
    .
    .
    Failed NAT connection limit             : 0
    Failed NAT translation                  : 0
    .
    .
    Failed expected state                   : 26319 <======================
    .
    .
    Input bytes allowed                     : 31956750139598
    Input bytes dropped                     : 1638102
    Input dropped packets copied            : 12630921
    Input encrypted packets                 : 0
    Input fastforwarded                     : 14978202844
    Input fragments dequeued                : 0
    Input fragments queued                  : 0
    Input fragments released                : 0
    Input of inactive context               : 0
    Input packets allowed                   : 31042192627
    Input packets dropped                   : 26102 <======================

         Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
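
The connection-table check described above can be scripted when the output is long. The following is a minimal sketch, assuming the output of the get firewall <T1_SR_Uplink_Interface> connection command has been saved as text and that each entry carries the "state <A>:<B> f-<rule id>" fields seen in the sample. The function names summarize_connections and half_closed_ratio are hypothetical helpers and are not part of any NSX tooling; the exact output format may differ between NSX versions.

```python
import re
from collections import Counter

# Matches the "state <A>:<B>" pair and the "f-<rule id>" field as they appear
# in the sample output in this article (assumed format, may vary by version).
STATE_RE = re.compile(r"\bstate\s+(\S+):(\S+)\s+f-(\d+)")


def summarize_connections(text):
    """Count connection entries per (state pair, firewall rule id)."""
    counts = Counter()
    for line in text.splitlines():
        match = STATE_RE.search(line)
        if match is None:
            continue  # skip prompts, blank lines, and unrelated output
        src_state, dst_state, rule_id = match.groups()
        counts[(f"{src_state}:{dst_state}", int(rule_id))] += 1
    return counts


def half_closed_ratio(counts):
    """Share of parsed entries whose state pair contains a FIN_WAIT state."""
    total = sum(counts.values())
    half_closed = sum(n for (state, _), n in counts.items() if "FIN_WAIT" in state)
    return half_closed / total if total else 0.0


# Three entries copied from the sample output in this article.
SAMPLE = """\
172.##.##.20:48378  -> 172.##.##.25:3306  dir in protocol tcp state ESTABLISHED:ESTABLISHED f-2060 n-0
100.##.##.3:25342 (172.##.##.23:34420) -> 172.##.##.13:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 56
100.##.##.3:25436 (172.##.##.23:52728) -> 172.##.##.11:443 (10.##.##.129:443) dir in protocol tcp state ESTABLISHED:FIN_WAIT_2 f-2060 n-0 expire 97
"""

if __name__ == "__main__":
    counts = summarize_connections(SAMPLE)
    for (state, rule_id), n in counts.most_common():
        print(f"{state:30} rule {rule_id}: {n}")
    print(f"half-closed share: {half_closed_ratio(counts):.0%}")
```

A high share of FIN_WAIT entries for the Scheduler IPs, sampled at several points in time, is consistent with the half-closed condition described in this article.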

Environment

VMware NSX

Cause

This issue occurs because the client application does not send a TCP FIN packet to terminate its side of the connection. Under the default Firewall configuration, a half-closed TCP connection is retained for 15 minutes (900 seconds). As the client cycles through its ephemeral source ports for new connections, a new TCP SYN packet eventually reuses the source port of a half-closed connection that still exists in the Firewall's state table. The Firewall drops the new SYN packet because it does not match the expected state of the existing entry. This is expected Firewall behavior.
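
The mechanism can be illustrated with a toy model. This is a sketch only and does not represent how the NSX Edge Firewall is implemented: the class ToyStateTable, the function syn_verdict, the documentation-range IP addresses, and the 300-second and 901-second reuse intervals are all hypothetical. The only value taken from this article is the 900-second half-closed timeout.

```python
from dataclasses import dataclass, field

DEFAULT_CLOSING_TIMER = 900  # seconds; half-closed timeout cited in this article


@dataclass
class ToyStateTable:
    """Toy model of a stateful firewall table; NOT the NSX implementation."""

    closing_timer: int
    # Maps a flow 4-tuple to the time (seconds) at which its half-closed
    # entry would be purged from the table.
    half_closed: dict = field(default_factory=dict)

    def half_close(self, flow, now):
        """One peer has closed; the entry lingers until the timer elapses."""
        self.half_closed[flow] = now + self.closing_timer

    def new_syn(self, flow, now):
        """Return the verdict for a SYN that reuses an existing 4-tuple."""
        expires_at = self.half_closed.get(flow)
        if expires_at is not None and now < expires_at:
            # The entry expects the rest of the close handshake, not a SYN.
            return "drop"
        self.half_closed.pop(flow, None)  # entry has aged out
        return "accept"


def syn_verdict(closing_timer, reuse_after):
    """Half-close a flow at t=0, then reuse its source port at t=reuse_after."""
    table = ToyStateTable(closing_timer)
    flow = ("192.0.2.23", 34420, "198.51.100.129", 443)  # hypothetical 4-tuple
    table.half_close(flow, now=0)
    return table.new_syn(flow, now=reuse_after)


if __name__ == "__main__":
    # Source port reused while the half-closed entry still exists.
    print(syn_verdict(DEFAULT_CLOSING_TIMER, 300))  # drop
    # Source port reused after the entry has been purged.
    print(syn_verdict(DEFAULT_CLOSING_TIMER, 901))  # accept
```

In this model, whether the SYN is accepted depends only on whether the source port is reused before or after the half-closed entry ages out.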

Resolution

Create a new session timer profile with the TCP Closing timer set to a value lower than the default of 15 minutes (for example, 2 minutes). The Firewall then purges half-closed connections sooner, so that new connections reusing the same source port are allowed.
The procedure is documented in the Broadcom Administration Guide under Create a Session Timer.