Symptoms:
Total Flow Capacity: 1000000 Current Flow Entries: 543120 [...]
In the above example, the NSX ESG is tracking 543,120 flows out of a capacity of 1,000,000 flows. Compare this figure with the traffic scale expected in the relevant environment.
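The comparison above can be sketched as a small utilization calculation. This is only an illustrative helper, not part of NSX: the function name flow_table_utilization is hypothetical, and it assumes the counters appear on one line with exactly the labels shown in the example output.

```python
import re


def flow_table_utilization(stats_line: str) -> float:
    """Return flow-table utilization as a fraction between 0 and 1.

    Assumes the labels 'Total Flow Capacity:' and 'Current Flow Entries:'
    appear as in the example output above.
    """
    capacity = re.search(r"Total Flow Capacity:\s*(\d+)", stats_line)
    current = re.search(r"Current Flow Entries:\s*(\d+)", stats_line)
    if capacity is None or current is None:
        raise ValueError("line does not contain the expected counters")
    total = int(capacity.group(1))
    if total == 0:
        raise ValueError("reported capacity is zero")
    return int(current.group(1)) / total


line = "Total Flow Capacity: 1000000 Current Flow Entries: 543120 [...]"
print(f"{flow_table_utilization(line):.1%}")  # prints 54.3%
```

Whether 54.3% is a concern depends on the expected traffic scale; the calculation only makes the ratio explicit.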
123: tcp 6 2553 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=10001 dport=10002 pkts=0 bytes=0 src=10.0.0.2 dst=10.0.0.1 sport=10002 dport=10001 pkts=0 bytes=0 [ASSURED] mark=0 rid=0 use=1
In the above example, the TCP flow between 10.0.0.1:10001 and 10.0.0.2:10002 is in the ESTABLISHED state and will time out in 2,553 seconds, but no traffic has been recorded in either direction (pkts=0 bytes=0).
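To make the fields of such an entry explicit, here is a minimal parsing sketch. It assumes TCP entries formatted exactly like the example line above (index, protocol, protocol number, remaining timeout, state, then key=value pairs); the function name parse_tcp_flow and the returned dictionary keys are hypothetical.

```python
import re

# Matches the leading fields of a TCP entry: "<index>: tcp 6 <timeout> <STATE>"
_HEAD = re.compile(r"\s*(\d+):\s+tcp\s+6\s+(\d+)\s+([A-Z_]+)\b")


def parse_tcp_flow(line: str) -> dict:
    """Extract index, remaining timeout, state and packet counters."""
    head = _HEAD.match(line)
    if head is None:
        raise ValueError("not a TCP flow entry in the expected format")
    # One pkts= counter per direction (original, then reply).
    packets = [int(n) for n in re.findall(r"\bpkts=(\d+)", line)]
    return {
        "index": int(head.group(1)),
        "timeout_s": int(head.group(2)),
        "state": head.group(3),
        "packets": packets,
        # "Empty" means counters are present and all of them are zero.
        "empty": bool(packets) and sum(packets) == 0,
    }


entry = (
    "123: tcp 6 2553 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=10001 "
    "dport=10002 pkts=0 bytes=0 src=10.0.0.2 dst=10.0.0.1 sport=10002 "
    "dport=10001 pkts=0 bytes=0 [ASSURED] mark=0 rid=0 use=1"
)
flow = parse_tcp_flow(entry)
print(flow["state"], flow["timeout_s"], flow["empty"])  # ESTABLISHED 2553 True
```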
To check the configured TCP timeout: show flowtimeouts
nf_conntrack_tcp_timeout_syn_sent = 30
nf_conntrack_tcp_timeout_syn_recv = 30
nf_conntrack_tcp_timeout_established = 21600
nf_conntrack_tcp_timeout_fin_wait = 20
nf_conntrack_tcp_timeout_close_wait = 60
nf_conntrack_tcp_timeout_last_ack = 30
nf_conntrack_tcp_timeout_time_wait = 30
nf_conntrack_tcp_timeout_close = 10
[...]
In the above example, the TCP timeout for Established flows is configured at 21,600 seconds.
To list the known flows: show flowtable
234: tcp 6 4253932 ESTABLISHED src=10.0.0.1 dst=10.0.0.3 sport=10011 dport=10013 pkts=0 bytes=0 src=10.0.0.3 dst=10.0.0.1 sport=10013 dport=10011 pkts=0 bytes=0 [ASSURED] mark=0 rid=0 use=1
In the above example, the TCP flow has a remaining timeout of 4,253,932 seconds, which is far greater than the configured TCP timeout.
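The check described above can be sketched as follows: read the configured Established timeout from the show flowtimeouts output, then list the ESTABLISHED TCP flows in the show flowtable output whose remaining timeout exceeds it. Because the timeout only decrements, any such flow is abnormal. The function names are hypothetical, and the sketch assumes both outputs were saved as text in the formats shown in the examples.

```python
import re


def established_timeout(flowtimeouts_output: str) -> int:
    """Read nf_conntrack_tcp_timeout_established from the timeouts output."""
    match = re.search(
        r"nf_conntrack_tcp_timeout_established\s*=\s*(\d+)", flowtimeouts_output
    )
    if match is None:
        raise ValueError("established timeout not found in output")
    return int(match.group(1))


def flows_exceeding_timeout(flowtable_output: str, limit: int) -> list:
    """Return (index, remaining_timeout) for ESTABLISHED TCP flows above limit."""
    pattern = re.compile(
        r"^\s*(\d+):\s+tcp\s+6\s+(\d+)\s+ESTABLISHED\b", re.MULTILINE
    )
    return [
        (int(index), int(remaining))
        for index, remaining in pattern.findall(flowtable_output)
        if int(remaining) > limit
    ]


timeouts = "nf_conntrack_tcp_timeout_established = 21600"
table = "\n".join([
    "123: tcp 6 2553 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=10001 dport=10002 pkts=0 bytes=0",
    "234: tcp 6 4253932 ESTABLISHED src=10.0.0.1 dst=10.0.0.3 sport=10011 dport=10013 pkts=0 bytes=0",
])
print(flows_exceeding_timeout(table, established_timeout(timeouts)))
# prints [(234, 4253932)]
```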
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
As part of the NSX Edge High Availability feature set, the flow table is synchronized from the active appliance to the standby appliance so that flows are not broken during an HA failover. When the formerly standby appliance becomes the active appliance, the flows are already known, so the stateful firewall is able to match the traffic to them.
The flow table synchronization leverages conntrackd. The intended behavior is for the flow table to be pushed from the active appliance to the standby appliance. The issue is caused by an unintended bidirectional synchronization, in which flow status on the active appliance is overwritten from the standby appliance.
In association with this synchronization issue, flows may be assigned a TCP timeout value that is greater than the configured value (the TCP timeout is a decrementing value). This contributes to the growth of the flow table, since such flows may not time out in a timely manner.
Empty flows, seen in the table with packet counter at 0, are expected in the following situations:
This issue is resolved in VMware NSX Data Center for vSphere 6.4.8.
Workaround:
Disabling either of the below features, or both, removes the conntrackd synchronization:
If neither of these options is possible, contact Broadcom Support to work around this issue and note this Article ID in the problem description.