The following symptoms are observed on the NSX Manager and Edge nodes during excessive firewall publish operations.
1. The Edge transport node may show as Down or Unknown in the NSX Manager UI. However, Edge connectivity to the Manager and Controller, as well as the PNIC/Bond and Tunnel status, remains Up.
2. CLI commands related to the Edge datapath may fail on the affected Edge nodes with an error similar to the following example:
edge> get bridge
% An unexpected error occurred: Failed to get bridge port. The dataplane service is in error state, has failed or is disabled
3. The Edge syslog shows long block times for the dp-ipc thread and long firewall apply times, as shown below:
yyyy-mm-ddThh:mm:ss <edge> NSX 14357 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 64000 ms waiting for dp-ipc43 to quiesce
yyyy-mm-ddThh:mm:ss <edge> NSX 14357 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewall" tname="dp-ipc43" level="INFO"] Firewall apply total: 90902 msec wait/done 0/1
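The two log signatures above can be searched for with a small shell helper. This is a minimal sketch rather than a supported tool: the function names show_fw_stalls and max_fw_apply_msec are hypothetical, and the default log path /var/log/syslog is an assumption, so pass the actual Edge log file as the first argument if it differs.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default log path are assumptions.

# Print every dp-ipc quiesce wait and every firewall apply duration line.
show_fw_stalls() {
    local logfile="${1:-/var/log/syslog}"
    grep -E 'blocked [0-9]+ ms waiting for dp-ipc[0-9]* to quiesce|Firewall apply total: [0-9]+ msec' "$logfile"
}

# Print the largest "Firewall apply total" value (in msec) found in the log.
# Prints nothing if the log contains no such line.
max_fw_apply_msec() {
    local logfile="${1:-/var/log/syslog}"
    grep -oE 'Firewall apply total: [0-9]+ msec' "$logfile" \
        | grep -oE '[0-9]+' \
        | sort -n \
        | tail -n 1
}
```

After sourcing the file, running show_fw_stalls with a log path lists the matching entries, and max_fw_apply_msec reports the worst apply time seen, which helps when comparing Edge nodes.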
NSX-T 4.2.1.x
This issue may occur when more than 80K IP addresses are realized on the Edge datapath and the NSX-T environment has frequent configuration churn related to Edge firewall rule publishes or Security Group add, delete, or update operations.
The number of realized IPs on the Edge datapath can be checked with the commands below.
On a live NSX-T Edge node:
1. Use the command below to save the overall firewall ruleset to a file:
edge-appctl -t /var/run/vmware/edge/dpd.ctl fw/show ruleset > fw-if-ruleset
2. Use the commands below to count the overall IPs:
For IPv4
grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' fw-if-ruleset | wc -l
For IPv6
grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' fw-if-ruleset | wc -l
OR
On an NSX-T Edge support bundle:
1. Generate an Edge support bundle and locate the edge/fw-ruleset file in the Edge support logs.
2. Use the commands below to count the overall IPs:
For IPv4
grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' fw-if-ruleset | wc -l
For IPv6
grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' fw-if-ruleset | wc -l
Example:
For IPv4
root@Edge-4:~# edge-appctl -t /var/run/vmware/edge/dpd.ctl fw/show ruleset > fw-if-ruleset
root@Edge-4:~# grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' fw-if-ruleset | wc -l
180213
For IPv6
root@Edge-4:~# grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' fw-if-ruleset | wc -l
7920
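The two counts can also be added together and compared against the 80K figure in one step. The sketch below reuses the grep patterns from the steps above unchanged. The function names count_ipv4, count_ipv6, and check_ip_total are hypothetical, and the default limit of 80000 is simply the 80K threshold written out as a number.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; patterns are the ones used above.

# Count IPv4 entries in a saved ruleset file.
count_ipv4() {
    grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$1" | wc -l | tr -d ' '
}

# Count IPv6 entries in a saved ruleset file.
count_ipv6() {
    grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' "$1" | wc -l | tr -d ' '
}

# Print both counts and their sum; return non-zero if the sum exceeds the limit.
# $1 = ruleset file, $2 = limit (defaults to 80000).
check_ip_total() {
    local file="$1" limit="${2:-80000}"
    local v4 v6 total
    v4=$(count_ipv4 "$file")
    v6=$(count_ipv6 "$file")
    total=$((v4 + v6))
    echo "ipv4=$v4 ipv6=$v6 total=$total limit=$limit"
    [ "$total" -le "$limit" ]
}
```

Calling check_ip_total fw-if-ruleset after saving the ruleset prints the individual and combined counts, and the exit status indicates whether the total is within the limit, so the check can be scripted across several Edge nodes.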
Impact:
This issue does not affect datapath traffic. However, the Edge status in the NSX-T UI shows as Down or Unknown and takes longer to recover.
Firewall apply time and dp-ipc thread blocking will be optimized in a future version.
Recommendation:
Broadcom recommends keeping the overall IP count in the Edge datapath configuration at or below 80K. When the overall IP count exceeds 80K and configuration churn is frequent, the customer will experience this Edge Unknown/Down issue.