NSX-T Edge shows unknown/down state in the NSX UI and takes longer to recover

Article ID: 396621


Products

VMware vDefend Firewall

Issue/Introduction

The following symptoms are observed on the NSX Manager and the Edge node during excessive firewall publish activity.

1. The Edge transport node may show a down or unknown state in the NSX Manager UI. However, Edge connectivity to the Manager and Controller, the PNIC/Bond status, and the Tunnel Status remain UP.

2. CLI commands related to the Edge datapath may fail on the affected Edge node with an error similar to the following example:

edge> get bridge

% An unexpected error occurred: Failed to get bridge port. The dataplane service is in error state, has failed or is disabled

3. The Edge syslog shows a long block time waiting for the dp-ipc thread and a long firewall apply time, as shown below.

yyyy-mm-ddThh:mm:ss <edge> NSX 14357 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 64000 ms waiting for dp-ipc43 to quiesce

yyyy-mm-ddThh:mm:ss <edge> NSX 14357 FIREWALL [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="firewall" tname="dp-ipc43" level="INFO"] Firewall apply total: 90902 msec wait/done 0/1
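To check how often and how severely these two messages occur, the values can be pulled out of a saved syslog file. The following is a minimal sketch that assumes the message text matches the two sample lines above; the function names are illustrative and are not part of any NSX tooling.

```shell
#!/bin/sh
# Sketch: extract the two timing values shown in the sample log lines.
# Function names are hypothetical helpers, not NSX commands.

# Print every "blocked <N> ms waiting for dp-ipc..." duration, one per line.
blocked_ms() {
    grep -oE 'blocked [0-9]+ ms waiting for dp-ipc[0-9]*' "$1" | awk '{print $2}'
}

# Print every "Firewall apply total: <N> msec" duration, one per line.
fw_apply_ms() {
    grep -oE 'Firewall apply total: [0-9]+ msec' "$1" | awk '{print $4}'
}

# Print the largest number read from standard input (0 if there is none).
max_value() {
    awk 'BEGIN { m = 0 } $1 + 0 > m { m = $1 + 0 } END { print m }'
}
```

For example, `blocked_ms <syslog-file> | max_value` prints the longest dp-ipc block time found in the file, and `fw_apply_ms <syslog-file> | max_value` prints the longest firewall apply time. The location of the syslog file on the Edge is not stated in this article and must be supplied by the reader.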


Environment

NSX-T 4.2.1.x

Cause

This issue may occur when more than 80K IPs are realized on the Edge datapath and the NSX-T environment has frequent configuration churn, such as Edge firewall rule publishes or Security Group add, delete, or update operations.

The number of IPs realized on the Edge datapath can be checked with the commands below.

On a live NSX-T Edge node

1. Use the command below to save the overall firewall ruleset to a file

edge-appctl -t /var/run/vmware/edge/dpd.ctl fw/show ruleset > fw-if-ruleset

2. Use the commands below to count the overall IPs

For IPv4

grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' fw-if-ruleset | wc -l

For IPv6

grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' fw-if-ruleset | wc -l

OR

On an NSX-T Edge support bundle

1. Generate an Edge support bundle and locate the edge/fw-ruleset file in the extracted Edge support logs

2. Use the commands below to count the overall IPs (replace fw-if-ruleset with the path to the ruleset file from the bundle)

For IPv4

grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' fw-if-ruleset | wc -l

For IPv6

grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' fw-if-ruleset | wc -l


Example

For IPv4

root@Edge-4:~# edge-appctl -t /var/run/vmware/edge/dpd.ctl fw/show ruleset >fw-if-ruleset
root@Edge-4:~# grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' fw-if-ruleset | wc -l
180213

For IPv6

root@Edge-4:~# grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' fw-if-ruleset | wc -l
7920
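The two counts can be added together and compared with the 80K figure in one step. The following is a minimal sketch that reuses the exact grep patterns from this article; the function names and the IP_LIMIT variable are illustrative assumptions, and it requires a grep that supports the -P option.

```shell
#!/bin/sh
# Sketch: total the IPv4 and IPv6 matches in a saved ruleset file and
# compare the sum with the 80K guidance from this article.
# Function names and IP_LIMIT are hypothetical, not part of NSX tooling.

IP_LIMIT=${IP_LIMIT:-80000}

# Count IPv4 matches using the IPv4 pattern from this article.
count_ipv4() {
    grep -oP 'ip \K[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$1" | wc -l | tr -d ' '
}

# Count IPv6 matches using the IPv6 pattern from this article.
count_ipv6() {
    grep -oP 'ip \K([0-9a-fA-F:]+(?=(/|[\s])))' "$1" | wc -l | tr -d ' '
}

# Print a summary line; return 0 if the total is within IP_LIMIT, 1 otherwise.
check_ip_count() {
    v4=$(count_ipv4 "$1")
    v6=$(count_ipv6 "$1")
    total=$((v4 + v6))
    echo "ipv4=$v4 ipv6=$v6 total=$total limit=$IP_LIMIT"
    [ "$total" -le "$IP_LIMIT" ]
}
```

Running `check_ip_count fw-if-ruleset` prints the summary line and returns a non-zero exit status when the total exceeds the limit, which makes it usable in a periodic check.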


Impact:

This issue does not affect datapath traffic. However, the Edge status in the NSX-T UI shows down/unknown and takes longer to recover.

Resolution

The firewall apply time and the dp-ipc thread block time will be optimized in a future release.

Additional Information

Recommendation:

Broadcom recommends that the overall IP count in the Edge datapath configuration not exceed 80K. When the overall IP count exceeds 80K and the configuration churns frequently, the Edge unknown/down issue described above can occur.

  • Optimize the environment by reducing the overall IP count on the Edge to 80K or fewer and by reducing the frequency of firewall configuration changes
  • Delete unused NS-Groups/IP-sets to reduce the overall IP count