NSX Edge datapath service not working due to larger AppHA packets.

Article ID: 345786

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • BFD tunnels on the Edge are down.
  • Edge TEP is not reachable.
  • Oversized AppHA packets are logged in syslog. In entries like the following, the packet size appears at the end of the line (here 1976b). Any size above 1472b is a problem.
Run from /var/log on the Edge:
/var/log# grep "AppHA-tx-Bridge" syslog
2023-02-13T23:15:12.799Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00085,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b
2023-02-13T23:15:12.803Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00086,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b
2023-02-13T23:15:13.121Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00087,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b
2023-02-13T23:15:13.160Z 10-172-23-51 NSX 17 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="db-config" level="INFO"] AppHA-tx-Bridge(00088,00000): ANNO.REQ.0000000000:0000000000,peer=0c0f0304-9315-11ed-8c11-0050568d1c4b,1976b 
  • As an external symptom, Edge datapath CLI commands such as "get logical-routers" return no response.
2022-10-23T22:45:42.329Z <edge FQDN> NSX 6534 - [nsx@6876 comp="nsx-edge" subcomp="cli" username="admin" level="INFO"] CMD: get logical-routers
 
The following error is logged in /var/log/syslog on the Edge after the command is run:
2022-10-23T22:45:42.444603+00:00 <edge FQDN> NSX 6536 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="edge-appctl" s2comp="unixctl" level="WARN"] failed to connect to /var/run/vmware/edge/dpd.ctl 
  • Threads blocked waiting for a dp-ipc thread to quiesce appear in /var/log/syslog, and the blocked time keeps increasing. For example, in the log lines below, thread urcu2 is blocked for 4000 ms, then 8000 ms, then 16000 ms.
2022-10-23T21:51:05.725Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu2" level="WARN"] blocked 4000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:09.724Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu2" level="WARN"] blocked 8000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:17.725Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="urcu2" level="WARN"] blocked 16000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:24.979Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 1000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:25.978Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 2000 ms waiting for dp-ipc31 to quiesce
2022-10-23T21:51:27.978Z <Edge FQDN> NSX 4468 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 4000 ms waiting for dp-ipc31 to quiesce
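
The two log-based symptoms above can be checked with a short script. The following is a minimal sketch, not a supported tool: it assumes the syslog lines have the same shape as the samples in this article, and the function names (oversized_appha, max_blocked_ms) are hypothetical names chosen for illustration.

```python
import re

APPHA_LIMIT = 1472  # size threshold stated in this article

# Captures the trailing ",<size>b" of an AppHA-tx-Bridge entry
# (format assumed from the sample log lines in this article).
APPHA_RE = re.compile(r"AppHA-tx-Bridge\(.*,(\d+)b\s*$")

# Captures the waiting thread, the blocked time, and the thread waited on.
BLOCKED_RE = re.compile(
    r'tname="([^"]+)".*?blocked (\d+) ms waiting for (\S+) to quiesce'
)


def oversized_appha(lines, limit=APPHA_LIMIT):
    """Return the sizes of AppHA-tx-Bridge packets larger than limit."""
    sizes = []
    for line in lines:
        m = APPHA_RE.search(line)
        if m and int(m.group(1)) > limit:
            sizes.append(int(m.group(1)))
    return sizes


def max_blocked_ms(lines):
    """Map (waiting thread, thread waited on) to the longest blocked time."""
    worst = {}
    for line in lines:
        m = BLOCKED_RE.search(line)
        if m:
            key = (m.group(1), m.group(3))
            worst[key] = max(worst.get(key, 0), int(m.group(2)))
    return worst


def scan(path):
    """Read a syslog file and return both findings as a tuple."""
    with open(path, errors="replace") as f:
        lines = f.readlines()
    return oversized_appha(lines), max_blocked_ms(lines)
```

For example, scan("/var/log/syslog") run against a copy of the Edge syslog returns the list of oversized packet sizes and, for each pair of threads, the longest blocked time seen. A non-empty list together with blocked times that keep growing matches the symptoms described here.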


Cause

Large AppHA packets, which are used to exchange Bridge service HA status, reached the top of the retransmit timer heap. This caused the bfd thread to enter a busy loop, processing the same AppHA packet repeatedly while holding the bfd lock. The CLI is then blocked because the config thread also needs the bfd lock to process AppHA-related configuration.

Resolution

This issue is resolved in VMware NSX-T 3.2.3 (build number 21703624)
This issue is resolved in VMware NSX-T 4.0.2 (build number 20598727)
This issue is resolved in VMware NSX-T 4.1.1 (build number 21332673)

Workaround:
Put the affected Edge into Maintenance Mode and reboot it. 

Additional Information

This issue can be reproduced by increasing the size of the bridge service AppHA packets above 1472b, then toggling the Connected state of the vNICs of the Edge VM in vCenter. The bridge AppHA packet size can be artificially increased by adding Transport Zones to the Edges.


Impact/Risks:
Datapath failure on the NSX Edge devices.