NSX-T Bare Metal edge node dataplane services experience impact, such as VPN, LB traffic, BGP and tunnels
Article ID: 317762

Products

VMware NSX

Issue/Introduction

  • BGP sessions on the Tier-0 gateway hosted on a Bare Metal (BM) edge node are down.
  • TEP (Tunnel Endpoint) tunnels to other Transport Nodes (TNs) are down.
  • Data flowing through the edge node is impacted.
  • Load balancer(s) on gateways that use the edge are impacted.
  • VPN sessions may show "IPSec negotiation not started" or "Peer not responding".
  • The edge node log /var/log/syslog contains the following entries:
2023-01-10T10:25:15.107Z ######.####.local NSX 5061 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="intel-rte" level="WARN"] KNI: Out of memory
2023-01-10T10:25:15.291Z ######.####.local NSX 5061 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" level="INFO"] mempool exhausted, usage: 100, threshold: 85, pool: mbuf_pool_socket_0
  • The physical ports of the edge node show no RX_MISSES or RX_NOMBUFS errors. Run the following command for each physical NIC:
get physical-port <interface-name> stats
...
       NAME              : fp-eth0
       RX_MISSES         : 0
       RX_NOMBUFS        : 0
...
  • The packets-per-second rate through the edge node is very low:
get dataplane cpu stats | find Rx
      "rx": "10 pps",
      "rx": "0 pps",
      "rx": "0 pps",
      "rx": "10 pps",
      "rx": "10 pps",
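
The two syslog signatures listed above can be searched for in a copy of /var/log/syslog. The following is a minimal Python sketch, not a VMware-provided tool: the function name scan_syslog_lines and the regular expression are illustrative assumptions derived only from the two sample log lines in this article, so other message variants may not match.

```python
import re

# Signatures copied from the sample log entries in this article.
KNI_OOM = "KNI: Out of memory"
MEMPOOL_RE = re.compile(
    r"mempool exhausted, usage: (?P<usage>\d+), "
    r"threshold: (?P<threshold>\d+), pool: (?P<pool>\S+)"
)


def scan_syslog_lines(lines):
    """Return (kni_oom_count, mempool_events) for an iterable of log lines.

    mempool_events is a list of (pool, usage, threshold) tuples.
    Hypothetical helper for illustration only.
    """
    kni_oom_count = 0
    mempool_events = []
    for line in lines:
        if KNI_OOM in line:
            kni_oom_count += 1
        match = MEMPOOL_RE.search(line)
        if match:
            mempool_events.append(
                (match["pool"], int(match["usage"]), int(match["threshold"]))
            )
    return kni_oom_count, mempool_events


# The two sample entries from this article (hostname redacted as in the source).
sample = [
    '2023-01-10T10:25:15.107Z ######.####.local NSX 5061 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="intel-rte" '
    'level="WARN"] KNI: Out of memory',
    '2023-01-10T10:25:15.291Z ######.####.local NSX 5061 FABRIC '
    '[nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="stats" '
    'level="INFO"] mempool exhausted, usage: 100, threshold: 85, '
    'pool: mbuf_pool_socket_0',
]

kni_count, events = scan_syslog_lines(sample)
print(kni_count, events)
```

To scan a real log, pass an open file object instead of the sample list, for example a copy of the syslog taken from the edge node. A non-zero count alone does not confirm this issue; it should be read together with the other symptoms in this section.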
 

Running the top command as the root user on the edge node shows the Load Balancer (LB) dispatcher and KNI (Kernel NIC Interface) processes each polling at 100% CPU:

PID    USER  PR  NI  VIRT     RES     SHR    S  %CPU   %MEM  TIME+     TGID   COMMAND
56730  root  20  0   65.498g  172964  55264  R  1572   0.1   16371:58  56730  /opt/vmware/nsx-edge/sbin/datapathd --no-chdir --unixctl=/var/run/vmware/edge/dpd.ctl --pidfile=/var/run+
56782  lb    20  0   1436604  48308   44400  R  100.0  0.0   1019:05   56782  /opt/vmware/nsx-edge/sbin/lb-dispatcher --no-chdir --pidfile=/var/run/vmware/edge/dispatcher.pid -vconso+
57835  root  20  0   0        0       0      R  100.0  0.0   1019:02   57835  [kni_single]



Environment

VMware NSX-T Data Center 3.x
VMware NSX 4.x

Cause

Starting with NSX-T 3.2.0, core 0 is also used for KNI and LB requests; previously it was reserved for the control priority queue only. Due to an issue, packet loss was introduced when these services (KNI and LB) were using core 0.

Resolution

This issue is resolved in NSX-T Data Center 3.2.2 and VMware NSX 4.0.0.1.
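
The version boundaries stated in this article (behavior introduced in 3.2.0, fixed in 3.2.2 and 4.0.0.1) can be expressed as a simple comparison. The sketch below is illustrative only: the function names are hypothetical, it accepts plain dotted version strings without build numbers, and it assumes that any 4.x version earlier than 4.0.0.1 is affected, which this article implies but does not state explicitly.

```python
def parse_version(text):
    """Convert a dotted version string such as '3.2.1' to a 4-tuple of ints."""
    parts = [int(part) for part in text.split(".")]
    # Pad to four components so '3.2' and '3.2.0.0' compare as equal.
    parts += [0] * (4 - len(parts))
    return tuple(parts[:4])


def is_affected(version):
    """Return True if the version falls in the range described in this article.

    Assumption: 4.x versions below 4.0.0.1 are treated as affected.
    """
    v = parse_version(version)
    if v[0] == 3:
        # Introduced in 3.2.0, resolved in 3.2.2.
        return (3, 2, 0, 0) <= v < (3, 2, 2, 0)
    if v[0] == 4:
        # Resolved in 4.0.0.1.
        return v < (4, 0, 0, 1)
    return False


for candidate in ("3.1.3", "3.2.0", "3.2.1.2", "3.2.2", "4.0.0.1"):
    print(candidate, is_affected(candidate))
```

The result is only a first filter based on the version number; the symptoms listed in the Issue/Introduction section still need to be present before applying the workaround.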

Workaround:
Either restart the impacted edge node, or log in to the edge node as admin and restart the dataplane service using the following command:

restart service dataplane

Data traversing the edge node will be interrupted until the service restart completes.