BGP/IPsec connections flapping due to high CPU utilization on Edges

Article ID: 314388

Updated On:

Products

VMware NSX

Issue/Introduction

Symptoms:

  • BGP/IPsec connections flap every 3-5 minutes
  • BGP down errors

2024-02-28T01:03:45.964Z my-edge-03 NSX 23431 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxxxxxxxxxxxxxxxx" tid="23462" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bgp_down"] In Router xxxxxxxxxxxxxxxxxxxxxx, BGP neighbor xxxxxxxxxxxxxxxxxxxxxx (x.x.x.x) is down, reason: Network or config error.

  • Receive ring buffer overflow warnings

2024-02-27T13:44:53.582Z my-edge-03 NSX 30827 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxxxxxxxxxxxxxxxx" tid="30978" level="WARNING" eventState="On" eventFeatureName="edge_health" eventSev="warning" eventType="edge_nic_out_of_receive_buffer"] Edge NIC fp-eth0 receive ring buffer has overflowed by 53.544926% on Edge node xxxxxxxxxxxxxxxxxxxxxxxxxx. The missed packet count is 2080328 and processed packet count is 1804873.
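
The overflow percentage in this warning appears to be the missed packet count divided by the sum of missed and processed packets. That relationship is inferred from the numbers in the example and is not documented here, so treat it as an assumption. A minimal Python sketch that parses the message and recomputes the figure (the function and field names are illustrative):

```python
import re

# Message text copied from the example warning above.
LOG_LINE = (
    "Edge NIC fp-eth0 receive ring buffer has overflowed by 53.544926% "
    "on Edge node xxxxxxxxxxxxxxxxxxxxxxxxxx. The missed packet count is "
    "2080328 and processed packet count is 1804873."
)

PATTERN = re.compile(
    r"Edge NIC (?P<nic>\S+) receive ring buffer has overflowed by "
    r"(?P<pct>[\d.]+)%.*?missed packet count is (?P<missed>\d+) "
    r"and processed packet count is (?P<processed>\d+)"
)


def parse_overflow(line):
    """Return the NIC name, packet counts, and a recomputed overflow percentage."""
    m = PATTERN.search(line)
    if m is None:
        return None
    missed = int(m["missed"])
    processed = int(m["processed"])
    total = missed + processed
    # Assumption: overflow % = missed / (missed + processed) * 100.
    computed = 100.0 * missed / total if total else 0.0
    return {
        "nic": m["nic"],
        "reported_pct": float(m["pct"]),
        "missed": missed,
        "processed": processed,
        "computed_pct": computed,
    }


result = parse_overflow(LOG_LINE)
print(result)
```

For the example message, the recomputed value agrees with the reported percentage, meaning more than half of the packets arriving on fp-eth0 were dropped.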

  • High hit counts on SNAT rules. To list rules whose hit count has six or more digits: less ./edge/fw-ruleset | grep "[0-9][0-9][0-9][0-9][0-9][0-9] hits"

            "snat-stat": "rule xxxxxxxxxxxxx: 200369777 evals, 35940 active-sessions, in 612532 out 29184486 pkts, in 74741725 out 15795780431 bytes, 199432890 hits;", 

  • The same SNAT rules are configured with only one SNAT IP (in the example below, only 172.16.0.1 is configured)

./edge/fw-if-ruleset:1812: "snat": "rule xxxxxxxxxxxxxx at 1 out protocol any prenat from ip 192.168.1.0/24 to any snat ip 172.16.0.1; ",
./edge/fw-if-ruleset:1813: "snat-stat": "rule xxxxxxxxxxxxxx: 200369777 evals, 38215 active-sessions, in 612532 out 29184486 pkts, in 74741725 out 15795780431 bytes, 199432890 hits;",
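
The two checks above (high hit count, single SNAT IP) can be combined in a short script. The following Python sketch assumes the line layout shown in the excerpts and assumes that a "snat" line and its "snat-stat" line carry the same rule identifier; the function names and the 100,000-hit threshold (the equivalent of the six-digit grep pattern) are illustrative:

```python
import ipaddress
import re

# Patterns follow the line layout in the excerpts above (an assumption).
SNAT_RE = re.compile(
    r'"snat": "rule (?P<rule>\S+) at .*? snat ip (?P<target>[^;]+);'
)
STAT_RE = re.compile(
    r'"snat-stat": "rule (?P<rule>\S+): \d+ evals, '
    r'(?P<sessions>\d+) active-sessions, .*?(?P<hits>\d+) hits;'
)


def is_single_ip(target):
    """True only if the SNAT target is one plain IP address."""
    try:
        ipaddress.ip_address(target.strip())
        return True
    except ValueError:
        return False


def single_ip_hot_rules(lines, min_hits=100_000):
    """Return SNAT rules that translate to one IP and have at least min_hits hits."""
    targets, stats = {}, {}
    for line in lines:
        m = SNAT_RE.search(line)
        if m:
            targets[m["rule"]] = m["target"].strip()
            continue
        m = STAT_RE.search(line)
        if m:
            stats[m["rule"]] = (int(m["sessions"]), int(m["hits"]))
    return [
        {
            "rule": rule,
            "snat_ip": target,
            "active_sessions": stats[rule][0],
            "hits": stats[rule][1],
        }
        for rule, target in targets.items()
        if is_single_ip(target) and rule in stats and stats[rule][1] >= min_hits
    ]


SAMPLE = [
    './edge/fw-if-ruleset:1812: "snat": "rule xxxxxxxxxxxxxx at 1 out protocol '
    'any prenat from ip 192.168.1.0/24 to any snat ip 172.16.0.1; ",',
    './edge/fw-if-ruleset:1813: "snat-stat": "rule xxxxxxxxxxxxxx: 200369777 '
    'evals, 38215 active-sessions, in 612532 out 29184486 pkts, in 74741725 '
    'out 15795780431 bytes, 199432890 hits;",',
]

hot_rules = single_ip_hot_rules(SAMPLE)
for entry in hot_rules:
    print(entry)
```

Any SNAT target that does not parse as a single address is skipped, so the script reports only the rules that match the symptom described here.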

  • The output of get dataplane cpu stats regularly shows 100% usage on all cores

Tue Feb 27 2024 UTC 09:32:28.997
CPU Usage
Core : 0
Crypto : 0 pps
Intercore : 0 pps
Kni : 0 pps
Rx : 960 pps 
Slowpath : 10 pps
Tx : 420 pps
Usage : 100% -----> high CPU
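
When the command output is captured repeatedly, saturated cores can be picked out with a small parser. This Python sketch assumes the "Core : N" and "Usage : N%" layout from the excerpt above; the actual output may differ between NSX versions, and the second core in the sample text is hypothetical, added only for contrast:

```python
import re

CORE_RE = re.compile(r"Core\s*:\s*(\d+)")
USAGE_RE = re.compile(r"Usage\s*:\s*(\d+)%")


def saturated_cores(output, threshold=100):
    """Return the core numbers whose Usage value is at or above threshold."""
    cores = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        m = CORE_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = USAGE_RE.match(line)
        if m and current is not None and int(m.group(1)) >= threshold:
            cores.append(current)
    return cores


# Core 0 is taken from the excerpt; core 1 is a hypothetical idle core.
SAMPLE_OUTPUT = """\
CPU Usage
Core : 0
Rx : 960 pps
Slowpath : 10 pps
Tx : 420 pps
Usage : 100%
Core : 1
Rx : 15 pps
Slowpath : 0 pps
Tx : 12 pps
Usage : 12%
"""

busy = saturated_cores(SAMPLE_OUTPUT)
print(busy)
```

If every core appears in the returned list across several consecutive captures, the output matches the symptom described in this article.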


Environment

VMware NSX-T Data Center

Cause

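SNAT port exhaustion is the primary cause of the issue. When a heavily used SNAT rule translates to a single IP address, every session matching the rule competes for the translation ports of that one address. Under sustained load the ports run out, datapath CPU usage rises to 100%, and the BGP and IPsec sessions flap as a result.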

Resolution

Workaround 1: For each SNAT rule that has only one SNAT IP configured, add more SNAT IPs so that the translated traffic is spread across them.
Workaround 2: Reduce the traffic hitting the SNAT rule(s).
Workaround 3: Resize the Edge(s) to the Extra Large (XL) form factor.
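
For Workaround 1, a rough starting point for the number of SNAT IPs can be estimated from the active-sessions counter in the snat-stat output. The Python sketch below is illustrative only: the figure of 64512 usable ports per SNAT IP (ports 1024 to 65535) and the 50% target utilization are assumptions, not NSX-documented values, and the helper name is hypothetical.

```python
import math


def snat_ips_needed(active_sessions, ports_per_ip=64512, target_utilization=0.5):
    """Estimate how many SNAT IPs keep port usage under the target utilization.

    ports_per_ip and target_utilization are assumptions; adjust them to
    the limits that apply to your environment.
    """
    per_ip_budget = ports_per_ip * target_utilization
    return max(1, math.ceil(active_sessions / per_ip_budget))


# Active session count taken from the snat-stat example in this article.
estimate = snat_ips_needed(38215)
print(estimate)
```

The active-sessions value is a point-in-time snapshot, so sample it during peak traffic and allow headroom before settling on a final count.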