BGP/BFD flaps observed on Edge node resulting in active_bgp to hmap failed errorCode

Article ID: 418109

Products

VMware NSX

Issue/Introduction

BGP/BFD sessions flap between the Edge node and the uplink switches/routers, and the error "Adding active_bgp to hmap failed" with errorCode="EDG0200075" is reported.

The following log entries appear in the NSX Edge syslog during the BFD flap:

####-##-##T##:##:##.###Z <Hostname of NSX Edge VM> NSX 3519 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-bfd-dp" level="INFO"] BFD event from DP - Local_IP:<LocalBFD IP Addr> Remote_IP:<Remote BFD IP Addr> State: BFD-STATE-DOWN

####-##-##T##:##:##.###Z <Hostname of NSX Edge VM> start-stop-daemon 3507 - - ####-##-##T##:##:### rcpm 3519 rcpm-active-bgp [ERROR] Adding active_bgp to hmap failed errorCode="EDG0200075"
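As a rough aid for spotting these entries in a saved copy of the Edge syslog, the following Python sketch scans lines for the two messages shown above. The function name `scan_syslog` is hypothetical, and the patterns are derived only from the sample lines in this article, so they may need adjusting if the message format differs in other releases.

```python
import re

# Patterns derived from the sample log lines in this article (an assumption:
# real IP addresses contain no spaces, so \S+ is used for both fields).
BFD_DOWN = re.compile(
    r"BFD event from DP - Local_IP:(?P<local>\S+) Remote_IP:(?P<remote>\S+) "
    r"State: BFD-STATE-DOWN"
)
HMAP_ERROR = re.compile(
    r'Adding active_bgp to hmap failed errorCode="(?P<code>[^"]+)"'
)


def scan_syslog(lines):
    """Return (bfd_down_events, hmap_errors) found in an iterable of syslog lines.

    bfd_down_events: list of (timestamp, local_ip, remote_ip)
    hmap_errors:     list of (timestamp, error_code)
    """
    bfd_down, hmap_errors = [], []
    for line in lines:
        # The timestamp is the first space-separated field of each line.
        timestamp = line.split(" ", 1)[0]
        match = BFD_DOWN.search(line)
        if match:
            bfd_down.append((timestamp, match.group("local"), match.group("remote")))
            continue
        match = HMAP_ERROR.search(line)
        if match:
            hmap_errors.append((timestamp, match.group("code")))
    return bfd_down, hmap_errors
```

To use it, open the syslog file extracted from the support bundle and pass the file object to `scan_syslog`. Comparing the timestamps in the two returned lists shows whether the hmap errors line up with the BFD down events.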

In addition, BFD ring_full is observed on the NSX Edge nodes, and ring full is observed on the vNICs of the Edge VM that carry traffic.

Environment

VMware NSX 3.2.2.0.0.20737193

Resolution

If the alarms listed under Additional Information are present and the BFD ring full condition is observed on the NSX Edge nodes, increase the RX/TX ring size by following KB 330475.

To troubleshoot NIC errors or drops on the physical NICs, follow KB 341594.

Additional Information

For troubleshooting, check the NSX Edge syslog for the following alarms around the time of the issue.

Edge CPU usage high alarm

####-##-##T##:##:##.###Z <Hostname of NSX Edge VM> NSX 3555 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="CRITICAL" eventFeatureName="edge_health" eventType="edge_cpu_usage_very_high" eventSev="critical" eventState="On"] The CPU usage on Edge node <UUID of NSX Edge VM> has reached 98% which is at or above the very high threshold value of 80%.

Edge datapath CPU usage high alarm

####-##-##T##:##:##.###Z <Hostname of NSX Edge VM> NSX 3555 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING" eventFeatureName="edge_health" eventType="edge_datapath_cpu_high" eventSev="warning" eventState="On"] The datapath CPU usage on Edge node <UUID of NSX Edge VM> has reached 92.13% which is at or above the high threshold for at least two minutes.
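The two alarms above can also be pulled out of a saved syslog with a short script. This is a minimal sketch, assuming the alarm lines follow the format of the samples in this article. The function name `find_cpu_alarms` is hypothetical, and only lines containing the phrase "has reached N%" are matched.

```python
import re

# Matches the two alarm types shown in the samples above and captures the
# event state and the reported CPU percentage.
ALARM = re.compile(
    r'eventType="(?P<event>edge_cpu_usage_very_high|edge_datapath_cpu_high)"'
    r'.*?eventState="(?P<state>[^"]+)"'
    r".*?has reached (?P<pct>\d+(?:\.\d+)?)%"
)


def find_cpu_alarms(lines):
    """Return a list of dicts describing CPU alarms found in syslog lines."""
    alarms = []
    for line in lines:
        match = ALARM.search(line)
        if match:
            alarms.append(
                {
                    # The timestamp is the first space-separated field.
                    "timestamp": line.split(" ", 1)[0],
                    "event": match.group("event"),
                    "state": match.group("state"),
                    "percent": float(match.group("pct")),
                }
            )
    return alarms
```

Alarm timestamps that fall close to the BFD-STATE-DOWN entries suggest that CPU pressure on the Edge coincided with the flap.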

Also check the NSX Edge VM support bundle, under the folder /var/log/vmware, for any CPU or memory spikes.

In such cases, to determine whether the issue is caused by high traffic and to collect more information from NSX Edges running NSX 3.2.x releases, the script in the following KB can be installed on the NSX Edge nodes to collect dp-stats: https://knowledge.broadcom.com/external/article?articleNumber=393772

Note: The dp-stats logs do not roll over automatically when the script is installed manually, so clear them periodically.
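One way to handle the periodic cleanup is a small script that removes files older than a chosen age. The sketch below is illustrative only: the function name `prune_old_logs` and its parameters are hypothetical, and the directory where the dp-stats logs are written is not stated in this article, so it must be supplied by the caller. The function defaults to a dry run that only lists candidates.

```python
import time
from pathlib import Path


def prune_old_logs(log_dir, max_age_days=7, pattern="*", dry_run=True):
    """Return files in log_dir older than max_age_days; delete them unless dry_run.

    log_dir is an assumption supplied by the caller: the article does not
    state where the dp-stats logs are stored.
    """
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # Only regular files whose last modification is before the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            stale.append(path)
            if not dry_run:
                path.unlink()
    return stale
```

Run it first with `dry_run=True` and review the returned list before deleting anything, since removing the wrong files on an Edge node could discard data needed for the investigation.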

Also collect ESX ADF data from the host while the Edge CPU load is high to check for any scheduling issues.