BGP peering goes down due to a process crash or hang

Article ID: 378429

Products

VMware NSX

Issue/Introduction

  • VMware NSX
  • Some T0/VRF BGP peerings go down unexpectedly, while other BGP peerings may remain up
  • NSX Manager reports a BGP down alarm
    <Date>T<Time>Z <Manager hostname> NSX 5443 MONITORING [nsx@6876 alarmId="<Alarm ID>" alarmState="OPEN" comp="nsx-manager" entId="<Ent ID>" errorCode="MP701099" eventFeatureName="routing" eventSev="HIGH" eventState="On" eventType="bgp_down" level="ERROR" nodeId="<Node UUID>" subcomp="monitoring"] In Router <Router UUID>, BGP neighbor <Neighbor ID> is down. Reason: Network or config error.
  • For BGP peers that are Up, routing changes may not be processed
  • On the admin shell of the Edge, BGP CLI commands hang and return no output, e.g.
    #get bgp neighbor summary
  • The BGP service may crash and generate a core file, as shown by the following syslog entry:
    <Date>T<Time>Z <Hostname> NSX 591121 - [nsx@6876 comp="nsx-edge" subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: /var/log/core/core.bgpd.1697551443.5514.160.6.gz
  • /var/log/vmware/top-mem.log shows that the BGP process (bgpd) has high memory usage that grows linearly over time:

Wed Sep ## 16:30:06 UTC 202#
15054 frr       20   0 7427796 6.944g   2520 R 200.0 22.2 127860:14 15054 /usr/lib/frr/bgpd -d -A 127.0.0.1

Wed Sep ## 17:30:07 UTC 202#
15054 frr       20   0 7427796 6.965g   2524 R 200.0 22.3 127972:59 15054 /usr/lib/frr/bgpd -d -A 127.0.0.1

Wed Sep ## 18:30:07 UTC 202#
15054 frr       20   0 7427796 6.981g   2524 R 200.0 22.4 128056:36 15054 /usr/lib/frr/bgpd -d -A 127.0.0.1
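The growth shown in the samples above can be quantified with a short script. The following is a minimal Python sketch, not a VMware tool. It assumes the lines follow the standard top column order (PID, USER, PR, NI, VIRT, RES, SHR, ...) and that RES is in KiB unless suffixed with m, g, or t. The function names are hypothetical.

```python
import re

# Sample lines copied from the top-mem.log excerpt above (timestamps omitted).
SAMPLE = """\
15054 frr       20   0 7427796 6.944g   2520 R 200.0 22.2 127860:14 15054 /usr/lib/frr/bgpd -d -A 127.0.0.1
15054 frr       20   0 7427796 6.965g   2524 R 200.0 22.3 127972:59 15054 /usr/lib/frr/bgpd -d -A 127.0.0.1
15054 frr       20   0 7427796 6.981g   2524 R 200.0 22.4 128056:36 15054 /usr/lib/frr/bgpd -d -A 127.0.0.1
"""

# Assumption: top reports memory in KiB and scales it with m/g/t suffixes.
UNITS_TO_KIB = {"": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def res_to_kib(field):
    """Convert a RES field such as '6.944g' or '2520' to KiB."""
    match = re.fullmatch(r"([\d.]+)([mgt]?)", field)
    if match is None:
        raise ValueError(f"unrecognized RES field: {field!r}")
    value, unit = match.groups()
    return float(value) * UNITS_TO_KIB[unit]


def bgpd_res_samples(text):
    """Return the RES value (KiB) of every bgpd line, in file order."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: field 5 is RES, field 12 is the command path.
        if len(fields) > 12 and fields[12].endswith("/bgpd"):
            samples.append(res_to_kib(fields[5]))
    return samples


def growth_mib(samples):
    """Return the RES growth between the first and last sample, in MiB."""
    return (samples[-1] - samples[0]) / 1024


if __name__ == "__main__":
    res = bgpd_res_samples(SAMPLE)
    steadily_growing = all(a < b for a, b in zip(res, res[1:]))
    print(f"samples={len(res)} growing={steadily_growing} "
          f"growth={growth_mib(res):.1f} MiB")
```

For the three samples shown, which are one hour apart, the resident memory of bgpd grows by roughly 38 MiB over two hours.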

Environment

VMware NSX 4.x
VMware NSX-T Data Center 3.2.x

Cause

This issue occurs when the main BGP thread gets stuck in a loop after referencing a stale pointer.
The BGP process eventually runs out of memory, crashes, and is restarted automatically. Any BGP peering that was down comes back up once the service restarts.
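To check whether a log bundle shows the crash described above, the syslog can be searched for the two entries listed under the symptoms. The following is a minimal Python sketch under stated assumptions: the regular expressions are derived only from the sample entries in this article, the alarm entry is shortened, and the function name is hypothetical.

```python
import re

# Sample entries based on the symptoms above; placeholders are left as-is
# and the alarm entry is shortened to the fields used here.
SAMPLE_LOG = [
    '<Date>T<Time>Z <Hostname> NSX 591121 - [nsx@6876 comp="nsx-edge" '
    'subcomp="node-mgmt" username="root" level="WARNING"] Core file generated: '
    '/var/log/core/core.bgpd.1697551443.5514.160.6.gz',
    '<Date>T<Time>Z <Manager hostname> NSX 5443 MONITORING [nsx@6876 '
    'alarmState="OPEN" eventType="bgp_down" level="ERROR"] In Router '
    '<Router UUID>, BGP neighbor <Neighbor ID> is down. '
    'Reason: Network or config error.',
]

# Group 1 is the core file path, group 2 the process name taken from it.
CORE_RE = re.compile(r"Core file generated: (\S+/core\.([^./\s]+)\.\S+)")
# Group 1 is the neighbor reported in a bgp_down alarm.
BGP_DOWN_RE = re.compile(r'eventType="bgp_down".*BGP neighbor (.+?) is down')


def scan_for_bgp_crash(lines):
    """Return (bgpd core file paths, neighbors reported down)."""
    bgpd_cores, down_neighbors = [], []
    for line in lines:
        core = CORE_RE.search(line)
        if core and core.group(2) == "bgpd":
            bgpd_cores.append(core.group(1))
        alarm = BGP_DOWN_RE.search(line)
        if alarm:
            down_neighbors.append(alarm.group(1))
    return bgpd_cores, down_neighbors


if __name__ == "__main__":
    cores, neighbors = scan_for_bgp_crash(SAMPLE_LOG)
    print("bgpd core files:", cores)
    print("neighbors reported down:", neighbors)
```

A bgpd core file together with bgp_down alarms in the same time window is consistent with this issue, but it does not by itself confirm the root cause.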

Resolution

This issue is resolved in VMware NSX 4.2.0, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

 

Workaround:

If an Edge is in a broken state with BGP down, the following steps will recover it in a planned manner:

  1. Under System->Fabric->Nodes, select the Edge and, from the Actions menu, put the Edge into Maintenance Mode.
    Note: Entering maintenance mode causes a failover of any active Gateway(s) on the Edge.
    Entering maintenance mode stops the BGP process and clears the error condition.
  2. From the Actions menu, take the Edge out of Maintenance Mode.
    Exiting maintenance mode starts the BGP process.

This workaround does not prevent the condition from recurring.