BGP on both edges went down during the NSX upgrade

Article ID: 421795

Products

VMware NSX

Issue/Introduction

  • After an Edge node transitions to the Active role, the High Availability (HA) state flaps between Edge nodes, resulting in intermittent traffic loss.
  • NSX Edge logs (specifically the datapathd logs) show the BFD session on the uplink interface repeatedly transitioning to the down state with the diagnostic "Neighbor Signaled Session Down", for example:

2025-##-####:##:##.#### edge-xxxx.xxxxx.xxxx NSX 1045207 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="bfd" tname="dp-bfd-mon4" level="INFO"] x.x.x.x->x.x.x.x/vlan: BFD state change: up->down "No Diagnostic"->"Neighbor Signaled Session Down".

  • The corresponding BGP session goes down almost immediately, triggering the HA failover/failback mechanism, which causes the continuous flapping:

2025:##:##.###Z edge-1... NSX 1044271 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP ##.##.##.##, state=BGP_DOWN
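
To confirm the pattern described above across a larger log export, the two log signatures can be counted per peer. The following is a minimal sketch, not a VMware tool: the regular expressions are assumptions derived only from the two sample lines in this article, and the function name `summarize` is hypothetical.

```python
import re
from collections import Counter

# Assumption: patterns are modelled on the two sample log lines in this
# article only; other NSX versions may format these messages differently.
BFD_RE = re.compile(
    r'(?P<src>\S+)->(?P<dst>\S+?)/\S+: BFD state change: '
    r'(?P<old>\w+)->(?P<new>\w+) "(?P<old_diag>[^"]*)"->"(?P<new_diag>[^"]*)"'
)
BGP_DOWN_RE = re.compile(r'Alarm for BGP (?P<peer>\S+), state=BGP_DOWN')


def summarize(lines):
    """Count BFD transitions to down (per peer and diagnostic) and BGP_DOWN alarms (per peer)."""
    bfd_down = Counter()
    bgp_down = Counter()
    for line in lines:
        m = BFD_RE.search(line)
        if m:
            if m.group("new") == "down":
                bfd_down[(m.group("dst"), m.group("new_diag"))] += 1
            continue
        m = BGP_DOWN_RE.search(line)
        if m:
            bgp_down[m.group("peer")] += 1
    return bfd_down, bgp_down
```

Passing the lines of an exported Edge log to `summarize` returns two counters. A peer that appears in both, with the diagnostic "Neighbor Signaled Session Down", matches the symptom described in this article.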

Environment

VMware NSX

Cause

The NSX Edge BFD implementation follows RFC 5880. Diagnostic code 3 ('Neighbor Signaled Session Down') is recorded when the Edge receives a BFD control packet in which the remote neighbor reports the session as down. It therefore indicates that the neighbor, not the Edge, tore down the BFD session, and the root cause is external to the NSX Edge.
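
For reference, the diagnostic text in the log can be mapped back to its RFC 5880 code. The table below lists the diagnostic codes 0 through 8 from RFC 5880, section 4.1. The helper `triage_hint` and its hint strings are hypothetical and only illustrate the reasoning in this section; they are not part of NSX.

```python
# Diagnostic codes as defined in RFC 5880, section 4.1.
BFD_DIAG = {
    0: "No Diagnostic",
    1: "Control Detection Time Expired",
    2: "Echo Function Failed",
    3: "Neighbor Signaled Session Down",
    4: "Forwarding Plane Reset",
    5: "Path Down",
    6: "Concatenated Path Down",
    7: "Administratively Down",
    8: "Reverse Concatenated Path Down",
}

# Reverse lookup by lower-cased name, so log text matches regardless of case.
_BY_NAME = {name.lower(): code for code, name in BFD_DIAG.items()}


def triage_hint(diag_text):
    """Return (code, hint) for a diagnostic string; code is None if unknown.

    The hints are illustrative assumptions, not vendor guidance.
    """
    code = _BY_NAME.get(diag_text.strip().lower())
    if code == 3:
        return code, "neighbor reported the session down: investigate the remote BFD peer"
    if code == 1:
        return code, "local detection timer expired: check the path for packet loss as well as the peer"
    return code, "no specific hint"
```

For the log line shown in this article, `triage_hint("Neighbor Signaled Session Down")` yields code 3, which points at the neighbor device rather than the Edge.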

Resolution

This issue is primarily a physical network problem and must be investigated on the immediate BGP/BFD neighbor.

Follow these troubleshooting guidelines:

  • Verify the BFD configuration and the BGP neighbor configuration on the physical switch.
  • Check for interface errors.
  • If BFD is not strictly required, disabling it for the affected BGP neighbor stops the rapid flapping and stabilizes the HA state. Use this only as a temporary measure when the neighbor device cannot be accessed immediately.

Additional Information

Troubleshooting Edge BFD Tunnels down