NSX network outage due to BGP route withdrawal from upstream BGP neighbors

Article ID: 422904

Products

VMware NSX

Issue/Introduction

  • You may experience a network outage affecting all or some traffic traversing a Tier-0 router.
  • One or more BGP routes, such as the default route (0.0.0.0/0), are missing from the routing table on the NSX Edge node(s) where the Tier-0 router is active.
  • The BGP neighborship remains established.
  • Log lines similar to the below are encountered on the NSX Edge node in /var/log/syslog
    In this example, the default route (0.0.0.0/0) is marked with 0 next hops and deleted.
    ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-nsxa" level="INFO"] Received prefix 0.0.0.0/0 with n_nexthops = 0 in table_id = 123 action = DELETE
  • Log lines similar to the below are encountered on the NSX Edge node in /var/log/frr/frr.log
    In this example, an UPDATE message is received from the BGP neighbor withdrawing the default route (0.0.0.0/0).
    "wlen" is the withdrawn routes length; a nonzero value means the UPDATE withdraws at least one prefix.
    BGP: ###.###.##.## rcvd UPDATE wlen 1 attrlen 0 alen 0
    BGP: group_announce_route_walkcb: afi=IPv4, safi=unicast, p=0.0.0.0/0
  • If the issue was only temporary, you will see the affected route re-added.
    • Log lines similar to the below are encountered on the NSX Edge node in /var/log/syslog
      In this example, the default route (0.0.0.0/0) is marked with 1 next hop and added.
      ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="rcpm-nsxa" level="INFO"] Received prefix 0.0.0.0/0 with n_nexthops = 1 in table_id = 123 action = ADD
    • Log lines similar to the below are encountered on the NSX Edge node in /var/log/frr/frr.log
      In this example, an UPDATE message is received from the BGP neighbor announcing the default route (0.0.0.0/0).
      "alen" is the length of the announced prefixes; a nonzero value means the UPDATE announces at least one prefix.
      BGP: ###.###.##.## rcvd UPDATE wlen 0 attrlen 28 alen 1
      BGP: group_announce_route_walkcb: afi=IPv4, safi=unicast, p=0.0.0.0/0

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
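
The syslog check described above can be sketched in Python. This is a minimal sketch, not a supported tool: the regular expression is derived only from the rcpm excerpt shown in this article, and the function name missing_prefixes is hypothetical. It reports every prefix whose most recent rcpm event was a DELETE, that is, a route that was withdrawn and not yet re-added.

```python
import re

# Pattern derived only from the rcpm excerpt in this article (assumption:
# other NSX releases may word or order these fields differently).
RCPM_RE = re.compile(
    r"Received prefix (?P<prefix>\S+) with n_nexthops = (?P<nexthops>\d+) "
    r"in table_id = (?P<table>\d+) action = (?P<action>ADD|DELETE)"
)


def missing_prefixes(lines):
    """Return sorted (table_id, prefix) pairs whose latest event is DELETE."""
    last_action = {}
    for line in lines:
        match = RCPM_RE.search(line)
        if match:
            key = (match["table"], match["prefix"])
            # Later lines overwrite earlier ones, so an ADD that follows
            # a DELETE clears the prefix from the result.
            last_action[key] = match["action"]
    return sorted(key for key, action in last_action.items() if action == "DELETE")
```

Passing the lines of /var/log/syslog from the affected Edge node returns an empty list if every deleted route was later re-added, and a non-empty list if a route is still missing.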

Environment

VMware NSX

Cause

The issue is caused by the upstream BGP neighbors sending UPDATE messages to withdraw the affected route.

The frr.log on the Edge nodes shows UPDATE messages received with a nonzero withdrawn length (wlen 1), which triggers the deletion of the prefix from the routing table.
Even a brief absence of such a route can disrupt network connectivity.
Any application that depends on the removed and re-added route may need to refresh its sessions or reconnect.
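
The value wlen 1 can be read as follows, under two assumptions: that wlen and alen report field lengths in octets, and that the session carries plain IPv4 unicast prefixes without ADD-PATH. In the BGP UPDATE format (RFC 4271), each prefix is encoded as one length octet followed by only as many octets as are needed to hold the prefix bits. The default route has a prefix length of 0, so it occupies a single octet. The helper name nlri_octets below is hypothetical.

```python
def nlri_octets(prefix_len: int) -> int:
    """Octets used to encode one IPv4 prefix in a BGP UPDATE (RFC 4271).

    One octet carries the prefix length; the prefix itself is then
    truncated to the minimum number of whole octets that hold its bits.
    """
    if not 0 <= prefix_len <= 32:
        raise ValueError("IPv4 prefix length must be between 0 and 32")
    return 1 + (prefix_len + 7) // 8
```

Under these assumptions, nlri_octets(0) is 1, which is consistent with a single withdrawal of 0.0.0.0/0 being logged as wlen 1 and a single announcement of it as alen 1. A larger value would indicate longer prefixes, more than one prefix in the same UPDATE, or both.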

Resolution

This is a condition that may occur in a VMware NSX environment.

This is not a defect or issue within the VMware NSX component. The NSX Edge nodes are correctly processing routing updates received from the upstream environment.

To resolve this issue, you must investigate the upstream physical network infrastructure to determine why the BGP neighbors are withdrawing the affected route (in the example above, the default route).
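
To narrow down which upstream neighbor to investigate first, the withdrawals recorded in frr.log can be counted per neighbor. The following is a minimal sketch that assumes the "rcvd UPDATE" line format shown in the excerpts of this article; the function name withdrawals_by_peer is hypothetical, and the neighbor is taken to be whatever token follows "BGP:".

```python
import re
from collections import Counter

# Pattern derived only from the frr.log excerpt in this article
# (assumption: the neighbor identifier is a single token without spaces).
UPDATE_RE = re.compile(
    r"BGP: (?P<peer>\S+) rcvd UPDATE "
    r"wlen (?P<wlen>\d+) attrlen (?P<attrlen>\d+) alen (?P<alen>\d+)"
)


def withdrawals_by_peer(lines):
    """Count received UPDATE messages with a nonzero wlen, per neighbor."""
    counts = Counter()
    for line in lines:
        match = UPDATE_RE.search(line)
        if match and int(match["wlen"]) > 0:
            counts[match["peer"]] += 1
    return counts
```

Calling withdrawals_by_peer on the lines of /var/log/frr/frr.log and then most_common() on the result lists the neighbors that sent the most withdrawals, which gives a starting point for the upstream investigation. The count covers all withdrawn prefixes, not only the default route.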