On pre-emptive failback, the new standby T0 router has BGP peering stuck in Active

Article ID: 336812

Products

VMware NSX

Issue/Introduction

Symptoms:
  • Tier-0 Logical Router configured in active-standby mode with pre-emptive mode enabled
  • After a failback event from the non-preferred node to the preferred node, BGP peerings on the new standby node are stuck in the Active state
  • During this time, BGP commands such as get bgp neighbor summary return no output
  • The issue resolves itself after a 20-minute timeout period, and the BGP sessions return to the Established state
  • This issue is observed only on T0 logical router failback, not on failover
  • The issue does not affect the data path because the impacted T0 router is in standby mode
  • Edge log messages similar to the following may be observed:
<179>1 2019-11-05T15:09:23.614Z EDGE NSX 904 - [nsx@6876 comp="nsx-edge" subcomp="agg-service" tid="1449" level="ERROR" errorCode="MPAERR_MSR_QUERY_BGP_NEIGHBOR"] [UpdateFrrBgpNeighbor] Cannot get bgp-neighbor for lrouter:
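
To check whether an edge node has logged this error, the log can be searched for the error code shown in the sample message. The following Python sketch is illustrative only and is not a VMware tool: the function name find_bgp_query_errors is hypothetical, and the sketch assumes that each log line starts with the same syslog-style prefix (priority, version, timestamp, hostname) as the sample above.

```python
import re

# Error code taken from the sample edge log message in this article.
ERROR_CODE = "MPAERR_MSR_QUERY_BGP_NEIGHBOR"

# Assumed line prefix, based on the sample: <PRI>VERSION TIMESTAMP HOSTNAME ...
LINE_RE = re.compile(r"^<\d+>\d+\s+(?P<timestamp>\S+)\s+(?P<host>\S+)\s")


def find_bgp_query_errors(lines):
    """Return a (timestamp, host) tuple for each line carrying the error code."""
    hits = []
    for line in lines:
        if f'errorCode="{ERROR_CODE}"' not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            hits.append((match.group("timestamp"), match.group("host")))
    return hits


sample = (
    '<179>1 2019-11-05T15:09:23.614Z EDGE NSX 904 - [nsx@6876 comp="nsx-edge" '
    'subcomp="agg-service" tid="1449" level="ERROR" '
    'errorCode="MPAERR_MSR_QUERY_BGP_NEIGHBOR"] [UpdateFrrBgpNeighbor] '
    "Cannot get bgp-neighbor for lrouter:"
)
print(find_bgp_query_errors([sample]))
# prints [('2019-11-05T15:09:23.614Z', 'EDGE')]
```

Lines that carry the error code but do not match the assumed prefix are skipped, so the result may undercount if the log format differs from the sample.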


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 2.x

Cause

The BGP and BFD processes both connect to the routing platform to query BFD updates and status. When the two queries arrive at exactly the same time, the routing platform serves only the BFD client. The BGP process then hangs until a 20-minute watchdog timeout restarts it, which resolves the issue.

Resolution

This is a known issue impacting VMware NSX-T Data Center 2.x.

Workaround:
To prevent this issue from occurring, pre-emptive mode can be disabled.

Alternatively, BGP sessions for the standby T0 recover automatically after the 20-minute timeout period.
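
When waiting for automatic recovery, a rough estimate of the recovery time can be derived by adding the 20-minute timeout to the time the problem started. The Python sketch below is only an approximation: it assumes that the timestamp of the first MPAERR_MSR_QUERY_BGP_NEIGHBOR log message is close to the moment the BGP process hung, which this article does not state. The function name estimated_recovery is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Timeout period stated in this article.
WATCHDOG_TIMEOUT = timedelta(minutes=20)


def estimated_recovery(first_error_ts: str) -> datetime:
    """Add the 20-minute timeout to a UTC timestamp such as 2019-11-05T15:09:23.614Z.

    Assumption: the first error message roughly marks the start of the hang.
    """
    started = datetime.strptime(first_error_ts, "%Y-%m-%dT%H:%M:%S.%fZ")
    return started.replace(tzinfo=timezone.utc) + WATCHDOG_TIMEOUT


print(estimated_recovery("2019-11-05T15:09:23.614Z").isoformat())
# prints 2019-11-05T15:29:23.614000+00:00
```

If the BGP sessions have not returned to the Established state well after the estimated time, the assumption about the start of the hang may not hold for that event.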