Troubleshooting BGP on NSX-T Edge Nodes

Article ID: 339392

Products

VMware NSX Networking

Issue/Introduction

This article explains how to troubleshoot BGP on NSX-T Edge nodes.


Symptoms:

The BGP neighbor shows as down, the neighbor cannot be reached, or the connection from NSX-T never comes up.

You will also not see the default route.


There are two ways to diagnose this: in the UI and from the Edge node over SSH.
 

UI Troubleshooting -
In the UI, go to the Networking tab and make sure you are in Manager mode

Select the desired T0 Gateway > Action > Generate BGP Summary

This will show the BGP connection status


Edge Node CLI Troubleshooting -
Log in to the NSX-T Edge node as admin using SSH

As admin, check whether NSX is receiving the advertised routes
get route

Find the VRF number of the T0 service router by running
get logical-router

Enter the VRF of the logical router (replace # with the VRF number of the T0 service router)
vrf #

In the VRF, check the interfaces to confirm that the T0 uplink IP address and netmask allow it to communicate with the core router
get interfaces

The output of get interfaces shows the IP address and netmask. To confirm that the routers can reach each other in the VRF, ping the core router gateway
ping ip-address-of-gateway
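
The netmask check above can also be done off-box before pinging. The following is a minimal sketch, assuming you copy the interface address and prefix length from the get interfaces output by hand. The function name same_subnet is hypothetical, and the addresses are documentation-range placeholders (RFC 5737), not values from any real environment.

```python
import ipaddress


def same_subnet(local_ip_with_prefix: str, peer_ip: str) -> bool:
    """Return True if peer_ip falls inside the subnet of the local interface."""
    # ip_interface keeps both the host address and the network it belongs to.
    local_if = ipaddress.ip_interface(local_ip_with_prefix)
    return ipaddress.ip_address(peer_ip) in local_if.network


# Placeholder values: T0 uplink 192.0.2.2/30, candidate gateway addresses.
print(same_subnet("192.0.2.2/30", "192.0.2.1"))  # gateway inside the /30
print(same_subnet("192.0.2.2/30", "192.0.2.9"))  # gateway outside the /30
```

If the function returns False, the uplink and the core router gateway are not in the same subnet, so a directly connected BGP session would not be expected to come up until the addressing is corrected.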

In the VRF, check the BGP neighbor status for any neighbors that are down or in a state other than Established
get bgp neighbor summary
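
As a rough aid for reading the neighbor states, the sketch below flags every neighbor that is not Established and attaches a general hint. The state names come from the BGP finite-state machine in RFC 4271. The hints are generic BGP guidance rather than NSX-specific diagnostics, and the input is assumed to be (neighbor, state) pairs transcribed by hand, since this sketch does not attempt to parse the CLI output. States are matched by unambiguous prefix on the assumption that the CLI may abbreviate them; check this against your own output.

```python
# BGP finite-state-machine states as defined in RFC 4271.
RFC4271_STATES = ["idle", "connect", "active", "opensent", "openconfirm", "established"]

# General BGP guidance per state (assumption: not NSX-specific).
HINTS = {
    "idle": "No session attempt in progress; check that the neighbor is enabled and reachable.",
    "connect": "TCP connection to port 179 has not completed; check reachability and firewalls.",
    "active": "Still trying to acquire the TCP session; check peer address, port 179 and TCP MD5 password.",
    "opensent": "OPEN sent, waiting for the peer's OPEN; check AS numbers and capabilities.",
    "openconfirm": "OPEN accepted, waiting for KEEPALIVE; check for packet loss and timers.",
}


def normalize_state(raw):
    """Map a full or abbreviated state string to its RFC 4271 name, or None."""
    s = raw.strip().lower()
    matches = [name for name in RFC4271_STATES if s and name.startswith(s)]
    return matches[0] if len(matches) == 1 else None


def neighbors_needing_attention(neighbors):
    """Return (ip, raw_state, hint) for every neighbor that is not Established."""
    flagged = []
    for ip, raw_state in neighbors:
        state = normalize_state(raw_state)
        if state == "established":
            continue
        hint = HINTS.get(state, "Unrecognized state; compare with the CLI output.")
        flagged.append((ip, raw_state, hint))
    return flagged


# Placeholder neighbors (documentation-range addresses).
for row in neighbors_needing_attention([("192.0.2.1", "Established"), ("192.0.2.5", "Active")]):
    print(row)
```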

Check the forwarding table of the VRF
get forwarding

After looking at the VRF, look at the Edge node logs to see what other data is available

Switch to the root user and enter the root password when prompted
st en

Change to the /var/log/ directory
cd /var/log/

In /var/log/, create a text file containing every Edge node log line that mentions the IP address of the BGP neighbor
grep -i -r 'ip-address-of-bgp-neighbor-down' * > ip-address-of-bgp-neighbor-down.txt

Open this file with less to see everything NSX has logged about its relationship with the BGP neighbor
less ip-address-of-bgp-neighbor-down.txt
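
The collected file can be large, so it may help to see which log files mention the neighbor most often before reading it line by line. The sketch below assumes the usual recursive grep output format, where each line starts with the source path followed by a colon. The function name count_hits_per_file and the sample lines are hypothetical placeholders, not real NSX log content.

```python
from collections import Counter


def count_hits_per_file(grep_lines):
    """Count matches per source log file in `grep -r` output (path:line format)."""
    counts = Counter()
    for line in grep_lines:
        path, sep, _ = line.partition(":")
        if sep:  # skip lines without a "path:" prefix, e.g. binary-file notices
            counts[path] += 1
    # Most frequently matching files first.
    return counts.most_common()


# Placeholder lines standing in for the real file contents.
sample = [
    "syslog:<log line mentioning the neighbor>",
    "syslog:<another log line mentioning the neighbor>",
    "example.log:<log line mentioning the neighbor>",
]
for path, hits in count_hits_per_file(sample):
    print(path, hits)
```

To run it against the real output, pass an open file handle instead of the sample list, for example open("ip-address-of-bgp-neighbor-down.txt", errors="replace").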


Upstream Troubleshooting -

BGP can go down anywhere between the two ends of the connection. Between them there is usually switching infrastructure and, eventually, another core router. Investigate the logs on the other core router in the same way as above to find data about the BGP status, and check whether anything is hung in the switches.

Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 4.x
VMware NSX-T Data Center 3.x

Cause

The possible causes of a BGP neighbor being down are:

  • Hung BGP state

  • Hung BGP service

  • Host-level issue

  • Edge-level issue

  • Infrastructure-related issues between the ESXi host and the BGP neighbor

  • Issues on the external BGP neighbor itself

To determine the cause, work through the resolution steps: the step that brings the neighbor back up indicates the cause.

Clearing the state at each point in the connection shows whether stale state is the cause. If it is, the neighbor comes up as soon as the state is cleared.

Restarting the BGP service clears the connection and refreshes the BGP state throughout NSX-T.

Moving to another host can help to determine if there is a host-level issue.

Rebooting the Edge node refreshes everything in NSX-T. This often provides a quick resolution, but it does not yield good data for a root cause investigation. Root cause is found by going through each point in the BGP connection while the connection is down, restarting services and clearing state at each point to determine where things are actually hung. If BGP came back up after an Edge node reboot and that data was not collected, root cause cannot be provided.

Rebooting the Active Edge node in an Active/Standby configuration causes the Standby node to become Active.

There can also be issues with the connection throughout the switching infrastructure and on the core router handling the BGP neighbor.

It is important to know whether the issue is recurring. If it has happened only once and the connection had been up for quite some time, it is likely that something got hung and a refresh became necessary at some point in the connection.

Resolution

Because there are many reasons BGP can go down, there is no single resolution.

Instead, there are many possible scenarios, and it is up to the client/TSE to determine the best resolution for the situation.

Clear all BGP neighbor connections
clear bgp neighbors

Clear specific BGP neighbor state in NSX-T
clear bgp <ip-address>

Clear the state between the neighbors in the switching infrastructure and on the other core router (outside of NSX)

Stop BGP service and start it back up
stop service bgp
start service bgp


Place the Edge node into maintenance mode and then take it out again

Restart Edge Node

Move the Edge node to a host on which Edge nodes are known to work to see whether this is a host-level issue. If BGP comes back up after the move, investigate possible host-level issues on the original host.

If BGP is still down after following all of the above recommendations, the issue is most likely outside of NSX, in the switching infrastructure or on the other core router. Investigate that infrastructure for a resolution.
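
The relationship between the resolution steps and the likely cause can be sketched as a simple ordered walk. This is an illustration of the reasoning only, not an automation of the NSX CLI: the steps are listed in the order the article gives them, the cause labels paraphrase the Cause section, and apply_step and session_is_up are hypothetical callbacks standing in for the operator performing the step and rechecking the neighbor state.

```python
# Each step is paired with the cause it points to if it restores the session.
STEPS = [
    ("Clear BGP neighbor state on the Edge node", "stale BGP state in NSX-T"),
    ("Clear state on the switches and the other core router", "stale state outside NSX"),
    ("Stop and start the BGP service", "hung BGP service"),
    ("Move the Edge node in and out of maintenance mode", "Edge-level issue"),
    ("Restart the Edge node", "Edge-level issue (root cause data is lost)"),
    ("Move the Edge node to a known-good host", "host-level issue"),
]


def run_ladder(steps, apply_step, session_is_up):
    """Apply steps in order until the session is up.

    Returns (action, likely_cause) for the step that restored the session,
    or None if no step helped, which points outside of NSX.
    """
    for action, likely_cause in steps:
        apply_step(action)
        if session_is_up():
            return action, likely_cause
    return None


# Simulated run: pretend the session recovers after the third step.
applied = []
print(run_ladder(STEPS, applied.append, lambda: len(applied) == 3))
```

Recording which step restored the session, and collecting logs before each step, preserves the data needed for a later root cause investigation.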

Additional Information

How to Configure BGP

Impact/Risks:
When the connection from the T0 VRF to the BGP neighbor is down, all traffic flowing across it is down.