Troubleshooting NSX BGP

Products

VMware NSX

Issue/Introduction

When troubleshooting BGP sessions, there are several things to check and consider. This article examines the areas to verify, validate, and troubleshoot for a BGP session.

Environment

VMware NSX-T Data Center
VMware NSX

Cause

There are several reasons why BGP sessions may not get established. The following are the most common reasons:

No communication between peers.
Timer mismatch.
BFD configuration mismatch.

Resolution

When troubleshooting a BGP session, start with the following checks:

Identify which interfaces are involved in peering and which BGP states are involved.

Which BGP state (Idle; Connect; Active; OpenSent; OpenConfirm; Established) are the peers in, or cycling between?
- Check in the UI: is the peering between the T0 SR and a physical router?
- Has peering ever been stable in the Established state?
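As a rough illustration of how the observed state can steer the first checks, the sketch below maps each BGP state to areas covered in this article. The mapping is a triage heuristic assumed for illustration only; it is not NSX behaviour, and the names `FIRST_CHECKS`, `first_checks`, and `ever_established` are hypothetical.

```python
# Triage heuristic (an assumption, not NSX behaviour): which areas from this
# article to look at first for a given BGP state.
BGP_STATES = ("Idle", "Connect", "Active", "OpenSent", "OpenConfirm", "Established")

FIRST_CHECKS = {
    "Idle": "neighbor admin state, route to the peer",
    "Connect": "reachability to the peer, firewall rules for TCP port 179",
    "Active": "reachability to the peer, firewall rules for TCP port 179, password",
    "OpenSent": "neighbor address, AS number, remote AS",
    "OpenConfirm": "AS number, remote AS, keepalive and hold timers",
    "Established": "session is up; if it flaps, review timers, BFD and MTU",
}


def first_checks(state: str) -> str:
    """Return the suggested first checks for a state name (case-insensitive)."""
    by_lower = {name.lower(): name for name in BGP_STATES}
    name = by_lower.get(state.strip().lower())
    if name is None:
        raise ValueError(f"unknown BGP state: {state!r}")
    return FIRST_CHECKS[name]


def ever_established(history) -> bool:
    """True if any observed state in the history was Established."""
    return any(s.strip().lower() == "established" for s in history)


if __name__ == "__main__":
    print(first_checks("Active"))
    print(ever_established(["Idle", "Active", "Established", "Idle"]))
```

A session that has never reached Established usually points at connectivity or configuration, while one that reaches Established and then drops is more likely a stability problem.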
Places to check in the NSX-T UI
- Networking > Tier-0 Gateways > Click three dots ellipsis > Select 'Generate BGP Summary'
  - This shows all of the peering relationships which have been configured on a T0, and their Connection Status (BGP state)
- Networking > Tier-0 Gateways > expand BGP section > Click blue number of BGP Neighbors
  - Expand to show BFD / Keep Alive / Hold Timers
  - Select 'i' next to Status to see general peering information similar to Generate BGP Summary above
Commands used during troubleshooting within T0 VRF on Edge
- nsx-t-edge > get logical-routers
- Find the tier0_sr VRF ID in the output
- nsx-t-edge > vrf <t0_sr_vrf_id>
- nsx-t-edge(tier0_sr)> get route (check whether the routing table contains a route to the BGP neighbor)
- nsx-t-edge(tier0_sr)> get bgp neighbor summary
- nsx-t-edge(tier0_sr)> get bgp neighbor ipv4
- nsx-t-edge(tier0_sr)> get bgp neighbor advertised-routes (only if connection is in Established state)
- nsx-t-edge(tier0_sr)> ping <bgp_neighbor> (a successful ping indicates healthy underlay network)
  
  Note: Sometimes this ping may not be a true test, as ICMP may be blocked between neighbors.
  
  If the ping works and BGP is still down, check for firewall rules that may block BGP control packets, and confirm that both the local and remote BGP are configured correctly.
- Clear specific BGP neighbor state in NSX-T
  
  nsx-t-edge(tier0_sr)> clear bgp <ip-address>
- Clear all BGP neighbor connections
  
  nsx-t-edge(tier0_sr)> clear bgp neighbors
Retrieve Edge and Manager log bundles
- Edge log files to review
  - /var/log/frr/frr.log - grep for the remote peer IP, or for "NOTIFICATION" and "ADJCHANGE" to filter adjacency change activity when there are multiple peers.
  - /var/log/syslog - grep for "state=BGP" to view state changes.
  - <Edge bundle>/edge/frr_show_ip_bgp_neighbors_json
  - <Edge bundle>/edge/frr_show_ip_bgp_summary_json
  - <Edge bundle>/edge/tier0_sr_get_bgp_neighbor
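For offline review of an extracted log bundle, the grep step for frr.log can be approximated with a short script. This is a minimal sketch: the function name `filter_bgp_lines` is hypothetical, the sample lines are invented for illustration and do not reproduce the exact frr.log format, and the address matching only handles IPv4.

```python
import re

KEYWORDS = ("NOTIFICATION", "ADJCHANGE")


def filter_bgp_lines(lines, peer_ip=None, keywords=KEYWORDS):
    """Yield lines that contain one of the keywords, optionally for one peer.

    The peer address is matched as a whole IPv4 address, so 192.0.2.1
    does not also match 192.0.2.10.
    """
    peer_re = None
    if peer_ip is not None:
        peer_re = re.compile(r"(?<![\d.])" + re.escape(peer_ip) + r"(?!\d)")
    for line in lines:
        if peer_re is not None and not peer_re.search(line):
            continue
        if any(keyword in line for keyword in keywords):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Invented sample lines; real frr.log entries will look different.
    sample = [
        "%ADJCHANGE: neighbor 192.0.2.1 Down",
        "%ADJCHANGE: neighbor 192.0.2.10 Up",
        "keepalive received from 192.0.2.1",
        "%NOTIFICATION: sent to neighbor 192.0.2.1",
    ]
    for hit in filter_bgp_lines(sample, peer_ip="192.0.2.1"):
        print(hit)
```

To run it against a real file, pass an open file object, for example `filter_bgp_lines(open("frr.log"), peer_ip="<peer-ip>")`.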
Check for connectivity related issues:
- Check the VLAN on the segment/Edge logical uplink and the VLAN on the external peer (BGP neighbor) interface. If the VLAN configuration does not match, ping is expected to fail.
- Identify the correct VLAN to be configured and ensure it is configured on the edge segment/logical uplink and the interface on the external peer connecting to the edge.
- To check the VLAN configured on the edge uplink interface, retrieve the configuration of the segment that the uplink (T0 interface) is attached to, using the API:

GET /policy/api/v1/infra/segments/{segment-id}

Note: Replace the {segment-id} with the ID of the segment used for the uplink interface used for BGP.

- To find the BGP neighbor configuration, and therefore the address to ping, you can use the following API call in addition to the UI and CLI options above:

GET /policy/api/v1/infra/tier-0s/<tier-0-id>/locale-services/<locale-service-id>/bgp/neighbors

Note: Replace <tier-0-id> with the ID of the T0 you are investigating.

Replace <locale-service-id> with the locale-service ID for the T0, usually default.
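The two API calls above can be scripted when several segments or gateways need checking. The sketch below only builds the request paths and reads fields out of already-fetched JSON; it makes no network calls. The helper names are hypothetical, and the response field names `vlan_ids`, `results`, and `neighbor_address` are assumptions about the Policy API schema that should be confirmed against the API guide for your NSX version.

```python
from urllib.parse import quote

POLICY_BASE = "/policy/api/v1/infra"


def segment_path(segment_id: str) -> str:
    """Path for GET /policy/api/v1/infra/segments/{segment-id}."""
    return f"{POLICY_BASE}/segments/{quote(segment_id, safe='')}"


def bgp_neighbors_path(tier0_id: str, locale_service_id: str = "default") -> str:
    """Path for the BGP neighbors listing of a Tier-0 gateway."""
    return (
        f"{POLICY_BASE}/tier-0s/{quote(tier0_id, safe='')}"
        f"/locale-services/{quote(locale_service_id, safe='')}/bgp/neighbors"
    )


def segment_vlans(segment: dict) -> list:
    """VLAN IDs of a segment (assumed field name: vlan_ids)."""
    return list(segment.get("vlan_ids", []))


def neighbor_addresses(listing: dict) -> list:
    """Neighbor addresses from a listing (assumed fields: results, neighbor_address)."""
    return [
        entry["neighbor_address"]
        for entry in listing.get("results", [])
        if "neighbor_address" in entry
    ]


if __name__ == "__main__":
    # Hypothetical IDs and response bodies, for illustration only.
    print(segment_path("uplink-seg"))
    print(bgp_neighbors_path("t0-gw"))
    print(segment_vlans({"vlan_ids": ["100"]}))
    print(neighbor_addresses({"results": [{"neighbor_address": "192.0.2.1"}]}))
```

The extracted VLAN ID can then be compared with the VLAN configured on the external peer interface, and the neighbor addresses give the targets for the ping test from the T0 SR VRF.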

Check for MTU-related issues
- Check the MTU setting on the TOR interface connected to the physical NIC of the DVS uplink which provides connectivity to the Tier-0 uplink.
- Refer to Guidance to Set Maximum Transmission Unit.
- Follow procedures in the KB article (Addressing Common NSX Underlying Infrastructure Connectivity Issues) to address common NSX underlying infrastructure connectivity issues.
Check for configuration-related issues
- For the configured BGP neighbor, verify that the neighbor address, AS number, remote AS, keepalive timer, hold timer, and password (if configured) are correct on both the edge node and the external peer.
- Ensure the neighbor admin state is enabled.
- To verify the neighbor configuration, use the API 'GET /policy/api/v1/infra/tier-0s/<tier-0-id>/locale-services/<locale-service-id>/bgp/neighbors'.
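The configuration checks above amount to comparing each side's view of the session. The following sketch shows one way to do that comparison once the values have been collected by hand from the edge and the external peer. The `BgpSide` structure and `find_mismatches` function are hypothetical, and flagging any timer difference follows this article's guidance rather than protocol requirements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BgpSide:
    """One side's BGP settings, collected manually (hypothetical structure)."""
    local_address: str
    local_as: int
    neighbor_address: str
    remote_as: int
    keepalive: int
    hold: int
    password: Optional[str] = None
    admin_enabled: bool = True


def find_mismatches(edge: BgpSide, peer: BgpSide) -> List[str]:
    """Return human-readable descriptions of settings that do not line up."""
    problems = []
    if edge.neighbor_address != peer.local_address:
        problems.append("edge neighbor address does not match peer address")
    if peer.neighbor_address != edge.local_address:
        problems.append("peer neighbor address does not match edge address")
    if edge.remote_as != peer.local_as:
        problems.append("edge remote AS does not match peer local AS")
    if peer.remote_as != edge.local_as:
        problems.append("peer remote AS does not match edge local AS")
    if (edge.keepalive, edge.hold) != (peer.keepalive, peer.hold):
        problems.append("keepalive/hold timers differ")
    if edge.password != peer.password:
        problems.append("passwords differ")
    if not edge.admin_enabled:
        problems.append("neighbor admin state is disabled on the edge")
    return problems


if __name__ == "__main__":
    # Illustrative values only.
    edge = BgpSide("192.0.2.2", 65000, "192.0.2.1", 65001, 60, 180)
    peer = BgpSide("192.0.2.1", 65001, "192.0.2.2", 65000, 60, 180)
    print(find_mismatches(edge, peer))
```

An empty list means the collected values agree; any entries point to the setting to correct first.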
Packet capture on the edge: packet captures help identify issues in packets transmitted and received by the edge node.
- Invoke the NSX CLI command 'get logical-routers'.
- Switch to the service router {sr_id} using the NSX CLI command 'vrf {vrf_id_of_service_router}'.
- Invoke the NSX CLI command 'get interface'.
- Identify the uplink interface ID for packet capture and exit out of the VRF.
- Invoke the command 'start capture interface <interface-name> [file <filename>] [count <packet-count>] [expression <expression>]'.
- To filter BGP packets, use the expression 'port 179' in the CLI.
- Note: Use packet captures only when the traffic rate is less than 100K pps.
  
  To check the traffic rate, invoke the command 'get dataplane cpu stats'
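To avoid typing mistakes when composing the capture command, the syntax shown above can be assembled programmatically. This is a small sketch: the helper names, the interface name, and the file name are hypothetical, and the rate value must be read manually from the 'get dataplane cpu stats' output, since its format is not parsed here.

```python
from typing import Optional

PPS_LIMIT = 100_000  # the 100K pps guidance from the note above


def build_capture_command(
    interface: str,
    filename: Optional[str] = None,
    count: Optional[int] = None,
    expression: Optional[str] = "port 179",
) -> str:
    """Assemble 'start capture interface ...' with the optional arguments."""
    parts = ["start capture interface", interface]
    if filename:
        parts += ["file", filename]
    if count is not None:
        parts += ["count", str(count)]
    if expression:
        parts += ["expression", expression]
    return " ".join(parts)


def capture_is_advisable(current_pps: int) -> bool:
    """True when the observed traffic rate is below the 100K pps guidance."""
    return current_pps < PPS_LIMIT


if __name__ == "__main__":
    # "uplink-1" and "bgp.pcap" are placeholder names.
    if capture_is_advisable(current_pps=20_000):
        print(build_capture_command("uplink-1", filename="bgp.pcap", count=500))
```

The resulting string is intended to be pasted into the NSX CLI on the edge node after exiting the VRF.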

Additional Information

Known Issues:

BGP flapping when the number of prefixes advertised increases

Resources/Documentation

Cisco BGP Essential Training	BGP Essential Training
VMware NSX-T Admin Guide	Configure BGP
BGP session diagnostics for troubleshooting BGP session flaps on NSX-T edge node	BGP session diagnostics for troubleshooting BGP session flaps on NSX-T edge node
NSX Reference Design	https://community.broadcom.com/viewdocument/nsx-reference-design-guide-42-v10

Logs:

Set debug logs on BGP

From inside the T0 VRF
1. set debug
2. set routing debug bgp all
3. get routing debug bgp

After debugging is complete, disable debug logs:

1. clear routing debug bgp all
2. clear debug

If you are contacting Broadcom support about this issue, please provide the following:

State of the BGP connection reported on peer device
Are you able to ping the peer device from the T0 SR?
How long has the session been reported down? Has it ever worked?
BGP configuration on peer device
State of the physical network

Handling Log Bundles for offline review with Broadcom support