When troubleshooting BGP sessions, there are a few things to check and consider. This article examines the different areas to verify, validate, and troubleshoot for a BGP session.
VMware NSX-T Data Center
VMware NSX
There are several reasons why BGP sessions may not get established. The following are the most common reasons:
When you start troubleshooting a BGP session, these are the first things to check and consider:
Identify which interfaces are involved in peering and which BGP states are involved.
Which BGP state (Idle; Connect; Active; OpenSent; OpenConfirm; Established) are the peers in, or cycling between?
Check in the UI: is the peering between the T0 SR and a physical router?
Has peering ever been stable in the Established state?
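The state questions above can be turned into a quick triage aid. The following is a minimal sketch, not an NSX feature: the state names are the standard BGP FSM states listed above, while the hint texts and the function names `first_check` and `has_ever_established` are illustrative assumptions.

```python
# Hedged sketch: map each BGP FSM state to a first thing to check.
# The hints are general BGP guidance, not output from any NSX tool.
BGP_STATE_HINTS = {
    "Idle": "session is not being attempted; check neighbor admin state and the route to the peer",
    "Connect": "waiting for the TCP connection; check reachability and TCP port 179",
    "Active": "TCP connection is being retried; check neighbor address, firewall rules, and password",
    "OpenSent": "OPEN sent, waiting for the peer's OPEN; check AS numbers on both sides",
    "OpenConfirm": "OPEN exchanged, waiting for KEEPALIVE; check timers and the peer's logs",
    "Established": "session is up; if it flaps, check timers, BFD, and the underlay",
}


def first_check(state: str) -> str:
    """Return '<State>: <hint>' for a BGP state name (case-insensitive)."""
    key = state.strip().lower()
    for name, hint in BGP_STATE_HINTS.items():
        if name.lower() == key:
            return f"{name}: {hint}"
    raise ValueError(f"unknown BGP state: {state!r}")


def has_ever_established(states) -> bool:
    """True if any observed state in the sequence is Established."""
    return any(s.strip().lower() == "established" for s in states)


if __name__ == "__main__":
    observed = ["Idle", "Connect", "Active", "Connect", "Active"]
    print(first_check(observed[-1]))
    print("ever established:", has_ever_established(observed))
```

A peer that cycles between Connect and Active without ever reaching Established usually points to a transport or configuration problem, whereas a peer that reaches Established and then drops points to stability issues such as timers or the underlay.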
Places to check in the NSX-T UI
Networking > Tier-0 Gateways > Click the three-dot ellipsis > Select 'Generate BGP Summary'
This shows all of the peering relationships which have been configured on a T0, and their Connection Status (BGP state)
Networking > Tier-0 Gateways > expand BGP section > Click blue number of BGP Neighbors
Expand to show BFD / Keep Alive / Hold Timers
Commands used during troubleshooting within T0 VRF on Edge
nsx-t-edge > get logical-router
Find the tier0_sr VRF ID in the output.
nsx-t-edge > vrf <t0_sr_vrf_id>
nsx-t-edge(tier0_sr)>
get route -> check whether a route exists in the routing table to reach the BGP peer (BGP neighbor)
nsx-t-edge(tier0_sr)> get bgp neighbor summary
nsx-t-edge(tier0_sr)> get bgp neighbor ipv4
nsx-t-edge(tier0_sr)> get bgp neighbor advertised-routes (only if connection is in Established state)
nsx-t-edge(tier0_sr)> ping <bgp_neighbor> (a successful ping indicates healthy underlay network)
Note: Sometimes this ping may not be a true test, as ICMP may be blocked between neighbors.
If the ping works and BGP is still down, check for firewall rules that may block BGP control packets, and confirm that BGP is configured correctly on both the local and remote sides.
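Because ICMP and BGP can be filtered independently, a TCP probe of port 179 can complement the ping test. The sketch below is an assumption-laden approximation: it runs from whatever host executes the script, not from the T0 SR, so it only tests the path from that host. The function name `probe_tcp` is hypothetical, and a peer may legitimately refuse connections from a source that is not a configured neighbor.

```python
import socket


def probe_tcp(host: str, port: int = 179, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'error: <detail>'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing accepted us.
        return "refused"
    except socket.timeout:
        # No answer at all: often a sign of filtering on the path.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; replace it with the BGP neighbor IP.
    print(probe_tcp("192.0.2.1", 179, timeout=2.0))
```

A 'timeout' result suggests packets are being dropped silently, which is consistent with a firewall rule, while 'refused' shows the peer is reachable at the TCP level.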
Retrieve Edge and Manager log bundles
Edge log files to review
/var/log/frr/frr.log - grep for the remote peer IP, or for "NOTIFICATION" and "ADJCHANGE" if there are multiple peers and you need to filter adjacency change activity.
/var/log/syslog - grep for "state=BGP" to view state changes
<Edge bundle>/edge/frr_show_ip_bgp_neighbors_json
<Edge bundle>/edge/frr_show_ip_bgp_summary_json
<Edge bundle>/edge/tier0_sr_get_bgp_neighbor
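The grep guidance for frr.log above can be sketched in Python for use on an extracted log bundle. This is a minimal illustration: the function name `filter_bgp_events` is hypothetical, and the sample lines are made up for the example and do not reproduce real frr.log output.

```python
import re
from typing import Iterable, Iterator, Optional

# Keywords the article suggests grepping for in frr.log.
EVENT_PATTERN = re.compile(r"NOTIFICATION|ADJCHANGE")


def filter_bgp_events(lines: Iterable[str], peer_ip: Optional[str] = None) -> Iterator[str]:
    """Yield lines containing NOTIFICATION or ADJCHANGE, optionally for one peer."""
    peer_pattern = None
    if peer_ip is not None:
        # Digit guards stop 192.0.2.1 from also matching 192.0.2.10.
        peer_pattern = re.compile(rf"(?<!\d){re.escape(peer_ip)}(?!\d)")
    for line in lines:
        if not EVENT_PATTERN.search(line):
            continue
        if peer_pattern is not None and not peer_pattern.search(line):
            continue
        yield line.rstrip("\n")


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "bgpd: %ADJCHANGE: neighbor 192.0.2.1 Down",
        "bgpd: %ADJCHANGE: neighbor 192.0.2.10 Up",
        "bgpd: %NOTIFICATION: sent to neighbor 192.0.2.1",
    ]
    for event in filter_bgp_events(sample, peer_ip="192.0.2.1"):
        print(event)
```

To run it against a real file, pass an open file handle, for example `filter_bgp_events(open("/var/log/frr/frr.log"), peer_ip="<peer-ip>")`.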
Check for connectivity related issues:
Check the VLAN on the segment/Edge logical uplink and the VLAN on the external peer (BGP neighbor) interface. If the VLAN configuration does not match, ping is expected to fail.
Identify the correct VLAN to be configured and ensure it is configured on the edge segment/logical uplink and the interface on the external peer connecting to the edge.
To check the VLAN configured on the uplink interface of the edge, check the configuration of the segment that the uplink (T0 interface) is attached to, using the API:
GET /policy/api/v1/infra/segments/{segment-id}
Note: Replace the {segment-id} with the ID of the segment used for the uplink interface used for BGP.
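The segment lookup above could be scripted as follows. This is a sketch under stated assumptions: the manager hostname, the credentials, and the helper names `segment_url`, `extract_vlan_ids`, and `fetch_json` are hypothetical; basic authentication is assumed; and the VLAN is assumed to be returned in a `vlan_ids` field of the segment object. Verify the field name against the API reference for your NSX version.

```python
import base64
import json
import ssl
import urllib.request
from urllib.parse import quote


def segment_url(manager: str, segment_id: str) -> str:
    """Build the Policy API URL for one segment."""
    return f"https://{manager}/policy/api/v1/infra/segments/{quote(segment_id, safe='')}"


def extract_vlan_ids(segment: dict) -> list:
    """Return the VLAN IDs of a parsed segment object (assumed 'vlan_ids' field)."""
    return list(segment.get("vlan_ids") or [])


def fetch_json(url: str, user: str, password: str, verify_tls: bool = True) -> dict:
    """GET a URL with basic authentication and parse the JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url, headers={"Authorization": f"Basic {token}", "Accept": "application/json"}
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Lab use only: skips certificate validation.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical manager name and segment ID.
    url = segment_url("nsx-manager.example.com", "edge-uplink-segment")
    print(url)
    # segment = fetch_json(url, "admin", "<password>")
    # print(extract_vlan_ids(segment))
```

Compare the returned VLAN IDs with the VLAN configured on the external peer's interface.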
To find the BGP neighbor configuration and identify which address to ping, in addition to the UI and CLI options above, you can use the following API call:
GET /policy/api/v1/infra/tier-0s/<tier-0-id>/locale-services/<locale-service-id>/bgp/neighbors
Note: Replace <tier-0-id> with the ID of the T0 whose BGP configuration you are investigating.
Replace <locale-service-id> with locale-service ID for the T0, usually default.
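As an illustration of the neighbor lookup, the sketch below builds the request URL and pulls the peer addresses out of a parsed response. The helper names, the manager hostname, and the Tier-0 ID are hypothetical. It assumes the response is a list result with a `results` array whose entries carry a `neighbor_address` field; confirm this against the API reference for your NSX version.

```python
from urllib.parse import quote


def bgp_neighbors_url(manager: str, tier0_id: str, locale_service_id: str = "default") -> str:
    """Build the Policy API URL that lists BGP neighbors of a Tier-0."""
    return (
        f"https://{manager}/policy/api/v1/infra/tier-0s/{quote(tier0_id, safe='')}"
        f"/locale-services/{quote(locale_service_id, safe='')}/bgp/neighbors"
    )


def neighbor_addresses(response: dict) -> list:
    """Return the neighbor IPs from a parsed list response (assumed schema)."""
    return [
        entry["neighbor_address"]
        for entry in response.get("results", [])
        if "neighbor_address" in entry
    ]


if __name__ == "__main__":
    # Hypothetical manager and Tier-0 ID.
    print(bgp_neighbors_url("nsx-manager.example.com", "t0-prod"))
    # A hand-written response in the assumed shape, for illustration only.
    parsed = {"results": [{"neighbor_address": "192.0.2.1", "remote_as_num": "65001"}]}
    print(neighbor_addresses(parsed))
```

The addresses returned are the targets for the ping test from inside the T0 SR VRF.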
Check for configuration-related issues
For the configured BGP neighbor, verify that the neighbor address, local AS number, remote AS number, keepalive timer, hold timer, and password (if one is configured) are correct on both the edge node and the external peer.
Ensure the neighbor admin state is enabled.
To verify the neighbor configuration, use the API 'GET /policy/api/v1/infra/tier-0s/<tier-0-id>/locale-services/<locale-service-id>/bgp/neighbors'.
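Once the neighbor object has been retrieved, comparing it with the values expected by the external peer can be automated. The following is a minimal sketch: `config_mismatches` is a hypothetical helper, and the field names `remote_as_num`, `keep_alive_time`, and `hold_down_time` are assumptions about the neighbor schema that should be checked against your NSX version. The password is not compared here and should be verified manually on both sides.

```python
def config_mismatches(nsx_neighbor: dict, expected: dict) -> dict:
    """Compare an NSX neighbor object with expected peer-side values.

    Returns {field: (nsx_value, expected_value)} for every field that differs.
    Values are compared as strings because AS numbers may be returned as text.
    """
    differences = {}
    for field, wanted in expected.items():
        actual = nsx_neighbor.get(field)
        if actual is None or str(actual) != str(wanted):
            differences[field] = (actual, wanted)
    return differences


if __name__ == "__main__":
    # Hand-written example values, for illustration only.
    from_nsx = {
        "neighbor_address": "192.0.2.1",
        "remote_as_num": "65001",
        "keep_alive_time": 60,
        "hold_down_time": 180,
    }
    from_peer_team = {
        "neighbor_address": "192.0.2.1",
        "remote_as_num": 65002,
        "hold_down_time": 180,
    }
    for field, (actual, wanted) in config_mismatches(from_nsx, from_peer_team).items():
        print(f"{field}: NSX has {actual!r}, peer expects {wanted!r}")
```

Only the fields present in the expected dictionary are checked, so the comparison can start with the AS number and be extended as more peer-side values become known.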
Packet capture on the edge: packet captures help identify issues in the packets transmitted and received by the edge node.
Invoke the NSX CLI command 'get logical-routers'.
Switch to the service router {sr_id} using the NSX CLI command 'vrf {vrf_id_of_service_router}'.
Invoke the NSX CLI command 'get interface'.
Identify the uplink interface ID for packet capture and exit out of the VRF.
Invoke the command 'start capture interface <interface-name> [file <filename>] [count <packet-count>] [expression <expression>]'.
For filtering BGP packets, use the expression port 179 in the CLI.
Note: Use packet captures only when the traffic rate is less than 100K pps.
To check the traffic rate, invoke the command 'get dataplane cpu stats'
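The capture steps above can be sketched as a small helper that assembles the command string from the documented syntax and applies the 100K pps guard. The function names and the example interface ID are hypothetical; the script only builds text to paste into the NSX CLI and does not run anything on the edge.

```python
from typing import Optional

# Threshold from the note above: capture only below 100K packets per second.
MAX_CAPTURE_PPS = 100_000


def capture_allowed(current_pps: float) -> bool:
    """True if the observed traffic rate is below the capture threshold."""
    return current_pps < MAX_CAPTURE_PPS


def build_capture_command(
    interface: str,
    filename: Optional[str] = None,
    count: Optional[int] = None,
    expression: Optional[str] = "port 179",
) -> str:
    """Assemble 'start capture interface ...' with the optional arguments."""
    if not interface:
        raise ValueError("interface is required")
    parts = ["start", "capture", "interface", interface]
    if filename:
        parts += ["file", filename]
    if count is not None:
        if count <= 0:
            raise ValueError("count must be positive")
        parts += ["count", str(count)]
    if expression:
        parts += ["expression", expression]
    return " ".join(parts)


if __name__ == "__main__":
    observed_pps = 42_000  # read manually from 'get dataplane cpu stats'
    if capture_allowed(observed_pps):
        # '<uplink-interface-id>' stands for the ID found with 'get interface'.
        print(build_capture_command("<uplink-interface-id>", "bgp.pcap", 1000))
    else:
        print("Traffic rate too high for a packet capture")
```

Limiting the capture with a count and the port 179 expression keeps the file small and focused on BGP control traffic.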
Resources/Documentation | Link |
Cisco BGP Essential Training | BGP Essential Training |
VMware NSX-T Admin Guide | Configure BGP |
BGP session diagnostics for troubleshooting BGP session flaps on NSX-T edge node | BGP session diagnostics for troubleshooting BGP session flaps on NSX-T edge node |
NSX Reference Design | https://community.broadcom.com/viewdocument/nsx-reference-design-guide-42-v10 |
Set debug logs on BGP | From inside the T0 VRF. After debugging is complete, to disable debug logs: 1. clear routing debug bgp all 2. clear debug |
If you are contacting Broadcom support about this issue, please provide the following:
Handling Log Bundles for offline review with Broadcom support