Troubleshooting NSX TEP/BFD Tunnels between ESXi hosts and Edges

Article ID: 379112

Updated On:

Products

VMware NSX VMware vSphere ESXi

Issue/Introduction

When troubleshooting BFD tunnels between NSX components (hosts and Edges), a specific set of data must be gathered at the time of the event. This article describes what information is required and how to gather it before opening a support request with Broadcom.

 

NSX uses TEP tunnels for several important purposes:

  1. East/West communication between VMs on overlay networks and different hosts
  2. Access to Edges (VM and Bare Metal) for North/South services via Service Routers
  3. Edge High Availability - See Troubleshooting NSX Edge High Availability for more information
  4. Verify health of BFD tunnels from various components - View Bidirectional Forwarding Detection Status

 

Documentation on how TEP Tunnels work can be found at the following links:

Environment

VMware NSX

VMware vSphere ESXi

Resolution

Log Locations and Keywords:

  • NSX Edge
    • /var/log/syslog*
    • Relevant Edge Log Keywords
      • HA tunnel
      • (geneve) state updated from
      • Total tunnels:
      • Process DP BFD state update
  • NSX Prepared ESXi host
    • /var/log/vmkernel*
    • Relevant Host Log Keywords
      • diag: Control
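When log bundles are reviewed offline, the keywords above can be used to filter the extracted log files. The following is a minimal Python sketch of that filtering. The keyword strings are taken from the list above; the function name `matching_lines` and the shortened sample lines (using documentation-range IP addresses) are hypothetical and only for illustration.

```python
# Keywords copied from the list above; everything else is illustrative.
EDGE_KEYWORDS = (
    "HA tunnel",
    "(geneve) state updated from",
    "Total tunnels:",
    "Process DP BFD state update",
)
HOST_KEYWORDS = ("diag: Control",)


def matching_lines(lines, keywords):
    """Return the lines that contain at least one of the given keywords."""
    return [line for line in lines if any(k in line for k in keywords)]


# Hypothetical, shortened sample lines (not real log output).
sample = [
    "HA tunnel 192.0.2.35:192.0.2.39 state changed from Up to Unreachable",
    "unrelated message",
    "local: 192.0.2.34, remote: 192.0.2.23, oldState: up, newState: down, "
    "diag: Control Detection Time Expired, type: overlay",
]

edge_hits = matching_lines(sample, EDGE_KEYWORDS)
host_hits = matching_lines(sample, HOST_KEYWORDS)
print(len(edge_hits), len(host_hits))
```

For real files, an open file object can be passed in place of the sample list, for example `matching_lines(open("syslog"), EDGE_KEYWORDS)`, where `syslog` is a file extracted from the bundle.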

 

CLI Commands

  • ESXi hosts
    • nsxdp-cli bfd sessions list
      • Lists all TEP tunnels from the TEPs of this host to all other hosts and Edges. Flaps indicate either network instability or a TEP losing connectivity for a legitimate reason (power up/down, maintenance mode, etc.). Even in a stable environment with all endpoints up, the Flaps column will show at least 1.

    • vmkping -I vmk## -S vxlan -d -s 1572 <destination TEP IP>
      • Test network connectivity between two TEP endpoints from the ESXi host
        • vmkping = command
        • -I vmk## = choose which VMK interface to ping from (capital letter i, not lowercase L)
        • -S vxlan = choose the vxlan / geneve overlay network stack
        • -d = set the Don't Fragment (DF) bit
        • -s 1572 = set the ICMP payload size to 1572 bytes (the maximum allowed on a 1600 MTU network once 28 bytes of IPv4 and ICMP headers are added)
    • esxcli network firewall set -e 0
      • Temporarily disable the ESXi host's internal firewall to rule out any rules that may be dropping BFD traffic, then observe the state of the tunnels on the host.
        Once the ESXi firewall has been ruled out, re-enable it and validate its status:
        esxcli network firewall set -e 1
        esxcli network firewall get
  • NSX Edges
    • get bfd-sessions
      • Same as nsxdp-cli bfd sessions list on hosts
    • get bfd-sessions stats
      • Statistics regarding packets dropped and their reasons for each TEP tunnel.
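
The 1572-byte payload in the vmkping example above follows from the MTU minus the packet headers. The sketch below works through that arithmetic in Python, assuming IPv4 TEPs with a 20-byte IPv4 header (no options) and an 8-byte ICMP header. The function name `max_ping_payload` is hypothetical and is not part of any VMware tool.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption: IPv4 TEPs)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError("MTU is smaller than the IPv4 and ICMP headers")
    return payload


# 1600 - 20 - 8 = 1572, matching the -s value used with vmkping above.
print(max_ping_payload(1600))
```

If the TEP network uses a different MTU, the same subtraction gives the value to pass to -s when testing with the Don't Fragment bit set.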

 

Known configuration issues that can affect TEP tunnels

 

Known Issues with NSX BFD TEP Tunnels

 

Helpful Information regarding TEP Tunnel configurations and requirements

Additional Information

Log Line Analysis:

 

Edge /var/log/syslog*

  • 142585:2024-##-##T##:##:##.###Z <Edge-VM-Name01> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA tunnel 192.###.###.35:192.###.###.39 state changed from Up to Unreachable
    • These tunnel endpoints have experienced a BFD connectivity timeout. The tunnel has gone down because (non-exhaustive list of examples):
      • BFD information from the remote endpoint is not reaching the local endpoint due to
        • Incorrect VLAN Tagging in the physical environment
        • Double VLAN Tagging in the NSX Uplink Profile
        • Firewall in physical environment is blocking communication between TEPs
        • Router in physical environment is unable to send packets to TEP endpoints
      • The environment is busy and BFD packets are getting delayed or dropped between TEP endpoints
  • 142622:2024-##-##T##:##:##.###Z <Edge-VM-Name01> NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO"] HA tunnel 192.###.###.35:192.###.###.39 state changed from Unreachable to Up
    • These tunnel endpoints have begun receiving BFD information again. The tunnel is returning to functional status.
  • NOTE: Seeing these two log lines frequently and close together for the same pair of endpoints indicates network flapping or high latency at one endpoint or at a point in between.
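
To judge how often a given pair of endpoints flaps, the "HA tunnel ... state changed from ... to ..." lines shown above can be counted per endpoint pair. The following Python sketch assumes the message text has exactly the shape of the two example lines. The regular expression, the function name `count_up_to_unreachable`, and the sample lines (with documentation-range IP addresses) are hypothetical. The order of the two addresses in the message is not interpreted; they are simply kept as a pair.

```python
import re
from collections import Counter

# Pattern modeled on the example Edge syslog lines above (an assumption,
# not an official log format definition).
HA_TUNNEL_RE = re.compile(
    r"HA tunnel (?P<a>[^\s:]+):(?P<b>[^\s:]+) "
    r"state changed from (?P<old>\w+) to (?P<new>\w+)"
)


def count_up_to_unreachable(lines):
    """Count Up -> Unreachable transitions per endpoint pair."""
    counts = Counter()
    for line in lines:
        match = HA_TUNNEL_RE.search(line)
        if (
            match
            and match.group("old") == "Up"
            and match.group("new") == "Unreachable"
        ):
            counts[(match.group("a"), match.group("b"))] += 1
    return counts


# Hypothetical sample lines, shortened to the message portion.
sample_lines = [
    "HA tunnel 192.0.2.35:192.0.2.39 state changed from Up to Unreachable",
    "HA tunnel 192.0.2.35:192.0.2.39 state changed from Unreachable to Up",
    "HA tunnel 192.0.2.35:192.0.2.39 state changed from Up to Unreachable",
    "HA tunnel 192.0.2.35:192.0.2.40 state changed from Unreachable to Up",
]

flaps = count_up_to_unreachable(sample_lines)
for pair, count in flaps.most_common():
    print(pair, count)
```

A pair with a high count over a short period is a candidate for the flapping or latency investigation described in the note above.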

ESXi host /var/log/vmkernel*

  • 2024-##-##T##:##:##.###Z cpu59:2098707)BFD_HandleStatusChange:709:[nsx@6876 comp="nsx-esx" subcomp="bfd"]local: 192.###.###.34, remote: 192.###.###.23, oldState: up, newState: down, diag: Control Detection Time Expired, type: overlay
    • Log line indicating that the tunnel between the two listed IP addresses has gone down
  • 2024-##-##T##:##:##.###Z cpu36:2098706)BFD_HandleStatusChange:709:[nsx@6876 comp="nsx-esx" subcomp="bfd"]local: 192.###.###.34, remote: 192.###.###.23, oldState: down, newState: init, diag: Control Detection Time Expired, type: overlay
    • Log line indicating that the tunnel between the two listed IP addresses is coming up (connectivity has been restored)
  • 2024-##-##T##:##:##.###Z cpu36:2098706)BFD_HandleStatusChange:709:[nsx@6876 comp="nsx-esx" subcomp="bfd"]local: 192.###.###.34, remote: 192.###.###.23, oldState: down, newState: up, diag: No Diagnostic, type: overlay
    • Log line indicating that the tunnel is fully up and capable of processing TEP traffic
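
The fields in these BFD_HandleStatusChange lines (local, remote, oldState, newState, diag, type) can be extracted for offline review. The Python sketch below assumes the field layout is exactly as in the example lines above. The regular expression and the function name `parse_bfd_change` are hypothetical, and the sample input is the first example line as shown, with its masked values left in place.

```python
import re

# Pattern modeled on the example vmkernel lines above (an assumption,
# not an official log format definition).
BFD_CHANGE_RE = re.compile(
    r"local: (?P<local>[^,\s]+), remote: (?P<remote>[^,\s]+), "
    r"oldState: (?P<old_state>\w+), newState: (?P<new_state>\w+), "
    r"diag: (?P<diag>[^,]+), type: (?P<type>\w+)"
)


def parse_bfd_change(line):
    """Return the BFD state-change fields as a dict, or None if no match."""
    match = BFD_CHANGE_RE.search(line)
    return match.groupdict() if match else None


example = (
    '2024-##-##T##:##:##.###Z cpu59:2098707)BFD_HandleStatusChange:709:'
    '[nsx@6876 comp="nsx-esx" subcomp="bfd"]local: 192.###.###.34, '
    'remote: 192.###.###.23, oldState: up, newState: down, '
    'diag: Control Detection Time Expired, type: overlay'
)

event = parse_bfd_change(example)
print(event["old_state"], "->", event["new_state"], "|", event["diag"])
```

Filtering the parsed events on a new state of down gives the list of local and remote TEP pairs to test with vmkping.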

If you are contacting Broadcom Support about this issue, please provide the following:

  • Retrieve log bundles from all NSX Edges and all NSX prepared ESXi hosts with TEP/BFD Tunnels reporting down
  • Retrieve log bundles from all NSX Managers

Handling Log Bundles for offline review with Broadcom support