BFD flapped during a network switch upgrade. The spine switches (BGP/BFD peers) were being upgraded, and BFD went down during the spine upgrade.
Topology: Two-Tier Clos Architecture (Leaf-Spine) - See Additional Information for more details on this architecture.
NSX Edge establishes BGP neighborships with two peers: Peer A (1.1.1.1) and Peer B (2.2.2.2).
BGP Neighbor A IP : 1.1.1.1 | BGP Source IP : A.B.C.D
BGP Neighbor B IP : 2.2.2.2 | BGP Source IP : E.F.G.H
Route Uplink Uplink-IP
--------- ------------- -----------------
100.100.100.100 uplink-100 100.100.100.101/X
200.200.200.200 uplink-200 200.200.200.201/Y
BFD events from Edge's syslog
Log Location - /var/log/syslog
2024-06-15T21:24:43.768Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"1.1.1.1"}
2024-06-15T21:24:43.746Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"2.2.2.2"}
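To quickly locate these events, something like the following grep against the syslog can be used (the event string is taken from the entries above; include rotated files if needed):
grep -i 'bfd_down_on_external_interface' /var/log/syslog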
FRR logs
Log Location - /var/log/frr/frr.log
2024/06/15 21:24:43.746520 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-100] bfd status [Down] vrf [default] peer [1.1.1.1] local [A.B.C.D]
2024/06/15 21:24:43.757148 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-200] bfd status [Down] vrf [default] peer [2.2.2.2] local [E.F.G.H]
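The same BFD state changes can be pulled from the FRR log with a command along these lines (the message name is taken from the entries above):
grep 'zebra_ptm_handle_bfd_msg' /var/log/frr/frr.log | grep 'bfd status \[Down\]'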
BFD session
Commands -
get bfd-sessions
get bfd-session local-ip A.B.C.D remote-ip 1.1.1.1
This will show output similar to the following:
"local_address": "A.B.C.D",
"remote_address": "1.1.1.1",
.
"last_local_down_diag": "Control Detection Time Expired",
.
"last_up_time": "2024-06-15 21:25:11",
"last_down_time": "2024-06-15 21:24:43",
"last_up_time" - "last_down_time" = BFD session flap downtime.
In this example, BFD was down for 28 seconds.
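If needed, the downtime can be verified from the two timestamps, for example with GNU date in a Linux shell (assuming GNU date is available; the timestamps are taken from the output above):
echo $(( $(date -d '2024-06-15 21:25:11' +%s) - $(date -d '2024-06-15 21:24:43' +%s) ))
This prints 28, matching the 28-second downtime in this example.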
Edge's syslog
Log Location - /var/log/syslog
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 3667
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 3316
syslog.2.gz:2024-06-15T21:24:43.742Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 3657
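These datapath BFD latency warnings can be located, including in rotated/compressed syslog files, with a command such as the one below (the search string is taken from the entries above; zgrep generally reads both compressed and plain files):
zgrep -i 'dp-bfd.*exceeds maximum threshold' /var/log/syslog*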
Edge host vmkernel.log
Log Location - /var/run/log/vmkernel.log
var/run/log/vmkernel.log:2024-06-15T21:24:20.178Z cpu0:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:30.179Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:37.183Z cpu48:2097726)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:40.181Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
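These heartbeat wait messages can be pulled from the vmkernel log with a command such as (the string is taken from the log lines above):
grep 'Waiting for timed out HB' /var/run/log/vmkernel.log*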
Edge host vobd.log
Log Location - /var/run/log/vobd.log
Command -
esxcli vsan debug object list --vm-name=<vsan-vm-name>
Example -
[root@ESXi:~] esxcli vsan debug object list --vm-name=Edge | egrep 'Object|Used:|Path:'
Object UUID: 08f3665f-92c0-bd89-3441-0050569469ae
Used: 77.46 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/01f3665f-3262-e8ce-73c1-0050569469ae/Edge.vmdk (Exists)
Object UUID: 01f3665f-3262-e8ce-73c1-0050569469ae
Used: 1.47 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/Edge (Exists)
In the above example, 01f3665f-3262-e8ce-73c1-0050569469ae is the disk UUID of the Edge VM.
In the vobd logs, search for that UUID to see whether there is a heartbeat timeout and recovery.
Use a command such as the one below, substituting your UUID (obtained by following the instructions above) for 01f3665f-3262-e8ce-73c1-0050569469ae:
grep -i 'heartbeat.timedout.*01f3665f-3262-e8ce-73c1-0050569469ae' vobd.log
If a heartbeat timeout is occurring for that UUID, the grep should return entries similar to:
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522087155209us: [vob.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522261230788us: [esx.problem.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522114330979us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): [Timeout] [HB state xxxxxxxx offset 3178496 gen 19 stampUS 44522114326355 uuid yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy jrnl <FB 13> drv 24.82]
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522288406572us: [esx.problem.vmfs.heartbeat.recovered] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
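To capture the matching recovery events in the same search, an extended grep covering both message types can be used, again substituting your own UUID:
egrep -i 'heartbeat\.(timedout|recovered).*01f3665f-3262-e8ce-73c1-0050569469ae' vobd.log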
ESXi
NSX-T
NSX Data Center
Although the heartbeat failure is affecting NSX, the logging indicates a VMFS heartbeat failure, which points to an issue on the storage network connected to the ESXi host on which the Edge resides. Because of that, we need to look outside of NSX.
The VMFS heartbeat timeout error is seen between ESXi hosts in a cluster. In this example, these heartbeats are exchanged between hosts in the vSAN cluster over a pure Layer 2 network (no routing).
This is what points to the physical networking layer as the most likely culprit.
From the logs, Layer 2 networking on the VLAN used to exchange heartbeats between hosts caused a heartbeat timeout, and in this instance BGP/BFD experienced intermittent drops because of it.
There should be no network disconnects or timeouts on the Layer 2 network for the VLAN used to exchange VMFS heartbeats.
Shared storage is highly dependent on a healthy network for consistent/predictable performance.
Please see the Additional Information section for links that will help with troubleshooting VMFS heartbeat timeouts.
To understand network connection issues on storage networks, we recommend reading through these articles for a more complete understanding of the symptoms and possible resolutions:
Understanding lost access to volume messages in ESXi
Troubleshooting fibre channel storage connectivity
Troubleshooting ESXi connectivity to iSCSI arrays using software initiators