BFD flapped during a network switch upgrade. The spine switches (the BGP/BFD peers) were being upgraded, and BFD went down during the spine upgrade.
Topology: Two-Tier Clos Architecture (Leaf-Spine) - see the Additional Information section for more details on this architecture.
NSX Edge establishes BGP neighborship with two peers: Peer A (1.1.1.1) and Peer B (2.2.2.2).
BGP Neighbor A IP : 1.1.1.1 | BGP Source IP : A.B.C.D
BGP Neighbor B IP : 2.2.2.2 | BGP Source IP : E.F.G.H
Route Uplink Uplink-IP
--------- ------------- -----------------
100.100.100.100 uplink-100 100.100.100.101/X
200.200.200.200 uplink-200 200.200.200.201/Y
BFD events from Edge's syslog
Log Location - /var/log/syslog
2024-06-15T21:24:43.768Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"1.1.1.1"}
2024-06-15T21:24:43.746Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"2.2.2.2"}
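To pull these events quickly, the syslog can be filtered on the event type string shown above (a simple grep; adjust the path if the logs have rotated):
grep 'eventType="bfd_down_on_external_interface"' /var/log/syslog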
FRR logs
Log Location - /var/log/frr/frr.log
2024/06/15 21:24:43.746520 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-100] bfd status [Down] vrf [default] peer [1.1.1.1] local [A.B.C.D]
2024/06/15 21:24:43.757148 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-200] bfd status [Down] vrf [default] peer [2.2.2.2] local [E.F.G.H]
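Similarly, the FRR log can be filtered for BFD state transitions (zebra logs both Down and Up in this format):
grep 'bfd status' /var/log/frr/frr.log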
BFD session
Commands -
get bfd-sessions
get bfd-session local-ip A.B.C.D remote-ip 1.1.1.1
The output will include fields similar to the following:
"local_address": "A.B.C.D",
"remote_address": "1.1.1.1",
.
"last_local_down_diag": "Control Detection Time Expired",
.
"last_up_time": "2024-06-15 21:25:11",
"last_down_time": "2024-06-15 21:24:43",
"last_up_time" - "last_down_time" = BFD session flap downtime.
In this example, BFD was down for 28 seconds.
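To compute the delta rather than subtracting by hand, a quick shell sketch (assumes GNU date; timestamps taken from this example):
up=$(date -d '2024-06-15 21:25:11' +%s)
down=$(date -d '2024-06-15 21:24:43' +%s)
echo $((up - down))   # prints 28 (seconds the BFD session was down)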
Edge's syslog
Log Location - /var/log/syslog
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 3667
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 3316
syslog.2.gz:2024-06-15T21:24:43.742Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 3657
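Because these entries often sit in rotated logs (note the syslog.2.gz prefix above), zgrep is convenient since it reads compressed files directly:
zgrep 'exceeds maximum threshold' /var/log/syslog*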
Edge host vmkernel.log
Log Location - /var/run/log/vmkernel.log
var/run/log/vmkernel.log:2024-06-15T21:24:20.178Z cpu0:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:30.179Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:37.183Z cpu48:2097726)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:40.181Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
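These heartbeat stalls can be located by filtering the vmkernel log for the HBX wait message (include rotated logs if the window of interest is older):
grep 'Waiting for timed out HB' /var/run/log/vmkernel.log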
Edge host vobd.log
Log Location - /var/run/log/vobd.log
Command -
esxcli vsan debug object list --vm-name=<vsan-vm-name>
Example from lab -
[root@ESXi:~] esxcli vsan debug object list --vm-name=Edge | egrep 'Object|Used:|Path:'
Object UUID: 08f3665f-92c0-bd89-3441-0050569469ae
Used: 77.46 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/01f3665f-3262-e8ce-73c1-0050569469ae/Edge.vmdk (Exists)
Object UUID: 01f3665f-3262-e8ce-73c1-0050569469ae
Used: 1.47 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/Edge (Exists)
In the above example, 01f3665f-3262-e8ce-73c1-0050569469ae is the disk UUID of the Edge VM.
On the ESXi host, search /var/run/log/vobd.log for that disk UUID to see whether there is a heartbeat timeout and recovery.
Example -
grep -iE 'heartbeat\.(timedout|recovered).*01f3665f-3262-e8ce-73c1-0050569469ae' vobd.log
If a heartbeat timeout occurred on the disk UUID, the log entries should look similar to the following:
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522087155209us: [vob.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522261230788us: [esx.problem.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522114330979us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): [Timeout] [HB state xxxxxxxx offset 3178496 gen 19 stampUS 44522114326355 uuid yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy jrnl <FB 13> drv 24.82]
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522288406572us: [esx.problem.vmfs.heartbeat.recovered] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
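Note how the timestamps line up: the heartbeat timed out at 21:24:16 and was reclaimed at 21:24:43, which matches the BFD down events and the 28-second flap seen earlier. The same date arithmetic applies (assumes GNU date; timestamps from this example):
t1=$(date -d '2024-06-15 21:24:16' +%s)
t2=$(date -d '2024-06-15 21:24:43' +%s)
echo $((t2 - t1))   # prints 27 (approximate seconds the VMFS heartbeat was lost)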
An example of how to identify the VLAN used for vSAN traffic:
[root@esxa-01:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk2 1 IPv4 #.#.#.# 255.255.255.192 #.#.#.# 00:##:##:##:##:18 9000 65535 true STATIC defaultTcpipStack
[root@esxa-01:~] esxcfg-vswitch -l
DVS Name Num Ports Used Ports Configured Ports MTU Uplinks
RegionA01-VDS8 3220 12 512 8900 vmnic0,vmnic1
DVPort ID In Use Client
1 1 vmk2
[root@esxa-01:~] net-stats -l
PortNum Type SubType SwitchName MACAddress ClientName
67108877 3 0 DvsPortset-0 00:##:##:##:##:18 vmk2
[root@esxa-01:~] net-dvs -l
port 1:
com.vmware.common.port.alias = , propType = CONFIG
com.vmware.common.port.connectid = ########## , propType = CONFIG
com.vmware.common.port.portgroupid = dvportgroup-# , propType = CONFIG
com.vmware.common.port.block = false , propType = CONFIG
com.vmware.common.port.dvfilter = filters (num = 0):
propType = CONFIG
com.vmware.common.port.ptAllowed = 0x 0. 0 <repeats 3 times>
propType = CONFIG
<----output omitted---->
com.vmware.common.port.volatile.status = inUse linkUp portID=67108877 propType = RUNTIME
com.vmware.common.port.volatile.vlan = VLAN 1001
From the above outputs,
The vmkernel port used for vSAN traffic is vmk2.
vmk2 is connected to DVS RegionA01-VDS8 on DVPort ID 1.
Port ID 67108877 is associated with vmk2, and the VLAN associated with port ID 67108877 is 1001.
From this, we can infer that a Layer-2 networking issue on VLAN 1001 caused the VMFS heartbeat loss/timeout, and BGP/BFD fell victim to it.
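To extract the port-to-VLAN mapping in one pass instead of reading the full net-dvs output, a filter such as this can help (a convenience sketch built from the commands above; vmk2 is from this example):
net-stats -l | grep vmk2
net-dvs -l | grep -E 'port [0-9]+:|volatile.vlan'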
ESXi
VMware NSX
VMware NSX Data Center
Although the heartbeat failure affects NSX, the VMFS heartbeat failures in the logs point to an issue on the storage network connected to the ESXi host on which the Edge runs. Because of that, we need to look outside of NSX.
We know that the VMFS heartbeat timeout error is seen between ESXi hosts in a cluster. In this example, the heartbeats are exchanged between hosts in the vSAN cluster over a pure Layer-2 network (no routing).
This is what leads us to conclude that the physical network layer is the most likely culprit.
From what we can see, a Layer-2 networking issue on the VLAN used to exchange heartbeats between hosts caused a heartbeat timeout, and in this instance BGP/BFD flapped as a result.
There should be no network disconnects or timeouts on the Layer-2 network for the VLAN used to exchange VMFS heartbeats.
Shared storage is highly dependent on a healthy network for consistent/predictable performance.
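One quick way to validate Layer-2 health on the vSAN VLAN is a jumbo-frame vmkping between hosts on their vSAN vmkernel interfaces (a sketch; vmk2 is from this example, the peer address is a placeholder, and -s 8972 accounts for ICMP/IP header overhead on a 9000-byte MTU):
vmkping -I vmk2 -s 8972 -d <remote-host-vsan-ip>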
Please see the Additional Information section for links that will help with troubleshooting VMFS heartbeat timeouts.
To better understand network connection issues on storage networks, we recommend reading the following articles for a more complete picture of the symptoms and possible resolutions:
Understanding lost access to volume messages in ESXi
Troubleshooting fibre channel storage connectivity
Troubleshooting ESXi connectivity to iSCSI arrays using software initiators