BFD flapped during a network switch upgrade. The spine switches (the Edge's BGP/BFD peers) were being upgraded, and BFD went down during the spine upgrade.
Topology : Two-Tier Clos Architecture (Leaf-Spine)
The NSX Edge establishes BGP neighborship with two peers: Peer A (1.1.1.1) and Peer B (2.2.2.2).
BGP Neighbor A IP : 1.1.1.1 | BGP Source IP : A.B.C.D
BGP Neighbor B IP : 2.2.2.2 | BGP Source IP : E.F.G.H
Route             Uplink       Uplink-IP
---------------   ----------   -----------------
100.100.100.100   uplink-100   100.100.100.101/X
200.200.200.200   uplink-200   200.200.200.201/Y
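Before digging into logs, peer state can be confirmed from the NSX Edge CLI. A minimal sketch (the VRF number, prompt, and output format vary by NSX version and deployment):
edge> get logical-routers                  # note the VRF number of the Tier-0 service router
edge> vrf 1                                # 1 is an assumed VRF number for illustration
edge(tier0_sr)> get bgp neighbor summary   # per-peer BGP session state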
BFD events from Edge's syslog :
/var/log/syslog :
2024-06-15T21:24:43.768Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"1.1.1.1"}
2024-06-15T21:24:43.746Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"2.2.2.2"}
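These events can be pulled from the Edge's log bundle in one pass, rotated files included (path as shown above):
zgrep "bfd_down_on_external_interface" /var/log/syslog*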
FRR logs :
/var/log/frr/frr.log :
2024/06/15 21:24:43.746520 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-100] bfd status [Down] vrf [default] peer [1.1.1.1] local [A.B.C.D]
2024/06/15 21:24:43.757148 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-200] bfd status [Down] vrf [default] peer [2.2.2.2] local [E.F.G.H]
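The matching BFD state transitions can be extracted from the FRR log the same way:
grep "bfd status \[Down\]" /var/log/frr/frr.log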
BFD session :
edge> get bfd-sessions
edge> get bfd-session local-ip A.B.C.D remote-ip 1.1.1.1
"local_address": "A.B.C.D",
"remote_address": "1.1.1.1",
.
"last_local_down_diag": "Control Detection Time Expired",
.
"last_up_time": "2024-06-15 21:25:11",
"last_down_time": "2024-06-15 21:24:43",
"last_up_time" - "last_down_time" = BFD session flap downtime.
In this example, 21:25:11 - 21:24:43 = 28 seconds of BFD downtime.
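The same subtraction can be scripted on any GNU/Linux shell, purely as an illustration of the arithmetic (timestamps copied from the output above):
# 21:25:11 minus 21:24:43 -> prints 28 (seconds of BFD downtime)
echo $(( $(date -d "2024-06-15 21:25:11" +%s) - $(date -d "2024-06-15 21:24:43" +%s) ))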
Edge's syslog :
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 3667
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 3316
syslog.2.gz:2024-06-15T21:24:43.742Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 3657
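All three warnings share one signature and can be collected across rotated syslogs in a single pass. Note the intervals (~3.3-3.7 seconds): with typical BFD timers (for example, 500 ms intervals and a multiplier of 3, i.e. a 1.5-second detection time; actual timers depend on configuration), a stall of that length is more than enough to produce the "Control Detection Time Expired" diagnostic seen above:
zgrep "exceeds maximum threshold" /var/log/syslog*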
Edge host vmkernel.log :
var/run/log/vmkernel.log:2024-06-15T21:24:20.178Z cpu0:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:30.179Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:37.183Z cpu48:2097726)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:40.181Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
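The heartbeat-wait messages can be extracted directly on the Edge's ESXi host:
grep "Waiting for timed out HB" /var/run/log/vmkernel.log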
Edge host vobd.log :
To correlate the vobd.log entries below with the Edge VM, first identify the Edge VM's vSAN object UUIDs.
Command : esxcli vsan debug object list --vm-name=<vsan-vm-name>
Example :
[root@ESXi:~] esxcli vsan debug object list --vm-name=Edge | egrep 'Object|Used:|Path:'
Object UUID: 08f3665f-92c0-bd89-3441-0050569469ae
Used: 77.46 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/01f3665f-3262-e8ce-73c1-0050569469ae/Edge.vmdk (Exists)
Object UUID: 01f3665f-3262-e8ce-73c1-0050569469ae
Used: 1.47 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/Edge (Exists)
In the above example, 01f3665f-3262-e8ce-73c1-0050569469ae (the object whose path is the Edge VM's directory) is the disk UUID of the Edge VM.
In the vobd.log entries below, xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx stands for this Edge disk UUID.
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522087155209us: [vob.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522261230788us: [esx.problem.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522114330979us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): [Timeout] [HB state xxxxxxxx offset 3178496 gen 19 stampUS 44522114326355 uuid yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy jrnl <FB 13> drv 24.82]
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522288406572us: [esx.problem.vmfs.heartbeat.recovered] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
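The timeout/recovery pair can be pulled from vobd.log in one grep. Note the timeline: the heartbeat timed out at 21:24:16, was reclaimed at 21:24:43.734, and the BFD Down events above follow within milliseconds (21:24:43.74x):
grep -E "vmfs.heartbeat.(timedout|recovered)" /var/run/log/vobd.log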
Applicable products :
VMware NSX-T
VMware NSX Data Center
A VMFS heartbeat timeout error is seen between the ESXi hosts in the cluster. These heartbeats are exchanged between the hosts in the vSAN cluster over a pure Layer-2 network (no routing); this is where the physical networking layer comes into the picture.
The spine upgrade disrupted Layer-2 networking on the VLAN used to exchange vSAN heartbeats, causing the VMFS heartbeat timeout; the Edge VM's storage I/O stalled, its BFD timers overran their thresholds, and BGP/BFD fell victim.
There should be no network disconnects or timeouts on the Layer-2 VLAN used to exchange vSAN heartbeats.
vSAN is shared storage and is highly dependent on a healthy network for stable performance.
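As a preventive check, vSAN network health can be validated from each ESXi host. A minimal sketch (vmk2 and the peer address 192.168.10.12 are placeholder assumptions; the health command requires a recent ESXi build):
# ping a peer host's vSAN vmknic across the pure Layer-2 vSAN VLAN
vmkping -I vmk2 192.168.10.12
# run the built-in vSAN health checks, which include a network section
esxcli vsan health cluster list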