BFD flap on NSX Edge due to vmfs heartbeat timeout.
search cancel

BFD flap on NSX Edge due to vmfs heartbeat timeout.

book

Article ID: 371151

calendar_today

Updated On:

Products

VMware NSX VMware NSX-T Data Center VMware NSX

Issue/Introduction

BFD flapped during network switch upgrade. Spine's (BGP/BFD peer) were being upgraded. BFD went down during the Spine's upgrade.


Topology : Two-Tier Clos Architecture (Leaf-Spine)





NSX Edge establishes BGP Neighborship with 2 peers. Peer A (1.1.1.1) and Peer B (2.2.2.2).

BGP Neighbor A IP : 1.1.1.1 | BGP Source IP : A.B.C.D
BGP Neighbor B IP : 2.2.2.2 | BGP Source IP : E.F.G.H

Route                      Uplink               Uplink-IP
---------                    -------------          -----------------
100.100.100.100    uplink-100       100.100.100.101/X
200.200.200.200    uplink-200       200.200.200.201/Y


BFD events from Edge's syslog :

/var/log/syslog :

2024-06-15T21:24:43.768Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"1.1.1.1"}

2024-06-15T21:24:43.746Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"2.2.2.2"}


FRR logs :

/var/log/frr/frr.log :

2024/06/15 21:24:43.746520 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-100] bfd status [Down] vrf [default] peer [1.1.1.1] local [A.B.C.D]

2024/06/15 21:24:43.757148 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-200] bfd status [Down] vrf [default] peer [2.2.2.2] local [E.F.G.H]


BFD session :

edge> get bfd-sessions 

edge> get bfd-session local-ip A.B.C.D remote-ip 1.1.1.1

"local_address": "A.B.C.D",
"remote_address": "1.1.1.1",
.
"last_local_down_diag": "Control Detection Time Expired",
 .
"last_up_time": "2024-06-15 21:25:11",
"last_down_time": "2024-06-15 21:24:43", 


"last_up_time" - "last_down_time" = BFD session flap downtime.

In this example, BFD was down for 28 seconds.


Edge's syslog :

syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 3667
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 3316
syslog.2.gz:2024-06-15T21:24:43.742Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 3657


Edge host vmkernel.log :

var/run/log/vmkernel.log:2024-06-15T21:24:20.178Z cpu0:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:30.179Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:37.183Z cpu48:2097726)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:40.181Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:


Edge host vobd.log :

Command : esxcli vsan debug object list --vm-name=<vsan-vm-name>

Example :

[root@ESXi:~] esxcli vsan debug object list --vm-name=Edge | egrep 'Object|Used:|Path:'
Object UUID: 08f3665f-92c0-bd89-3441-0050569469ae
   Used: 77.46 GB
   Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/01f3665f-3262-e8ce-73c1-0050569469ae/Edge.vmdk (Exists)
Object UUID: 01f3665f-3262-e8ce-73c1-0050569469ae
   Used: 1.47 GB
   Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/Edge (Exists)


In the above example, 01f3665f-3262-e8ce-73c1-0050569469ae is the disk UUID of Edge VM.


xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx - Edge disk UUID

2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522087155209us: [vob.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522261230788us: [esx.problem.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522114330979us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): [Timeout] [HB state xxxxxxxx offset 3178496 gen 19 stampUS 44522114326355 uuid yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy jrnl <FB 13> drv 24.82]
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522288406572us: [esx.problem.vmfs.heartbeat.recovered] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx


 

Environment

VMware NSX-T
VMware NSX Data Center

Cause

vmfs heartbeat timeout error is seen between ESXi's in its cluster. These heartbeat are exchanged between hosts in the vSAN cluster on pure Layer-2 network (no routing). This is where physical networking layer is coming into picture.

Layer-2 networking on vlan used to exchange vSAN heartbeat caused vmfs heartbeat timeout and BGP/BFD fell victim.


Resolution

There should be no network disconnect/timeouts on Layer-2 networking for the vlan used to exchange vSAN heartbeat.

vSAN is a shared storage and highly dependent on a healthy network for better performance.