BFD flap on NSX Edge due to VMFS heartbeat timeout

Article ID: 371151


Products

VMware NSX-T Data Center, VMware NSX, VMware vSphere ESXi

Issue/Introduction

BFD flapped during a network switch upgrade. The spines (BGP/BFD peers) were being upgraded, and BFD went down during the spine upgrade.


Topology: Two-Tier Clos Architecture (Leaf-Spine). See the Additional Information section for more detail on this architecture.





The NSX Edge establishes BGP neighborship with two peers: Peer A (1.1.1.1) and Peer B (2.2.2.2).

BGP Neighbor A IP : 1.1.1.1 | BGP Source IP : A.B.C.D
BGP Neighbor B IP : 2.2.2.2 | BGP Source IP : E.F.G.H

Route              Uplink        Uplink-IP
---------------    ----------    -----------------
100.100.100.100    uplink-100    100.100.100.101/X
200.200.200.200    uplink-200    200.200.200.201/Y
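
To confirm which BGP peers and uplinks are involved, the neighbor state and uplink addressing can be checked from the Edge node CLI. A minimal sketch, assuming the Tier-0 service router is in VRF 1 (the VRF ID is a placeholder; adjust to your environment):

get logical-routers
vrf 1
get bgp neighbor summary
get interfaces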

BFD events from Edge's syslog

Log Location - /var/log/syslog

2024-06-15T21:24:43.768Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"1.1.1.1"}

2024-06-15T21:24:43.746Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"2.2.2.2"}
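
These BFD down events can be isolated in the Edge's syslog by filtering on the eventType string shown above, for example:

grep bfd_down_on_external_interface /var/log/syslog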


FRR logs

Log Location - /var/log/frr/frr.log

2024/06/15 21:24:43.746520 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-100] bfd status [Down] vrf [default] peer [1.1.1.1] local [A.B.C.D]

2024/06/15 21:24:43.757148 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-200] bfd status [Down] vrf [default] peer [2.2.2.2] local [E.F.G.H]
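
The corresponding FRR-side state changes can be filtered on the zebra PTM handler name, for example:

grep zebra_ptm_handle_bfd_msg /var/log/frr/frr.log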


BFD session

Commands -

get bfd-sessions 

get bfd-session local-ip A.B.C.D remote-ip 1.1.1.1

This will show output similar to:

"local_address": "A.B.C.D",
"remote_address": "1.1.1.1",
.
"last_local_down_diag": "Control Detection Time Expired",
 .
"last_up_time": "2024-06-15 21:25:11",
"last_down_time": "2024-06-15 21:24:43", 


"last_up_time" - "last_down_time" = BFD session flap downtime.

In this example, BFD was down for 28 seconds.
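
For illustration, the downtime can be worked out from the two timestamps with GNU date (assuming it is available on the Edge node's root shell; the timestamps below are from the example output and are assumed to share the same timezone):

date -d "2024-06-15 21:25:11" +%s     # 1718486711 (epoch seconds, assuming UTC)
date -d "2024-06-15 21:24:43" +%s     # 1718486683 (epoch seconds, assuming UTC)
# 1718486711 - 1718486683 = 28 seconds of BFD downtime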

Edge's syslog

Log Location - /var/log/syslog

syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 3667
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 3316
syslog.2.gz:2024-06-15T21:24:43.742Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 3657
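
These delayed BFD transmit/receive and wakeup warnings can be located across the current and rotated syslog files with something like:

zgrep 'exceeds maximum threshold' /var/log/syslog*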


Edge host vmkernel.log

Log Location - /var/run/log/vmkernel.log

var/run/log/vmkernel.log:2024-06-15T21:24:20.178Z cpu0:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:30.179Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:37.183Z cpu48:2097726)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:40.181Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
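
The HBX wait messages can be filtered from the host's vmkernel.log in the same way, for example:

grep 'Waiting for timed out HB' /var/run/log/vmkernel.log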


Edge host vobd.log

Log Location - /var/run/log/vobd.log

Command -

esxcli vsan debug object list --vm-name=<vsan-vm-name>

Example -

[root@ESXi:~] esxcli vsan debug object list --vm-name=Edge | egrep 'Object|Used:|Path:'
Object UUID: 08f3665f-92c0-bd89-3441-0050569469ae
   Used: 77.46 GB
   Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/01f3665f-3262-e8ce-73c1-0050569469ae/Edge.vmdk (Exists)
Object UUID: 01f3665f-3262-e8ce-73c1-0050569469ae
   Used: 1.47 GB
   Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/Edge (Exists)


In the above example, 01f3665f-3262-e8ce-73c1-0050569469ae is the disk UUID of the Edge VM.

In vobd.log, search for that UUID to confirm whether there was a heartbeat timeout and subsequent recovery.

Use a command such as the one below, substituting your own UUID (obtained by following the instructions above) for 01f3665f-3262-e8ce-73c1-0050569469ae:

grep -i 'heartbeat.timedout.*01f3665f-3262-e8ce-73c1-0050569469ae' vobd.log

If a heartbeat timeout occurred for that UUID, the grep should return log entries similar to:

2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522087155209us: [vob.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522261230788us: [esx.problem.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522114330979us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): [Timeout] [HB state xxxxxxxx offset 3178496 gen 19 stampUS 44522114326355 uuid yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy jrnl <FB 13> drv 24.82]
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522288406572us: [esx.problem.vmfs.heartbeat.recovered] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
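
The timeout and the later recovery can also be pulled out together and compared against the BFD down/up timestamps from the Edge, for example:

grep -E 'heartbeat\.(timedout|recovered)' /var/run/log/vobd.log

In this example, the heartbeat timeout window (21:24:16 to 21:24:43) overlaps the BFD down events logged on the Edge at 21:24:43.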

Environment

ESXi

NSX-T

NSX Data Center

Cause

Although the heartbeat failure affects NSX, the logging indicates a VMFS heartbeat failure, which points to an issue on the storage network connected to the ESXi host on which the Edge runs. The investigation therefore needs to extend outside of NSX.

The VMFS heartbeat timeout error is seen between ESXi hosts in a cluster. In this example, the heartbeats are exchanged between hosts in the vSAN cluster over a pure Layer-2 network (no routing).

This is what points to the physical networking layer as the most likely culprit.

From what can be seen, the Layer-2 network on the VLAN used to exchange heartbeats between hosts caused a heartbeat timeout, and in this instance BGP/BFD experienced intermittent drops because of it.

Resolution

There should be no network disconnects or timeouts on the Layer-2 network for the VLAN used to exchange VMFS heartbeats.

Shared storage is highly dependent on a healthy network for consistent/predictable performance.
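
As a basic connectivity check between hosts on the vSAN/storage VLAN, a VMkernel ping can be run from the affected ESXi host. In the sketch below, vmk1 and <peer-host-vsan-ip> are placeholders for the VMkernel interface and peer address that carry the vSAN traffic:

vmkping -I vmk1 <peer-host-vsan-ip>
vmkping -I vmk1 -s 8972 -d <peer-host-vsan-ip>    # larger, non-fragmented payload if jumbo frames are configured on this VLAN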

Please see the Additional Information section for links that will help with troubleshooting VMFS heartbeat timeouts.

Additional Information

When it comes to understanding network connection issues on storage networks, we recommend reading through these articles for a more complete understanding of the symptoms and possible resolutions:

Understanding lost access to volume messages in ESXi

Troubleshooting fibre channel storage connectivity

Troubleshooting ESXi connectivity to iSCSI arrays using software initiators

Troubleshooting LUN connectivity issues on ESXi hosts

Understanding CLOS Networks in Data Centers