BFD flapped during a network switch upgrade. The spine switches (the BGP/BFD peers) were being upgraded, and BFD went down during the spine upgrade.
Topology: Two-Tier Clos Architecture (Leaf-Spine) - see the Additional Information section for more details on this architecture.
NSX Edge establishes BGP neighborship with two peers: Peer A (1.1.1.1) and Peer B (2.2.2.2).
BGP Neighbor A IP : 1.1.1.1 | BGP Source IP : A.B.C.D
BGP Neighbor B IP : 2.2.2.2 | BGP Source IP : E.F.G.H
Route Uplink Uplink-IP
--------- ------------- -----------------
100.100.100.100 uplink-100 100.100.100.101/X
200.200.200.200 uplink-200 200.200.200.201/Y
BFD events from Edge's syslog
Log Location - /var/log/syslog
2024-06-15T21:24:43.768Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"1.1.1.1"}
2024-06-15T21:24:43.746Z NSX 4479 - [nsx@6876 comp="nsx-edge" s2comp="nsx-monitoring" entId="xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" tid="4479" level="ERROR" eventState="On" eventFeatureName="routing" eventSev="error" eventType="bfd_down_on_external_interface"] Context report: {"entity_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","sr_id":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","peer_address":"2.2.2.2"}
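To pull these events quickly, the syslog can be filtered on the event type string shown above (a simple grep; adjust the path if the logs have rotated):
grep 'eventType="bfd_down_on_external_interface"' /var/log/syslog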
FRR logs
Log Location - /var/log/frr/frr.log
2024/06/15 21:24:43.746520 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-100] bfd status [Down] vrf [default] peer [1.1.1.1] local [A.B.C.D]
2024/06/15 21:24:43.757148 ZEBRA: zebra_ptm_handle_bfd_msg: Recv Port [uplink-200] bfd status [Down] vrf [default] peer [2.2.2.2] local [E.F.G.H]
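Similarly, the FRR log can be filtered for BFD state transitions (zebra logs both Down and Up in this format):
grep 'bfd status' /var/log/frr/frr.log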
BFD session
Commands -
get bfd-sessions
get bfd-session local-ip A.B.C.D remote-ip 1.1.1.1
The output will include fields similar to the following:
"local_address": "A.B.C.D",
"remote_address": "1.1.1.1",
.
"last_local_down_diag": "Control Detection Time Expired",
.
"last_up_time": "2024-06-15 21:25:11",
"last_down_time": "2024-06-15 21:24:43",
"last_up_time" - "last_down_time" = BFD session flap downtime.
In this example, BFD was down for 28 seconds.
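To compute the delta rather than subtracting by hand, a quick shell sketch (assumes GNU date; timestamps taken from this example):
up=$(date -d '2024-06-15 21:25:11' +%s)
down=$(date -d '2024-06-15 21:24:43' +%s)
echo $((up - down))   # prints 28 (seconds the BFD session was down)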
Edge's syslog
Log Location - /var/log/syslog
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 3667
syslog.2.gz:2024-06-15T21:24:43.745Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 3316
syslog.2.gz:2024-06-15T21:24:43.742Z NSX 5659 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 3657
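Because these entries often sit in rotated logs (note the syslog.2.gz prefix above), zgrep is convenient since it reads compressed files directly:
zgrep 'exceeds maximum threshold' /var/log/syslog*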
Edge host vmkernel.log
Log Location - /var/run/log/vmkernel.log
var/run/log/vmkernel.log:2024-06-15T21:24:20.178Z cpu0:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:30.179Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:37.183Z cpu48:2097726)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
var/run/log/vmkernel.log:2024-06-15T21:24:40.181Z cpu1:2196364)HBX: 3058: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx': HB at offset 3178496 - Waiting for timed out HB:
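These heartbeat stalls can be located by filtering the vmkernel log for the HBX wait message (include rotated logs if the window of interest is older):
grep 'Waiting for timed out HB' /var/run/log/vmkernel.log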
Edge host vobd.log
Log Location - /var/run/log/vobd.log
Command -
esxcli vsan debug object list --vm-name=<vsan-vm-name>
Example from lab -
[root@ESXi:~] esxcli vsan debug object list --vm-name=Edge | egrep 'Object|Used:|Path:'
Object UUID: 08f3665f-92c0-bd89-3441-0050569469ae
Used: 77.46 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/01f3665f-3262-e8ce-73c1-0050569469ae/Edge.vmdk (Exists)
Object UUID: 01f3665f-3262-e8ce-73c1-0050569469ae
Used: 1.47 GB
Path: /vmfs/volumes/vsan:52a0628d7cd545d6-9089af9d8efe3453/Edge (Exists)
In the above example, 01f3665f-3262-e8ce-73c1-0050569469ae is the disk UUID of the Edge VM.
On the ESXi host, search /var/run/log/vobd.log for that disk UUID to see whether there is a heartbeat timeout and recovery.
Example -
grep -iE 'heartbeat\.(timedout|recovered).*01f3665f-3262-e8ce-73c1-0050569469ae' vobd.log
If a heartbeat timeout occurred on the disk UUID, the log entries should look similar to the following:
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522087155209us: [vob.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:16.558Z: [vmfsCorrelator] 44522261230788us: [esx.problem.vmfs.heartbeat.timedout] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522114330979us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy (xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx): [Timeout] [HB state xxxxxxxx offset 3178496 gen 19 stampUS 44522114326355 uuid yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy jrnl <FB 13> drv 24.82]
2024-06-15T21:24:43.734Z: [vmfsCorrelator] 44522288406572us: [esx.problem.vmfs.heartbeat.recovered] yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
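Note how the timestamps line up: the heartbeat timed out at 21:24:16 and was reclaimed at 21:24:43, which matches the BFD down events and the 28-second flap seen earlier. The same date arithmetic applies (assumes GNU date; timestamps from this example):
t1=$(date -d '2024-06-15 21:24:16' +%s)
t2=$(date -d '2024-06-15 21:24:43' +%s)
echo $((t2 - t1))   # prints 27 (approximate seconds the VMFS heartbeat was lost)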
An example of how to identify the VLAN used for vSAN traffic:
[root@esxa-01:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk2 1 IPv4 #.#.#.# 255.255.255.192 #.#.#.# 00:##:##:##:##:18 9000 65535 true STATIC defaultTcpipStack
[root@esxa-01:~] esxcfg-vswitch -l
DVS Name Num Ports Used Ports Configured Ports MTU Uplinks
RegionA01-VDS8 3220 12 512 8900 vmnic0,vmnic1
DVPort ID In Use Client
1 1 vmk2
[root@esxa-01:~] net-stats -l
PortNum Type SubType SwitchName MACAddress ClientName
67108877 3 0 DvsPortset-0 00:##:##:##:##:18 vmk2
[root@esxa-01:~] net-dvs -l
port 1:
com.vmware.common.port.alias = , propType = CONFIG
com.vmware.common.port.connectid = ########## , propType = CONFIG
com.vmware.common.port.portgroupid = dvportgroup-# , propType = CONFIG
com.vmware.common.port.block = false , propType = CONFIG
com.vmware.common.port.dvfilter = filters (num = 0):
propType = CONFIG
com.vmware.common.port.ptAllowed = 0x 0. 0 <repeats 3 times>
propType = CONFIG
<----output omitted---->
com.vmware.common.port.volatile.status = inUse linkUp portID=67108877 propType = RUNTIME
com.vmware.common.port.volatile.vlan = VLAN 1001
From the above outputs,
The vmkernel port used for vSAN traffic is vmk2.
vmk2 is connected to DVS RegionA01-VDS8 on DVPort ID 1.
Port ID 67108877 is associated with vmk2, and the VLAN associated with port ID 67108877 is 1001.
From this, we can infer that a Layer-2 networking issue on VLAN 1001 caused the VMFS heartbeat loss/timeout, and BGP/BFD fell victim to it.
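To extract the port-to-VLAN mapping in one pass instead of reading the full net-dvs output, a filter such as this can help (a convenience sketch built from the commands above; vmk2 is from this example):
net-stats -l | grep vmk2
net-dvs -l | grep -E 'port [0-9]+:|volatile.vlan'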
ESXi
VMware NSX
VMware NSX Data Center
Although the heartbeat failure affects NSX, the VMFS heartbeat failures in the logs point to an issue on the storage network connected to the ESXi host on which the Edge runs. Because of that, we need to look outside of NSX.
We know that the VMFS heartbeat timeout error is seen between ESXi hosts in a cluster. In this example, the heartbeats are exchanged between hosts in the vSAN cluster over a pure Layer-2 network (no routing).
This is what leads us to conclude that the physical network layer is the most likely culprit.
From what we can see, a Layer-2 networking issue on the VLAN used to exchange heartbeats between hosts caused a heartbeat timeout, and in this instance BGP/BFD flapped as a result.
There should be no network disconnects or timeouts on the Layer-2 network for the VLAN used to exchange VMFS heartbeats.
Shared storage is highly dependent on a healthy network for consistent/predictable performance.
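One quick way to validate Layer-2 health on the vSAN VLAN is a jumbo-frame vmkping between hosts on their vSAN vmkernel interfaces (a sketch; vmk2 is from this example, the peer address is a placeholder, and -s 8972 accounts for ICMP/IP header overhead on a 9000-byte MTU):
vmkping -I vmk2 -s 8972 -d <remote-host-vsan-ip>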
Please see the Additional Information section for links that will help with troubleshooting VMFS heartbeat timeouts.
To better understand network connection issues on storage networks, we recommend reading the following articles for a more complete picture of the symptoms and possible resolutions:
Understanding lost access to volume messages in ESXi
Troubleshooting fibre channel storage connectivity
Troubleshooting ESXi connectivity to iSCSI arrays using software initiators