North/South datapath impact on VMware NSX bare metal edge node bridging traffic due to BFD thread block

Products

VMware NSX

Issue/Introduction

Bare metal Edge nodes are configured in Active/Standby mode with Preemptive mode enabled.
Datapath is down between physical servers and virtual machines behind the bridges.
The physical fabric uses rogue endpoint protection policies, which may block endpoints (MAC/IP) that flap between ports on the physical side.
On the standby edge node, which became active, log messages similar to the following can be seen in /var/log/syslog:

09:30:09.479Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dpc-agent" tname="dp-ipc31" level="INFO"] Processing nsx-agent request type: delta_config_file

09:30:10.417Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Up

09:30:14.721Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dpc-pb" tname="dp-ipc31" level="INFO"] Processing delta config msg version 12345678 from nsx-agent

09:30:14.721Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 5217

09:30:14.788Z <primary-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Down

09:30:16.613Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="app-ha-bridge" level="WARN"] Bridge HA: split brain heal for ########-5e77-fc40-c783-############

09:30:20.774Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Down

09:30:20.903Z <primary-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Up

Environment

VMware NSX

Cause

In certain cases, software or hardware issues can lead to the BFD thread getting blocked.

NSX edge nodes in Active/Standby mode use BFD packets to detect whether the peer is alive, for high availability.
BFD packets are sent every 300 ms with a threshold of 3: if 3 consecutive packets are missed (roughly 900 ms without a packet), the peer is assumed to be down.

The log entry above, "BFD module wakeup interval exceeds maximum threshold. INTV: 5217", shows that the Active edge node did not process any BFD packets for approximately 5 seconds and therefore exceeded the threshold.
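The comparison between the BFD detection time and the reported wakeup interval can be sketched as follows. This is a minimal illustration, not an NSX tool: it assumes the INTV value is reported in milliseconds (consistent with the "approx. 5 seconds" reading above), and the function names are hypothetical.

```python
import re

BFD_INTERVAL_MS = 300  # transmit interval stated in this article
BFD_MULTIPLIER = 3     # consecutive missed packets before the peer is declared down

# Matches the warning text quoted in this article and captures the INTV value.
INTV_RE = re.compile(
    r"BFD module wakeup interval exceeds maximum threshold\. INTV: (\d+)"
)


def detection_time_ms(interval_ms=BFD_INTERVAL_MS, multiplier=BFD_MULTIPLIER):
    """Time without BFD packets after which the peer is assumed down."""
    return interval_ms * multiplier


def parse_wakeup_interval(line):
    """Return the INTV value from a syslog line, or None if the line does not match."""
    match = INTV_RE.search(line)
    return int(match.group(1)) if match else None


# Message text copied from the log excerpt in this article.
sample = (
    'level="WARN"] BFD module wakeup interval exceeds maximum threshold. '
    "INTV: 5217"
)

intv_ms = parse_wakeup_interval(sample)  # assumed to be milliseconds
if intv_ms is not None and intv_ms > detection_time_ms():
    print(
        f"BFD thread stalled for {intv_ms} ms; "
        f"peer detection time is {detection_time_ms()} ms"
    )
```

With the values from this article, the stall is several times longer than the detection time, which is why the peer declared the Active edge node down.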

Because the Standby edge node did not receive 3 consecutive BFD packets from the Active edge node, it assumed the Active edge node was down and became Active, as shown by the Standby edge node log entry "HA Op state changed to Up".

When the blocked thread resumed, the Active edge node detected that it had not received any BFD packets for the duration of the block (in this case approx. 5 seconds) and put itself in a down state, as shown by the log entry "HA Op state changed to Down" on the Active edge node.

For a brief period, both the Active and Standby edge nodes were active: between the moment the Active edge node resumed after the thread block cleared and the moment it put itself down. A split brain occurred during that period.
The Standby edge node then initiated a heal for the split brain, as shown by the log entry "Bridge HA: split brain heal for ########-5e77-fc40-c783-############".
The split brain heal took a few seconds. When it completed, the Standby edge node reverted to Standby and the Active edge node resumed its Active role, as shown by the log entries "HA Op state changed to Down" on the Standby edge node and "HA Op state changed to Up" on the Active edge node.
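The sequence of HA state changes described above can be reconstructed from the syslog entries with a short script. The following is a sketch only: it assumes the time-only timestamp format and the <primary-edge>/<standby-edge> host placeholders used in the excerpt in this article (logs from a live system may carry full dates and real hostnames, which would need a different pattern), and the function names are hypothetical.

```python
import re
from datetime import datetime

# Captures timestamp, host placeholder, and new HA state from an lport log line.
HA_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})Z <([^>]+)> .*HA Op state changed to (Up|Down)"
)

# The four lport entries from the log excerpt in this article.
SAMPLE = [
    '09:30:10.417Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Up',
    '09:30:14.788Z <primary-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Down',
    '09:30:20.774Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Down',
    '09:30:20.903Z <primary-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Up',
]


def ha_transitions(lines):
    """Return (timestamp, node, state) tuples sorted by time."""
    events = []
    for line in lines:
        match = HA_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((ts, match.group(2), match.group(3)))
    return sorted(events)


def seconds_in_state(events, node, state):
    """Seconds from the node's first entry into `state` until its next change."""
    entered = None
    for ts, event_node, event_state in events:
        if event_node != node:
            continue
        if entered is None and event_state == state:
            entered = ts
        elif entered is not None and event_state != state:
            return (ts - entered).total_seconds()
    return None


events = ha_transitions(SAMPLE)
for ts, node, state in events:
    print(ts.strftime("%H:%M:%S.%f")[:-3], node, state)

print("standby-edge held Up for", seconds_in_state(events, "standby-edge", "Up"), "s")
print("primary-edge held Down for", seconds_in_state(events, "primary-edge", "Down"), "s")
```

Because the excerpt's timestamps carry no date, the sketch assumes all entries fall on the same day.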

The failback (from the Standby edge node to the Active edge node) occurred because the setup was using preemptive mode. In non-preemptive mode there would have been no failback, and the Standby edge node would have continued as the Active edge node.

It was also identified that when rogue endpoint protection is enabled on the physical switches, the outage can last longer because of the policies enforced there.

Resolution

This is a known issue impacting VMware NSX. In VMware NSX 4.2.1, logging has been improved and the Linux soft lockup detection frequency has been increased to every 2 seconds, to detect when a process goes into a blocked state.

If you encounter this issue, we advise upgrading to 4.2.1. If the issue occurs again, collect the NSX Manager and edge support bundles, engage the physical switch vendor and, if required, open a support request with Broadcom support.

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:

NSX Manager and edge node support bundles.
If Rogue Endpoint Protection was triggered, details of the investigation by the physical switch vendor.

Handling Log Bundles for offline review with Broadcom support

If this KB did not resolve the issue for you, please review the KB Troubleshooting NSX Edge High Availability for further troubleshooting steps.