09:30:09.479Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dpc-agent" tname="dp-ipc31" level="INFO"] Processing nsx-agent request type: delta_config_file
09:30:10.417Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Up
09:30:14.721Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dpc-pb" tname="dp-ipc31" level="INFO"] Processing delta config msg version 12345678 from nsx-agent
09:30:14.721Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 5217
09:30:14.788Z <primary-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Down
09:30:16.613Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="app-ha-bridge" level="WARN"] Bridge HA: split brain heal for ########-5e77-fc40-c783-############
09:30:20.774Z <standby-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Down
09:30:20.903Z <primary-edge> NSX1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="lport" level="INFO"] lport ########-721e-42f8-84d5-############ HA Op state changed to Up
VMware NSX
In certain cases, software or hardware issues can lead to the BFD thread getting blocked.
NSX edge nodes in Active/Standby mode use BFD packets to detect whether their peer is alive, for high availability.
BFD packets are sent every 300ms with a threshold of 3: if 3 consecutive packets are missed, the peer is assumed to be down.
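The arithmetic behind this threshold can be sketched as follows. This is a minimal illustration, not NSX code: the function and constant names are hypothetical, and the INTV value from the log above is assumed to be in milliseconds.

```python
# Values taken from the text: 300 ms send interval, threshold of 3 missed packets.
BFD_TX_INTERVAL_MS = 300
BFD_MISS_THRESHOLD = 3


def detection_time_ms(tx_interval_ms: int, miss_threshold: int) -> int:
    """Silence (in ms) after which the peer is assumed to be down."""
    return tx_interval_ms * miss_threshold


def peer_assumed_down(silence_ms: int) -> bool:
    """True if the peer has been silent for at least the detection time."""
    return silence_ms >= detection_time_ms(BFD_TX_INTERVAL_MS, BFD_MISS_THRESHOLD)


print(detection_time_ms(BFD_TX_INTERVAL_MS, BFD_MISS_THRESHOLD))  # 900
# INTV: 5217 from the log above, assumed to be milliseconds.
print(peer_assumed_down(5217))  # True
```

With these values the peer is declared down after 900 ms of silence, so a stall of roughly 5 seconds is well past the limit.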
From the log entry above, "BFD module wakeup interval exceeds maximum threshold. INTV: 5217", we can see that the Active edge node did not process any BFD packets for approximately 5 seconds and thus exceeded the threshold.
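To check an edge log for this entry, a small script along the following lines may help. It is a sketch only: the function name is hypothetical, and it assumes the message format and leading timestamp shown in the log excerpt above.

```python
import re

# Matches the warning text exactly as it appears in the log excerpt above.
INTV_PATTERN = re.compile(
    r"BFD module wakeup interval exceeds maximum threshold\. INTV: (\d+)"
)


def extract_bfd_stalls(lines):
    """Return (timestamp, INTV) pairs for every BFD wakeup warning found."""
    stalls = []
    for line in lines:
        match = INTV_PATTERN.search(line)
        if match:
            # The timestamp is assumed to be the first space-separated field.
            timestamp = line.split(" ", 1)[0]
            stalls.append((timestamp, int(match.group(1))))
    return stalls


sample = (
    '09:30:14.721Z <primary-edge> NSX 8723 FABRIC [nsx@6876 comp="nsx-edge" '
    'subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] '
    "BFD module wakeup interval exceeds maximum threshold. INTV: 5217"
)
print(extract_bfd_stalls([sample]))  # [('09:30:14.721Z', 5217)]
```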
Because the Standby edge node missed 3 consecutive BFD packets from the Active edge node, it assumed the Active edge node was down and became Active, as can be seen in the Standby edge node log entry "HA Op state changed to Up".
When the blocked thread resumed, the Active edge node detected that it had not received any BFD packets for the duration of the block, in this case approximately 5 seconds. It then put itself in a down state, as can be seen from the log entry "HA Op state changed to Down" on the Active edge node.
For a brief period, both the Active and Standby edge nodes were active: from the moment the Active edge node resumed after the thread block cleared until it put itself down. A split brain occurred during that period.
We then see the Standby edge node initiate a heal for the split brain in the log entry "Bridge HA: split brain heal for ########-5e77-fc40-c783-############".
The split brain heal took a few seconds. When it completed, the Standby edge node reverted to Standby and the Active edge node assumed its role as Active again, as can be seen in the log entries "HA Op state changed to Down" on the Standby edge node and "HA Op state changed to Up" on the Active edge node.
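The sequence of HA state changes can be reconstructed from the logs of both edge nodes with a sketch such as the one below. The function names are hypothetical, the sample lines are abbreviated copies of the entries above (structured data replaced by "[...]"), and the host names are the placeholders used in this article.

```python
import re
from datetime import datetime

# Timestamp, host placeholder and new state, as laid out in the excerpt above.
HA_PATTERN = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d{3})Z <([^>]+)> .*HA Op state changed to (Up|Down)"
)


def ha_events(lines):
    """Return (time, host, state) tuples sorted by time."""
    events = []
    for line in lines:
        match = HA_PATTERN.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((ts, match.group(2), match.group(3)))
    return sorted(events)


def seconds_between(events, host, first_state, second_state):
    """Seconds from the host's first `first_state` to its next `second_state`."""
    start = None
    for ts, event_host, state in events:
        if event_host != host:
            continue
        if start is None and state == first_state:
            start = ts
        elif start is not None and state == second_state:
            return (ts - start).total_seconds()
    return None


# Abbreviated from the log entries at the top of this article.
sample = [
    "09:30:10.417Z <standby-edge> NSX1 FABRIC [...] HA Op state changed to Up",
    "09:30:14.788Z <primary-edge> NSX1 FABRIC [...] HA Op state changed to Down",
    "09:30:20.774Z <standby-edge> NSX1 FABRIC [...] HA Op state changed to Down",
    "09:30:20.903Z <primary-edge> NSX1 FABRIC [...] HA Op state changed to Up",
]
events = ha_events(sample)
print(seconds_between(events, "standby-edge", "Up", "Down"))  # 10.357
print(seconds_between(events, "primary-edge", "Down", "Up"))  # 6.115
```

For this excerpt, the Standby edge node held the Active role for about 10.4 seconds and the original Active edge node was down for about 6.1 seconds.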
The failback (the original Active edge node resuming the Active role) occurred because the setup was using preemptive mode. In non-preemptive mode there would have been no failback, meaning the Standby edge node would have carried on as the Active edge node.
It was also identified that when rogue endpoint protection is enabled on the physical switches, the outage can last longer due to the policies enforced there.
This is a known issue impacting VMware NSX. In VMware NSX 4.2.1, logging has been improved and the frequency of Linux soft lockup detection has been increased to every 2 seconds, to detect when a process goes into a blocked state.
If you encounter this issue, we advise upgrading to 4.2.1. If the issue occurs again, collect the manager and edge support bundles, engage the physical switch vendor and, if required, open a support request with Broadcom support.
If you are contacting Broadcom support about this issue, please provide the following:
Handling Log Bundles for offline review with Broadcom support
If this KB did not resolve the issue for you, please review the KB Troubleshooting NSX Edge High Availability for further troubleshooting steps.