A manual failover for a BMFW (Bare Metal Firewall) Edge node cluster was triggered using the command "set bridge <uuid> state active". Following the failover, both BMFW Edge nodes in the cluster entered standby mode, resulting in a split-brain scenario. With both BME (Bare Metal Edge) nodes in standby mode, no Edge node was forwarding traffic, which caused a dataplane application outage.
A high quiesce_blocked_time or dp-ipc time was observed in the Edge node syslog under /var/log. Refer to the log snippet example below.
YYYY-MM-DDTHH:MM:SS.SSSZ <Edge Node Name> NSX #### SYSTEM [nsx@#### comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 128000 ms waiting for dp-ipc43 to quiesce
YYYY-MM-DDTHH:MM:SS.SSSZ <Edge Node Name> NSX #### SYSTEM [nsx@#### comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN" eventId="vmwNSXRCUBlockStatus"] {"event_state":0,"event_external_reason":"dp-ipc43 thread blocked to enter RCU quiesce state","event_src_comp_id":"########-####-####-####-############","event_sources":{"process_name":"dp-fp:0#012","thread_id":"dpipc43","quiesce_blocked_time":"128000"}}
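To check a copy of the Edge node syslog for these entries offline, a short script along the following lines may help. This is a minimal sketch in Python, not an NSX tool: the function name find_quiesce_blocks and the 60000 ms default threshold are assumptions chosen for illustration, and the patterns are based only on the two log entries shown above.

```python
import json
import re

# Plain-text warning, e.g. "blocked 128000 ms waiting for dp-ipc43 to quiesce"
BLOCKED_RE = re.compile(r"blocked (\d+) ms waiting for (\S+) to quiesce")
# JSON payload that follows the vmwNSXRCUBlockStatus event header
EVENT_RE = re.compile(r'eventId="vmwNSXRCUBlockStatus"\]\s*(\{.*\})')


def find_quiesce_blocks(lines, threshold_ms=60000):
    """Return (thread, blocked_ms) pairs at or above threshold_ms.

    threshold_ms is a hypothetical cut-off, not a documented NSX value.
    """
    hits = []
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            try:
                payload = json.loads(event.group(1))
            except json.JSONDecodeError:
                continue  # skip entries whose payload is truncated
            sources = payload.get("event_sources", {})
            thread = sources.get("thread_id")
            blocked_ms = int(sources.get("quiesce_blocked_time", 0))
        else:
            plain = BLOCKED_RE.search(line)
            if not plain:
                continue  # unrelated log line
            blocked_ms = int(plain.group(1))
            thread = plain.group(2)
        if blocked_ms >= threshold_ms:
            hits.append((thread, blocked_ms))
    return hits
```

The function accepts any iterable of lines, so an open file handle for the collected syslog can be passed directly. Each warning is usually logged twice (once as plain text and once as an event), so the same blocked thread may appear in the result under both name forms.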
VMware NSX
The command "set bridge <uuid> state active" is a forceful operation that immediately switches an Edge node to the Active state. It performs no checks to prevent a scenario in which both Edge nodes enter standby mode, resulting in service disruption.
In the event of a split-brain scenario where both Edge nodes in the cluster are in standby mode, a reboot of the Edge node is required to restore functionality.
The recommended failover approach is to place the BME (Bare Metal Edge) into NSX Maintenance Mode to ensure a controlled and graceful high availability (HA) failover process. If the BME remains unresponsive for an extended period, a reboot may be performed to trigger an immediate failover.