Best Practices to Prevent Split-Brain and Dataplane Outages During Edge Node Cluster Manual Failover

Article ID: 397982

Products

VMware NSX

Issue/Introduction

A manual failover of a BMFW (Bare Metal Firewall) Edge node cluster was triggered with the command "set bridge <uuid> state active". After the failover, both BMFW Edge nodes in the cluster entered standby mode, resulting in a split-brain scenario. Because both BME (Bare Metal Edge) nodes remained in standby mode, no Edge node was forwarding traffic, which caused a dataplane outage for applications.

The Edge node syslog under /var/log showed a dp-ipc thread blocked for a long time (a high quiesce_blocked_time), as in the example log lines below.

YYYY-MM-DDTHH:MM:SS.SSSZ <Edge Node Name> NSX #### SYSTEM [nsx@#### comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN"] blocked 128000 ms waiting for dp-ipc43 to quiesce
YYYY-MM-DDTHH:MM:SS.SSSZ <Edge Node Name> NSX #### SYSTEM [nsx@#### comp="nsx-edge" subcomp="datapathd" s2comp="ovs-rcu" tname="dp-si-purge5" level="WARN" eventId="vmwNSXRCUBlockStatus"] {"event_state":0,"event_external_reason":"dp-ipc43 thread blocked to enter RCU quiesce state","event_src_comp_id":"########-####-####-####-############","event_sources":{"process_name":"dp-fp:0#012","thread_id":"dpipc43","quiesce_blocked_time":"128000"}}
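
The log lines above can be scanned for automatically. The following Python sketch is illustrative only and is not part of NSX: it assumes the vmwNSXRCUBlockStatus event keeps the JSON layout shown in the example, and the function name and the 60000 ms default threshold are hypothetical choices, not values documented by VMware.

```python
import json
import re

# Captures the trailing JSON payload of a vmwNSXRCUBlockStatus event line,
# assuming the line layout shown in the example log snippet.
EVENT_RE = re.compile(r'eventId="vmwNSXRCUBlockStatus"\]\s*(\{.*\})\s*$')


def blocked_quiesce_events(lines, threshold_ms=60000):
    """Yield (thread_id, blocked_ms) for RCU block events at or above threshold_ms.

    threshold_ms is an illustrative default, not a VMware-documented limit.
    """
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue  # not an RCU block status event
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # payload truncated or malformed
        sources = payload.get("event_sources", {})
        try:
            blocked_ms = int(sources.get("quiesce_blocked_time", ""))
        except (TypeError, ValueError):
            continue  # field missing or not numeric
        if blocked_ms >= threshold_ms:
            yield sources.get("thread_id", "unknown"), blocked_ms
```

To use it, pass any iterable of syslog lines, for example an open file object for the Edge node syslog, and review the reported threads before attempting a failover.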

Environment

VMware NSX

Cause

The command "set bridge <uuid> state active" is a forceful operation that immediately switches an Edge node to the Active state. It performs no checks to prevent a scenario in which both Edge nodes enter standby mode, which results in service disruption.

Resolution

In the event of a split-brain scenario where both Edge nodes in the cluster are in standby mode, a reboot of the Edge node is required to restore functionality.
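
The condition described above, where no node in the cluster is active, can be expressed as a simple check. This is a minimal sketch, assuming the operator has already collected each node's HA state by hand. The function name, the node names, and the "active"/"standby" strings are hypothetical and do not correspond to any NSX API or CLI output format.

```python
def ha_state_summary(node_states):
    """Summarize operator-collected HA states.

    node_states maps a node name to a state string such as "active" or
    "standby" (hypothetical input, not an NSX output format).
    """
    active = sorted(
        name
        for name, state in node_states.items()
        if state.strip().lower() == "active"
    )
    # With no active node, nothing forwards traffic: the outage in this article.
    return {"active_nodes": active, "outage": len(active) == 0}
```

For example, ha_state_summary({"edge-01": "standby", "edge-02": "standby"}) reports an outage, which is the state in which a reboot of the Edge node is required.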

The recommended failover approach is to place the BME (Bare Metal Edge) into NSX Maintenance Mode to ensure a controlled and graceful high availability (HA) failover process. If the BME remains unresponsive for an extended period, a reboot may be performed to trigger an immediate failover.
