“Split Brain detected” error in NSX 6.2.3 DLR HA nodes

Article ID: 336569

Updated On:

Products

VMware NSX

Issue/Introduction

Symptoms:
  • Both primary and secondary HA nodes are in an Active state
  • In the NSX Manager System Events log, you see an entry similar to:

    Critical, Code:'30205', 'Split Brain detected for NSX Edge edge-X with HighAvailability.

    Note: For additional symptoms and log entries, see the Additional Information section.
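
When exported System Events need to be scanned for this symptom, a filter along these lines may help. It is a minimal sketch that assumes each event is available as one line of text in the format shown above; the function name `split_brain_edges` and the regular expression are illustrative and not part of any NSX tooling.

```python
import re

# Matches the event text shown in this article and captures the edge ID.
SPLIT_BRAIN_EVENT = re.compile(
    r"Code:'30205'.*?Split Brain detected for NSX Edge (?P<edge>edge-\S+)"
)


def split_brain_edges(event_lines):
    """Return the edge IDs named in Split Brain (code 30205) events."""
    edges = []
    for line in event_lines:
        match = SPLIT_BRAIN_EVENT.search(line)
        if match:
            edges.append(match.group("edge"))
    return edges
```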


Environment

VMware NSX for vSphere 6.2.x

Cause

NSX 6.2.3 uses Bidirectional Forwarding Detection (BFD) to detect High Availability (HA) node availability.

If the HA nodes run for 24 days, BFD goes down and gets stuck in an init state. In this state, the node is unable to send BFD control packets, but it continues to receive them.

When BFD is down, ARP probes are sent through other interfaces as a backup mechanism to detect whether the other node is still reachable. If there is a response, the nodes stay in an Active/Standby state and no issues are experienced. If there is no response, the Standby node moves to an Active state.

When both HA nodes are in an Active state at the same time (Active/Active), the condition is known as Split Brain and can result in network disruption. This state persists until corrective action is taken to bring BFD back up.
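
The failover behavior described in this section can be modeled as a small decision function. This is an illustrative sketch only, not the NSX Edge implementation; the names `HaState`, `next_standby_state`, and `is_split_brain` are hypothetical.

```python
from enum import Enum


class HaState(Enum):
    ACTIVE = "Active"
    STANDBY = "Standby"


def next_standby_state(bfd_up: bool, arp_probe_answered: bool) -> HaState:
    """Return the state the Standby node ends up in (simplified model)."""
    if bfd_up:
        # BFD confirms the peer is alive, so nothing changes.
        return HaState.STANDBY
    if arp_probe_answered:
        # BFD is down, but the ARP backup probe reached the peer.
        return HaState.STANDBY
    # Neither BFD nor ARP can reach the peer: the Standby node takes over.
    return HaState.ACTIVE


def is_split_brain(node_a: HaState, node_b: HaState) -> bool:
    """Split Brain means both HA nodes are Active at the same time."""
    return node_a is HaState.ACTIVE and node_b is HaState.ACTIVE
```

In this model, Split Brain arises when the original Active node is still running while `next_standby_state(False, False)` moves the Standby node to Active.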

Resolution

This issue is resolved in VMware NSX for vSphere 6.2.4.

If you are unable to upgrade at this time, or if you are already in a Split Brain situation:

To work around this issue, reboot both DLR nodes.

Note: The DLR nodes can be rebooted in any order.

To prevent this issue from occurring, use one of these options:

  • Disable DLR HA. 
  • Reboot both DLR nodes every 20 days while in maintenance mode.

    Note: The DLR nodes can be rebooted in any order.
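
To keep track of the 20-day preventive reboot, the arithmetic can be sketched as follows. The 24-day and 20-day values come from this article; the function names and the way the last reboot time is obtained are assumptions made for illustration.

```python
from datetime import datetime, timedelta

BFD_FAILURE_AFTER = timedelta(days=24)  # uptime after which BFD goes down (see Cause)
REBOOT_INTERVAL = timedelta(days=20)    # preventive reboot interval from this article


def reboot_due(last_reboot: datetime, now: datetime) -> bool:
    """True once the node has been up for the preventive interval or longer."""
    return now - last_reboot >= REBOOT_INTERVAL


def time_until_bfd_failure(last_reboot: datetime, now: datetime) -> timedelta:
    """Remaining uptime before the 24-day mark, never negative."""
    remaining = BFD_FAILURE_AFTER - (now - last_reboot)
    return max(remaining, timedelta(0))
```

Rebooting at 20 days leaves a margin of 4 days before the 24-day mark.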



Additional Information

You experience these additional symptoms:

  • When you run the show service highavailability command, you see a message similar to:

    Session via vNic_0: x.x.x.x:x.x.x.x Unreachable
  • When you run the ping command between the HA interfaces of both nodes, the ping is successful
  • When you run the show service highavailability internal command, you see similar entries in HA nodes:

    Session State Local /Remote/LCP diag
    (x.x.x.x:x.x.x.x) init Nbor Sgld Down/Active/Active

    Session State Local /Remote/LCP diag
    (x.x.x.x:x.x.x.x) down Ctl Exprd/No Diag/Active
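
If the command output is captured as text, the session state can be extracted with a small parser. This is a minimal sketch that assumes the output resembles the samples above; the regular expression, the function names, and the `up` state for a healthy session are assumptions, and the real output may contain additional lines or columns.

```python
import re

# Matches a line such as "(x.x.x.x:x.x.x.x) init Nbor Sgld Down/Active/Active".
# The parentheses around the session are treated as optional.
SESSION_LINE = re.compile(
    r"^\s*\(?(?P<session>[^\s()]+)\)?\s+(?P<state>init|down|up)\s+(?P<diag>.+?)\s*$",
    re.IGNORECASE,
)


def parse_sessions(output: str) -> list[dict]:
    """Return one dict per session line; header and other lines are skipped."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_LINE.match(line)
        if match:
            sessions.append(
                {
                    "session": match.group("session"),
                    "state": match.group("state").lower(),
                    "diag": match.group("diag"),
                }
            )
    return sessions


def bfd_not_up(output: str) -> bool:
    """True if any parsed session reports a state other than 'up'."""
    return any(s["state"] != "up" for s in parse_sessions(output))
```

A session reported as init or down on the HA nodes, combined with a successful ping between the HA interfaces, matches the symptoms listed in this section.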