Split-brain scenario on NSX/vShield Edge configured for High Availability (HA)

Article ID: 344367

Products

VMware NSX Networking
VMware vSphere ESXi

Issue/Introduction

Symptoms:
  • An application on an end virtual machine does not work properly because of network disruption on a VMware vShield Edge configured for High Availability (HA).
  • A pair of VMware vShield Edges in High Availability mode experiences a split-brain scenario.
  • Running the command show service highAvailability on the vShield Edges reports both of them in an active state.


Environment

VMware NSX for vSphere 6.1.x
VMware NSX for vSphere 6.2.x
VMware vCloud Networking and Security 5.5.x
VMware vShield Edge 5.5.x

Cause

There is a known issue in Heartbeat/Pacemaker (the HA resource manager) where a split-brain scenario occurs under certain conditions. It happens when HA packets are dropped for longer than the HA timeout and then resume shortly afterwards: the standby Edge has already declared its peer dead and become active, while the original active Edge is still running.
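
The timing behind this can be illustrated with a minimal Python sketch. It is not the Heartbeat/Pacemaker implementation; the function name, the one-second heartbeat interval, and the dead-time value of 6 are hypothetical and chosen only for illustration. It reports the moments at which a standby node would declare its peer dead even though heartbeats resume later.

```python
def false_failovers(arrivals, dead_time):
    """Return the times at which a standby would declare its peer dead
    although heartbeats from that peer resume afterwards.

    arrivals: sorted times (seconds) at which heartbeats actually arrived.
    dead_time: seconds of silence after which the peer is declared dead.
    """
    events = []
    for prev, nxt in zip(arrivals, arrivals[1:]):
        # A gap longer than dead_time triggers a failover at prev + dead_time,
        # but the peer was alive all along because a later heartbeat exists.
        if nxt - prev > dead_time:
            events.append(prev + dead_time)
    return events


# Hypothetical trace: one heartbeat per second, with the packets
# between t=10 and t=18 dropped by the network.
arrivals = [t for t in range(0, 31) if not (10 < t < 18)]

short_timeout = false_failovers(arrivals, dead_time=6)   # assumed short timeout
long_timeout = false_failovers(arrivals, dead_time=15)   # value used in the workaround
print(short_timeout, long_timeout)
```

In this trace the 8-second gap exceeds the shorter timeout but not the longer one, which is why raising the dead time makes a transient loss of HA packets less likely to cause a split-brain.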

When EdgeVm-0 is active and EdgeVm-1 is on standby, all the end virtual machines learn the MAC address of EdgeVm-0. When split-brain happens, EdgeVm-1 also becomes active and sends a Gratuitous Address Resolution Protocol (GARP) packet to update the MAC address on the end virtual machines, which then start sending traffic to EdgeVm-1. Because both Edges are now active, both answer ARP requests, and each end virtual machine learns the MAC address of whichever Edge responds first. Traffic therefore flips between EdgeVm-0 and EdgeVm-1, which results in network disruption.
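
The flapping described above can be modelled with a short sketch. It is a simplified illustration, not a capture from a real environment: the gateway IP address, the MAC addresses, and the event sequence are all hypothetical. Each ARP reply or GARP overwrites the cached entry, so the gateway MAC seen by an end virtual machine keeps changing while both Edges are active.

```python
def gateway_mac_history(events):
    """Apply ARP replies/GARPs in arrival order and return every MAC
    the cache held for each change (last writer wins)."""
    cache = {}
    history = []
    for ip, mac in events:
        if cache.get(ip) != mac:
            cache[ip] = mac          # entry is overwritten by the newest announcement
            history.append(mac)
    return history


GATEWAY_IP = "192.0.2.1"             # hypothetical gateway address
EDGE0_MAC = "00:50:56:00:00:aa"      # hypothetical MAC of EdgeVm-0
EDGE1_MAC = "00:50:56:00:00:bb"      # hypothetical MAC of EdgeVm-1

events = [
    (GATEWAY_IP, EDGE0_MAC),  # normal operation: EdgeVm-0 answers ARP
    (GATEWAY_IP, EDGE1_MAC),  # split-brain: EdgeVm-1 sends a GARP
    (GATEWAY_IP, EDGE0_MAC),  # EdgeVm-0 is still active and answers first
    (GATEWAY_IP, EDGE1_MAC),  # next refresh: EdgeVm-1 answers first
]

history = gateway_mac_history(events)
flips = len(history) - 1             # number of times the gateway MAC changed
print(history, flips)
```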

Resolution

This issue is resolved in VMware NSX for vSphere 6.2.4, available at VMware Downloads.

NSX 6.2.4 replaces Heartbeat/Pacemaker with Bidirectional Forwarding Detection (BFD) for the detection of node failures.

To work around this issue if you do not want to upgrade:
  1. Confirm that a split-brain scenario has occurred by verifying that both vShield Edges report an active state.
  2. Ensure that the two vShield Edge virtual machines can communicate (ping) with each other over the High Availability (HA) interface.

    Note: If the two vShield Edge virtual machines cannot communicate, the split-brain scenario is caused by a network issue. Repair the network, then wait to see whether the Edge pair resolves the split-brain automatically. This usually resolves the issue, and no further action is needed.
     
  3. Reboot the Edge virtual machine that is handling less traffic. To determine this, compare the change in the rx/tx counters over an interval on each Edge using this command:

    show interface vNIC_0
     
  4. After a successful reboot of the first vShield Edge virtual machine, reboot the other Edge virtual machine.
  5. After a successful reboot of the second Edge virtual machine, increase the DeclaredDeadTime to 15 through the UI or the REST API.

    URI for 6.0.x edges : /api/4.0/edges/{edgeId}/highavailability/config
    URI for 5.x edges : /api/3.0/edges/{edgeId}/highavailability/config
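
As a rough illustration of step 5, the sketch below only builds the request URL and an XML body; it does not send anything. The helper name, the manager host name, the edge ID, and the XML element names (highAvailability, enabled, declareDeadTime) are assumptions and should be checked against the API guide for your release. It is usually safer to retrieve the current HA configuration first, change only the dead-time value, and submit the full document back, so that other HA settings are not lost.

```python
import xml.etree.ElementTree as ET


def build_ha_update(manager_host, edge_id, dead_time, api_version="4.0"):
    """Build the URL and XML body for an HA configuration update.

    Element names below are assumptions, not confirmed by this article.
    Use api_version="3.0" for 5.x edges, per the URIs listed above.
    """
    url = (f"https://{manager_host}/api/{api_version}"
           f"/edges/{edge_id}/highavailability/config")
    root = ET.Element("highAvailability")             # assumed root element
    ET.SubElement(root, "enabled").text = "true"      # assumed element name
    ET.SubElement(root, "declareDeadTime").text = str(dead_time)  # assumed element name
    body = ET.tostring(root, encoding="unicode")
    return url, body


# Hypothetical manager host name and edge ID, for illustration only.
url, body = build_ha_update("nsxmgr.example.com", "edge-1", 15)
print(url)
print(body)
```

The resulting URL and body could then be submitted with an authenticated HTTP client of your choice.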
 
Note: VMware recommends configuring a dedicated vNIC/pNIC for the HA interface.
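
For step 3 of the workaround, the comparison of rx/tx counters can be sketched as follows. The counter values are made up, and the sketch assumes you have copied the rx and tx packet counts by hand from two runs of show interface vNIC_0 on each Edge, taken the same interval apart; it does not parse the command output.

```python
def traffic_delta(before, after):
    """Total packets (rx + tx) handled between two counter samples."""
    return (after["rx"] - before["rx"]) + (after["tx"] - before["tx"])


def edge_to_reboot_first(samples):
    """Return the name of the Edge that handled the least traffic.

    samples: {edge_name: (before, after)}, each sample a dict with
    'rx' and 'tx' counters read over the same interval.
    """
    return min(samples, key=lambda name: traffic_delta(*samples[name]))


# Hypothetical counter readings taken 60 seconds apart on each Edge.
samples = {
    "EdgeVm-0": ({"rx": 100000, "tx": 90000}, {"rx": 160000, "tx": 140000}),
    "EdgeVm-1": ({"rx": 50000, "tx": 40000}, {"rx": 52000, "tx": 41500}),
}

first = edge_to_reboot_first(samples)
print(first)
```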


Additional Information

To be alerted when this article is updated, click Subscribe to Article in the Actions box.
