HA Split Brain is the event in which peer members in the same High Availability setup lose communication over their heartbeat interface GE1.
This feature relies heavily on the Primary Gateway to resolve the role of each device in HA.
The Split Brain recovery mechanism was introduced in the 3.x train. In the current behavior, when HA splits, the Standby takes over as Active and sends an HA_REINIT to the Gateway. The Gateway honors the request and accepts this unit as the new Active; at the same time, it sends a GO_STANDBY to the old Active.
The following logs on the Gateway show the operation described above:
dbgctl -i | grep -e HAD -e hamsg -e ACTIVE -e STANDBY
2019-06-18T07:37:16.304 SEVERE [VCMP] vcmp_handle_ha_init_req:851 td:310562882. New request from bc952f3e. Send GO_STANDBY to edge with serial:d6e1b93e
2019-06-18T07:37:16.304 INFO [VCMP] vcmp_send_ha_go_standby_for_peer:3829 Send HA GO_STANDBY from GW for d6e1b93e to td:310562882 (bc952f3e)
2019-06-18T07:37:16.304 INFO [VCMP] vcmp_send_ha_go_standby_for_peer:3829 Send HA GO_STANDBY from GW for d6e1b93e to td:310567019 (d6e1b93e)
2019-06-18T07:37:16.304 MSG [VCMP] hamsg_send_evt_go_standby:175 Send HA EVENT on td 310567019 (d6e1b93e).
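When the Gateway log is long, the GO_STANDBY sends can be pulled out with a short script. The following is a minimal sketch, not a supported tool: the regular expression is modeled only on the vcmp_send_ha_go_standby_for_peer lines shown above, the field names standby_serial and td_serial are hypothetical labels, and reading the serial after "for" as the Edge being sent to Standby is an assumption based on the first log line.

```python
import re

# Pattern modeled on the vcmp_send_ha_go_standby_for_peer lines in this article.
# Group names are hypothetical labels, not official field names.
GO_STANDBY_RE = re.compile(
    r"^(?P<ts>\S+)\s+\w+\s+\[VCMP\]\s+vcmp_send_ha_go_standby_for_peer:\d+\s+"
    r"Send HA GO_STANDBY from GW for (?P<standby_serial>[0-9a-f]+) "
    r"to td:(?P<td>\d+) \((?P<td_serial>[0-9a-f]+)\)"
)


def parse_go_standby(lines):
    """Return one dict per GO_STANDBY send line found in the given log lines."""
    events = []
    for line in lines:
        match = GO_STANDBY_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


# Sample input copied from the Gateway log above.
SAMPLE = [
    "2019-06-18T07:37:16.304 SEVERE [VCMP] vcmp_handle_ha_init_req:851 td:310562882. "
    "New request from bc952f3e. Send GO_STANDBY to edge with serial:d6e1b93e",
    "2019-06-18T07:37:16.304 INFO [VCMP] vcmp_send_ha_go_standby_for_peer:3829 "
    "Send HA GO_STANDBY from GW for d6e1b93e to td:310562882 (bc952f3e)",
    "2019-06-18T07:37:16.304 INFO [VCMP] vcmp_send_ha_go_standby_for_peer:3829 "
    "Send HA GO_STANDBY from GW for d6e1b93e to td:310567019 (d6e1b93e)",
    "2019-06-18T07:37:16.304 MSG [VCMP] hamsg_send_evt_go_standby:175 "
    "Send HA EVENT on td 310567019 (d6e1b93e).",
]

events = parse_go_standby(SAMPLE)
for event in events:
    print(event["ts"], event["standby_serial"], event["td"], event["td_serial"])
```

Only the two send lines match; the request line and the hamsg line are ignored by design.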
It is important to notice that during normal HA operation, only the Active Edge communicates with the Gateway. Hence, an attempt by the Standby to reach the VCG is an indication that it has lost communication with its HA peer.
Now, the operation varies depending on the deployment type, Legacy or Enhanced HA.
Legacy HA.
Let's first recall that in this mode the WAN interfaces of both Edges physically share the same Layer 2 broadcast domain.

When the HA heartbeat link between the devices is lost, the Active sends a Layer 2 heartbeat on its WAN interfaces in an effort to find the Standby in that broadcast domain. When the Standby receives this packet, it takes it as an indication to keep its current state.
This is the log that will be reported on the Standby:
dbgctl -if | grep HAD
2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE3 up:1 sys_up:1
2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE4 up:1 sys_up:1
2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE5 up:1 sys_up:1
2019-06-18T07:44:42.229 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE3 up:1 sys_up:1
2019-06-18T07:44:42.229 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE4 up:1 sys_up:1
2019-06-18T07:44:42.229 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE5 up:1 sys_up:1
Note that three different WAN interfaces exist in this setup (GE3, GE4, GE5) and that all of them received the heartbeat message.
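To confirm at a glance that every WAN interface is receiving heartbeats, the HB-Received lines can be counted per interface. This is a minimal sketch that assumes the log format shown above; the function name heartbeats_per_interface is hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the ha_intf_hb_recv_packet lines shown in this article.
HB_RE = re.compile(
    r"ha_intf_hb_recv_packet:\d+ \[(?P<role>\w)\] Process HB-Received "
    r"ether type (?P<ethertype>0x[0-9a-fA-F]+) on (?P<intf>\S+)"
)


def heartbeats_per_interface(lines):
    """Count HB-Received log lines for each interface name."""
    counts = Counter()
    for line in lines:
        match = HB_RE.search(line)
        if match:
            counts[match.group("intf")] += 1
    return counts


# Sample input copied from the Standby log above.
SAMPLE = [
    "2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE3 up:1 sys_up:1",
    "2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE4 up:1 sys_up:1",
    "2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE5 up:1 sys_up:1",
    "2019-06-18T07:44:42.229 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE3 up:1 sys_up:1",
    "2019-06-18T07:44:42.229 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE4 up:1 sys_up:1",
    "2019-06-18T07:44:42.229 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE5 up:1 sys_up:1",
]

counts = heartbeats_per_interface(SAMPLE)
for intf in sorted(counts):
    print(intf, counts[intf])
```

An interface that is expected to carry heartbeats but is missing from the counts points at the Layer 2 segment for that WAN link.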
On the VCO, HA Peer State Unknown should be logged when this happens. At the Gateway level there are no changes.
If the Standby fails to receive such heartbeats, it transitions to Active. This covers both the case where the HA heartbeat link is lost because the Active Edge went down and the case where there is no healthy Layer 2 segment over which these WAN interfaces can exchange packets.
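The Standby's decision in Legacy HA, as described above, reduces to a small rule. The following is a minimal sketch with hypothetical function and parameter names; it takes plain booleans as input and does not model the timers or the number of missed heartbeats the real daemon may use.

```python
def standby_next_state(ge1_heartbeat_ok, wan_heartbeat_seen):
    """Return the role the Standby should hold in Legacy HA.

    ge1_heartbeat_ok: the HA heartbeat link (GE1) to the peer is working.
    wan_heartbeat_seen: a Layer 2 heartbeat from the Active arrived on a WAN interface.
    Both parameter names are hypothetical, chosen for illustration only.
    """
    if ge1_heartbeat_ok or wan_heartbeat_seen:
        # The Active is still known to be alive: keep the current state.
        return "STANDBY"
    # No sign of the Active on GE1 or on the shared WAN segment: take over.
    return "ACTIVE"


print(standby_next_state(ge1_heartbeat_ok=False, wan_heartbeat_seen=True))
print(standby_next_state(ge1_heartbeat_ok=False, wan_heartbeat_seen=False))
```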
Enhanced HA.
Because Enhanced HA allows active WAN connections on each member of the setup, Split Brain detection and recovery always involve the Gateway: there is no common Layer 2 segment between the Edges' WAN interfaces over which they could confirm states with each other.
When the cable goes down, the Standby informs the Gateway with an HA_REINIT and goes Active. The Gateway acts on it by sending GO_STANDBY to the current Active. This is how it shows:
root@vc-gateway-1:/# dbgctl -if | grep -e hamsg -e ACTIVE -e STAND
2019-06-18T08:03:49.689 INFO [VCMP] vcmp_send_ha_go_standby_for_peer:3829 Send HA GO_STANDBY from GW for bc952f3e to td:1487077158 (bc952f3e)
2019-06-18T08:03:49.689 MSG [VCMP] hamsg_send_evt_go_standby:175 Send HA EVENT on td 1487077158 (bc952f3e).
2019-06-18T08:03:49.689 INFO [VCMP] vcmp_send_ha_go_standby_for_peer:3829 Send HA GO_STANDBY from GW for bc952f3e to td:1544327714 (d6e1b93e)
On the former Active, this log will be seen:
2019-06-18T08:03:49.689 SEVERE [HAD] ha_go_standby_by_gw:1804 HA State Transition to STANDBY : Reason - Gateway instructed.
Whenever an Edge goes to Standby on that instruction, the following parameter is set to 1 in the output of debug.py --ha verp:
"standby_by_gw": 1
While an Edge in HA is in Standby, it keeps sending MP_INIT messages to the Gateway. The Gateway accepts these only when the Active no longer has a path to it, since the Active may then be down.
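The Gateway's arbitration in this section can be summarized as a toy model. This is a sketch for reasoning about the sequence, not Gateway code: the class GatewayModel and its methods are hypothetical, and it records only the logical GO_STANDBY decision rather than the per-td sends visible in the logs.

```python
class GatewayModel:
    """Toy model of the Gateway's view of one HA pair (hypothetical, for illustration)."""

    def __init__(self, active_serial, standby_serial):
        self.active = active_serial
        self.standby = standby_serial
        self.sent = []  # list of (message, destination serial)

    def handle_ha_reinit(self, requester):
        """An Edge reports it went Active; accept it and demote the old Active."""
        if requester not in (self.active, self.standby):
            raise ValueError("unknown serial: %s" % requester)
        if requester == self.active:
            return  # nothing to resolve, the requester already holds the role
        old_active = self.active
        self.active, self.standby = requester, old_active
        self.sent.append(("GO_STANDBY", old_active))

    def standby_mp_init_accepted(self, active_path_up):
        """MP_INIT from the Standby is accepted only when the Active has no path."""
        return not active_path_up


# Serials taken from the Enhanced HA log above: bc952f3e is told to go Standby.
gw = GatewayModel(active_serial="bc952f3e", standby_serial="d6e1b93e")
gw.handle_ha_reinit("d6e1b93e")
print(gw.active, gw.sent)
```

Keeping the sent messages in a list makes it easy to compare the model's decisions with the GO_STANDBY lines found in a real Gateway log.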
Troubleshooting Common Problems
After failover, Edge does not have the routes.
This can happen when the window between the Gateway and the active Edge is somehow out of sync. Check the path output on the Gateway to see whether it has the proper edge_id as active.
Both Edges are Active.
Check whether both Edges have a stable path to the primary Gateway. If so, check on the Gateway which Edge is considered active.
This can happen if one of the Edges is not able to establish a path to the primary VCG.
Heavy layer 2 broadcast traffic on the WAN interface.
A possible cause is the heartbeats that the Edges exchange in the Legacy HA method to identify the peer when GE1 goes down.
If needed, this mechanism can be temporarily disabled: edit /etc/config/edged and set wan_hb_enabled to 0 on both Edges.
Please consider the downsides of this workaround: without WAN heartbeats, the Standby cannot confirm over the WAN that the Active is still up when GE1 is lost, and, as described for Legacy HA, it will transition to Active.
No Edge takes Active role.
Check whether WAN interface heartbeats are received on incorrect interfaces. Run the command:
dbgctl -if | grep -e ha_process_hb_in_active -e ha_intf_hb_recv_packet
This will show the interface on which these heartbeats are received and processed.
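The command output can be compared against the interfaces where heartbeats are expected. The following is a minimal sketch: it matches only the ha_intf_hb_recv_packet line format shown earlier in this article (the format of ha_process_hb_in_active lines is not assumed), the function name is hypothetical, and the set of expected interfaces is an input that depends on the deployment.

```python
import re

# Captures the interface name from an ha_intf_hb_recv_packet log line.
INTF_RE = re.compile(r"ha_intf_hb_recv_packet:\d+ .*? on (\S+) up:")


def unexpected_hb_interfaces(lines, expected):
    """Return sorted interface names that received heartbeats but are not expected."""
    seen = set()
    for line in lines:
        match = INTF_RE.search(line)
        if match:
            seen.add(match.group(1))
    return sorted(seen - set(expected))


# Log lines in the format shown earlier; the expected set is an example only.
SAMPLE = [
    "2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE3 up:1 sys_up:1",
    "2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE4 up:1 sys_up:1",
    "2019-06-18T07:44:41.929 INFO [HAD] ha_intf_hb_recv_packet:110 [S] Process HB-Received ether type 0x9999 on GE5 up:1 sys_up:1",
]

result = unexpected_hb_interfaces(SAMPLE, expected={"GE3", "GE4"})
print(result)
```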
Edge shell prompt shows Active-Active but the HA state in the verp output is the expected Active-Standby.
The prompt in the Edge shell is derived from the /tmp/ha/localUiState file.
This is populated by the HA event sent from edged to mgd about the state transitions.
If the mgd HA thread is, for some reason, stuck and did not process the event from edged, the file never gets updated.
Another possibility is the 300s timer that mgd runs on a failover, after which it stops trying to communicate with the VCO; once the timer expires, it remains as standby.
If the mgd heartbeat is fine and the file is still not updated, you can try debug.py --ha mgd_update.
Management Daemon Standby is trying to establish/push session to VCO directly.
The Management Daemon might try to establish a session directly to the VCO when edged has not assigned it the STANDBY Edge role.
mgd keeps attempting for 300s in order to identify whether the configuration is the last known good settings or not. After the 300s, it stops.
If it does not, debug.py --ha mgd_update can be used to trigger a state event notification from edged to mgd.