Split brain condition after a Global Manager failover

Article ID: 324244

Products

VMware NSX

Issue/Introduction

Symptoms:

  •  The environment is running NSX-T 3.2.2
  •  Both Global Manager clusters are reported as Active
  •  UI may report sync status as Not Started
  •  NSX UI raises an alarm "GM To GM Split Brain"
  •  The following log entries are observed:
/var/log/async-replicator/sm.log

2022-12-15T22:41:55.979Z  WARN nsx-rpc:APH_provider:user-executor-2 sitemanager 3763 - [nsx@6876 comp="global-manager" level="WARNING" subcomp="async-replicator"] Split brain detected, raising alarm.

/var/log/syslog.log

2022-12-15T22:43:37.236Z hostname NSX 3142 MONITORING [nsx@6876 alarmId="########-####-####-####-########39fc" alarmState="OPEN" comp="global-manager" entId="########-####-####-####-########36d4" errorCode="MP701099" eventFeatureName="federation" eventSev="CRITICAL" eventState="On" eventType="gm_to_gm_split_brain" level="FATAL" nodeId="########-####-####-####-########1838" subcomp="monitoring"] Multiple Global Manager nodes are active: ########-####-####-####-########36d4,########-####-####-####-########935d. Only one Global Manager node must be active at any time.
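
To check quickly whether a Global Manager has logged these entries, the two files can be searched for the strings shown above. The following is a minimal sketch: the function names are hypothetical, and the search patterns are copied from the sample entries, so they may need adjusting if the wording differs in other versions.

```shell
#!/bin/sh
# Patterns copied from the sample log entries in this article.
SPLIT_BRAIN_PATTERN='Split brain detected|eventType="gm_to_gm_split_brain"'

# Print every split-brain indicator found in the given log files.
find_split_brain_events() {
    grep -E "$SPLIT_BRAIN_PATTERN" "$@"
}

# Read alarm lines on stdin and print the node IDs reported as active,
# one per line. Lines without the alarm message are ignored.
active_gm_ids() {
    sed -n 's/.*Multiple Global Manager nodes are active: \([^ ]*\)\. Only one.*/\1/p' |
        tr ',' '\n'
}
```

Example usage on a Global Manager node: `find_split_brain_events /var/log/async-replicator/sm.log /var/log/syslog.log | active_gm_ids`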



Environment

VMware NSX-T Data Center 3.x
VMware NSX 4.0.0.1
VMware NSX-T Data Center

Cause

A split-brain condition occurs when two Global Managers both believe they are active and have the same epoch. In this case, it is caused by a race condition in the handling of site configuration updates.

Resolution

This issue is resolved in NSX 3.2.3, available from the VMware Customer Connect portal.

Workaround:
The example below involves two Global Managers: GM Site 1 (site1) and GM Site 2 (site2).

First, determine the current state on both GMs.
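
This article does not say how to read the current state. One possible way is to list the Global Manager entries on each GM through the API. The sketch below rests on two assumptions that should be verified against the NSX API reference for the installed version: that the collection matching the resource path used in step 1 of this procedure supports GET, and that each returned entry reports its mode (ACTIVE or STANDBY). The function names and the `admin` user are hypothetical.

```shell
#!/bin/sh
# Build the URL of the Global Manager collection on a given GM.
# Assumption: derived from the resource path used in step 1 of this procedure.
gm_collection_url() {
    printf 'https://%s/global-manager/api/v1/global-infra/global-managers\n' "$1"
}

# List the Global Manager entries known to one GM.
# $1 = GM address, $2 = API user (curl prompts for the password).
# -k skips certificate verification, -s hides the progress meter.
show_gm_entries() {
    curl -sk -u "$2" "$(gm_collection_url "$1")"
}
```

Example usage: run `show_gm_entries site1 admin` and `show_gm_entries site2 admin`, then compare the two responses.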

In this example, we have verified that site1 should be ACTIVE, and the following procedure is used to reset the state of site2:

1) On the site2 GM, remove the extra site1 resource (it is not doing anything on site2):
DELETE https://site2/global-manager/api/v1/global-infra/global-managers/site1
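
The DELETE request above can be sent with any REST client. As a minimal sketch, it could be issued with curl as follows. The function names and the `admin` user are hypothetical, and `-k` (skip certificate verification) is an assumption that suits a GM with a self-signed certificate.

```shell
#!/bin/sh
# Build the URL of one Global Manager resource as seen from a given GM.
# $1 = GM to send the request to, $2 = ID of the resource to address.
gm_resource_url() {
    printf 'https://%s/global-manager/api/v1/global-infra/global-managers/%s\n' "$1" "$2"
}

# Send the DELETE request from step 1.
# $1 = target GM, $2 = resource ID, $3 = API user (curl prompts for the password).
delete_gm_resource() {
    curl -k -u "$3" -X DELETE "$(gm_resource_url "$1" "$2")"
}
```

Example usage for this article's scenario: `delete_gm_resource site2 site1 admin`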


2) Change site2 from ACTIVE to STANDBY using the internal API. Do NOT change any field names; the request is intentionally sent exactly as shown.
SSH as the root user to the site2 GM (this API is internal and must be run directly on the GM):
curl -X POST -ik http://localhost:7441/api/v1/sites?action=set_global_manager -H "Content-Type: application/json" -d '{"status":"STANDBY","force":false,"federation_id":"","gm_name":""}'

If this does not work, retry the request with the force option set to true:
curl -X POST -ik http://localhost:7441/api/v1/sites?action=set_global_manager -H "Content-Type: application/json" -d '{"status":"STANDBY","force":true,"federation_id":"","gm_name":""}'
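
The two requests differ only in the force value, so the try-then-force sequence can be sketched as a small wrapper run as root on the site2 GM. The function names are hypothetical. Using curl's `--fail` option to decide that the first attempt "did not work" is an assumption, because this article does not define what a failed response looks like. Review the first response manually before forcing if there is any doubt.

```shell
#!/bin/sh
# Build the request body from step 2 with the given force value (true or false).
# Field names and empty values are kept exactly as in the article.
standby_payload() {
    printf '{"status":"STANDBY","force":%s,"federation_id":"","gm_name":""}\n' "$1"
}

# Send the internal API request once.
# --fail makes curl return non-zero on an HTTP error status
# (an assumption about how failure is detected).
set_standby_once() {
    curl --fail -X POST -ik 'http://localhost:7441/api/v1/sites?action=set_global_manager' \
        -H 'Content-Type: application/json' -d "$(standby_payload "$1")"
}

# Try without force first; retry with force only if the first attempt fails.
set_standby() {
    set_standby_once false || set_standby_once true
}
```

Example usage: `set_standby`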


3) On the active site (site1), use the UI to onboard the site2 GM as STANDBY.