Split brain condition after a Global Manager failover

Article ID: 324244

Products

VMware NSX

Issue/Introduction

  •  You are running NSX-T Data Center version 3.2.2
  •  Both Global Manager clusters report as Active
  •  The NSX-T UI may report the sync status as Not Started
  •  The NSX-T alarm "GM To GM Split Brain" is observed
  •  Entries similar to the following may appear in these log files:
/var/log/async-replicator/sm.log
 
2022-12-15T22:41:55.979Z  WARN nsx-rpc:APH_provider:user-executor-2 sitemanager 3763 - [nsx@6876 comp="global-manager" level="WARNING" subcomp="async-replicator"] Split brain detected, raising alarm.

/var/log/syslog.log

2022-12-15T22:43:37.236Z hostname NSX 3142 MONITORING [nsx@6876 alarmId="########-####-####-####-########39fc" alarmState="OPEN" comp="global-manager" entId="########-####-####-####-########36d4" errorCode="MP701099" eventFeatureName="federation" eventSev="CRITICAL" eventState="On" eventType="gm_to_gm_split_brain" level="FATAL" nodeId="########-####-####-####-########1838" subcomp="monitoring"] Multiple Global Manager nodes are active: ########-####-####-####-########36d4,########-####-####-####-########935d. Only one Global Manager node must be active at any time.
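
The log check above can be scripted. The following is a minimal shell sketch, assuming it is run on a Global Manager node where the log paths listed above are readable. The helper name has_split_brain is hypothetical (introduced here for illustration, not an NSX command), and it only searches for the two message fragments quoted in the sample entries.

```shell
#!/bin/sh
# has_split_brain: hypothetical helper, not part of NSX.
# Succeeds (exit 0) if the given file contains either split-brain
# signature quoted in the sample log entries above.
has_split_brain() {
    # -F: match fixed strings, -q: no output, -e: one pattern per flag
    grep -Fq -e 'Split brain detected' \
             -e 'eventType="gm_to_gm_split_brain"' "$1"
}

# Check the two log files named in this article; unreadable files are skipped.
for f in /var/log/async-replicator/sm.log /var/log/syslog.log; do
    if [ -r "$f" ] && has_split_brain "$f"; then
        echo "split-brain signature found in $f"
    fi
done
```

A match only confirms that the alarm was raised at some point; the timestamps in the matching lines still need to be reviewed.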



Environment

VMware NSX
VMware NSX-T Data Center

Cause

A split-brain condition occurs when two Global Managers each believe they are active and have the same epoch. In this case, it is caused by a race condition in the handling of site configuration updates.

Resolution

This issue is resolved in VMware NSX 3.2.3, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.



Workaround:

Ensure you have backups of all Global Manager and Local Manager clusters before proceeding.

Verify which Global Manager (GM) is actually active by running the following API call on the Local Managers.

GET https://<local-Manager-ip>/api/v1/sites

The output should be similar to the following, where the actual active Global Manager is <global-manager-name-1>:

  "sites" : [ {
    "name" : "<global-manager-name-1>",
    "site_version" : "########",
    "id" : "########-####-####-####-########74b0",
    "is_federated" : false,
    "is_local" : false,
    "system_id" : 0,
    "active_gm" : "ACTIVE",
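
To pick the active Global Manager out of a longer response, the output can be filtered. The following is a minimal shell sketch, assuming the response is pretty-printed with one field per line and that "name" precedes "active_gm" within each site entry, as in the excerpt above. The helper name active_gm_names is hypothetical, and this is line-based text matching rather than real JSON parsing.

```shell
#!/bin/sh
# active_gm_names: hypothetical helper, not part of NSX.
# Reads pretty-printed /api/v1/sites output on stdin and prints the "name"
# of every site entry whose "active_gm" field is "ACTIVE".
# Assumes one field per line and that "name" appears before "active_gm".
active_gm_names() {
    awk -F'"' '
        $2 == "name"                        { name = $4 }   # remember the latest site name
        $2 == "active_gm" && $4 == "ACTIVE" { print name }  # report it when marked ACTIVE
    '
}
```

For example, the response from the GET call above could be saved to a file (the name sites.json is only an illustration) and filtered with `active_gm_names < sites.json`. If more than one name is printed, the Local Manager itself sees multiple active Global Managers.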

 


In this example, we have verified that global-manager-name-1 should be ACTIVE. Use the following procedure to reset the state of global-manager-name-2:

1) Remove the global-manager-name-1 resource, which is not doing anything, from site 2. On the global-manager-name-2 Global Manager, run the following API:
DELETE https://global-manager-name-2/global-manager/api/v1/global-infra/global-managers/global-manager-name-1


2) To change global-manager-name-2 from ACTIVE to STANDBY, use the internal API. Do NOT change any field names; run the command exactly as shown below.

SSH as the root user to the global-manager-name-2 Global Manager.

This API is internal and must be run directly on the GM, as is:
curl -X POST -ik http://localhost:7441/api/v1/sites?action=set_global_manager -H "Content-Type: application/json" -d '{"status":"STANDBY","force":false,"federation_id":"","gm_name":""}'

If this does not work, try the force option:
curl -X POST -ik http://localhost:7441/api/v1/sites?action=set_global_manager -H "Content-Type: application/json" -d '{"status":"STANDBY","force":true,"federation_id":"","gm_name":""}'


3) On global-manager-name-1 (the active site), use the NSX UI to onboard the global-manager-name-2 Global Manager as STANDBY.