Service router failover alarm not cleared after Edge split brain recovery.

Article ID: 317803

Updated On:

Products

VMware NSX

Issue/Introduction


This article describes a scenario in which a service router (SR) failover alarm is incorrectly raised and never cleared.

Symptoms:
An alarm for SR failover is raised and never cleared.

Environment

VMware NSX-T Data Center

Cause

In an active-standby service router pair, a split brain occurs when the heartbeat between the two routers is lost and the standby becomes active. When the heartbeat resumes, the original standby (now active) returns to standby (healing), and two events occur: 1) it detects that the peer is already active, and 2) it detects that it is itself moving to standby.

An alarm-clear trigger clears the alarm if the peer changes to active. This trigger is checked in event 1), but the peer has been active for the entire duration of the split brain, so there is no state change on the peer and the trigger is skipped. In event 2), the logic raises an alarm simply because the router itself leaves active for standby. After that, no further trigger exists to clear this false alarm.
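
The sequence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual NSX implementation: the class AlarmModel, its methods, and the state strings are hypothetical names chosen for illustration. It models a clear trigger that fires only on a peer state change and a raise trigger that fires whenever the router leaves active for standby.

```python
from dataclasses import dataclass


@dataclass
class AlarmModel:
    """Toy model (hypothetical) of the failover alarm logic on one service router."""
    self_state: str
    peer_state: str
    alarm_raised: bool = False

    def on_peer_state(self, observed: str) -> None:
        # Clear trigger: fires only when the peer *changes* to active.
        changed = observed != self.peer_state
        self.peer_state = observed
        if changed and observed == "active":
            self.alarm_raised = False

    def on_self_state(self, new_state: str) -> None:
        # Raise trigger: fires whenever this router leaves active for standby.
        leaving_active = self.self_state == "active" and new_state == "standby"
        self.self_state = new_state
        if leaving_active:
            self.alarm_raised = True


def simulate_split_brain_recovery() -> bool:
    # During split brain both routers are active; the peer has been active throughout.
    sr = AlarmModel(self_state="active", peer_state="active")
    sr.on_peer_state("active")   # event 1: peer already active, no state change, clear skipped
    sr.on_self_state("standby")  # event 2: self leaves active, alarm raised
    return sr.alarm_raised       # nothing remains that could clear the alarm
```

In this model, simulate_split_brain_recovery() leaves the alarm raised, because the only clear condition depends on a peer transition that never happens during healing.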

Resolution

This issue is resolved in VMware NSX-T Data Center 3.2.0.1, available at VMware Downloads.

If an upgrade is not possible, restart the local-controller service on the standby Edge using the following Edge CLI command:

restart service local-controller