Article ID: 317803
Issue/Introduction
This article describes the scenario in which a service-router (SR) failover alarm is incorrectly raised and never cleared.
Symptoms:
An alarm for SR failover is raised and never cleared
Cause
In an active-standby service-router pair, a split brain occurs when the heartbeat between the two routers is lost and the standby becomes active. When the heartbeat resumes, the original standby (now active) returns to standby (healing), and two events occur on it: 1) it detects that the peer is already active, and 2) it detects that it is itself going to standby.
An alarm-clear trigger clears the alarm if the peer changes to active. This trigger is checked while processing event 1), but the peer has been active for the entire duration of the split brain, so there is no state change on the peer and the trigger is skipped. When event 2) is processed, the logic unconditionally raises an alarm because it sees the router leave active for standby. After that, no further trigger exists to clear this false alarm.
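The event sequence above can be illustrated with a small toy model. This is only a sketch of the logic as described in this article, not the actual NSX implementation: the class ServiceRouterModel, its methods, and the function simulate_healing are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

ACTIVE = "active"
STANDBY = "standby"


@dataclass
class ServiceRouterModel:
    """Hypothetical toy model of one service-router's failover alarm logic."""
    self_state: str
    peer_state: str
    alarm_raised: bool = False

    def on_peer_state(self, new_peer_state: str) -> None:
        # Clear trigger: fires only when the peer state *changes* to active.
        changed = new_peer_state != self.peer_state
        self.peer_state = new_peer_state
        if changed and new_peer_state == ACTIVE:
            self.alarm_raised = False

    def on_self_state(self, new_self_state: str) -> None:
        # Raise trigger: leaving active for standby raises unconditionally.
        if self.self_state == ACTIVE and new_self_state == STANDBY:
            self.alarm_raised = True
        self.self_state = new_self_state


def simulate_healing() -> bool:
    """Replay the two healing events on the original standby (now active)."""
    # During split brain this router is active and has seen its peer as
    # active the whole time.
    sr = ServiceRouterModel(self_state=ACTIVE, peer_state=ACTIVE)
    sr.on_peer_state(ACTIVE)    # event 1: no peer state change, clear skipped
    sr.on_self_state(STANDBY)   # event 2: alarm raised
    return sr.alarm_raised      # nothing later clears it


if __name__ == "__main__":
    print("alarm still raised after healing:", simulate_healing())
```

In this model the alarm is left raised after healing because the clear check runs before the raise and depends on a peer state transition that never happens during a split brain.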
Resolution
This issue is resolved in VMware NSX-T Data Center 3.2.0.1, available at VMware Downloads.
If an upgrade is not possible, restart the local-controller service on the standby edge using the edge CLI command:
restart service local-controller