BGP Down alarm is not raised in NSX dashboard for BGP flap event

Article ID: 425005

Products

VMware NSX

Issue/Introduction

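A BGP flap event occurred in the environment. On the Edge node, the Tier-0 service router shows both neighbors in the Estab state with an Up/DownTime of only 50 seconds, which indicates that the sessions were re-established recently: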

nsx-t-edge(tier0_sr)> get bgp neighbor summary
Neighbor                      AS          State Up/DownTime  BFD InMsgs  OutMsgs  InPfx  OutPfx
#.#.#.#                       64521       Estab  00:00:50     UP  28741141 28741224 3      6
#.#.#.#                       64521       Estab  00:00:50     UP  28740821 28741028 3      6

 

NSX Manager does not raise an alarm for the BGP flap event.

The Edge syslog shows that the Edge node issued a 'Context Report' message, which serves as a notification of the state change rather than as a critical service failure alarm.

YYYY-MM-DD-HH-mm-ss Edge.local NSX 7 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP #.#.#.#, peer_uuid: #####-####-####-####-####in SR: 455#####-####-####-####-####def0, state=BGP_DOWN

YYYY-MM-DD-HH-mm-ss Edge.local NSX 7 FABRIC [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="routing-service-realization" level="INFO"] Alarm for BGP #.#.#.#, peer_uuid: #####-####-####-####-####in SR: 455#####-####-####-####-####def0, state=BGP_UP
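
To confirm a flap from an exported Edge syslog, the state transitions can be pulled out with a short script. The following is a minimal sketch, not an NSX tool: the function name bgp_state_changes is hypothetical, the regular expression is modeled only on the two sample lines above (the real field layout may differ between releases), and 192.0.2.1 is a placeholder peer address.

```python
import re

# Assumption: pattern derived from the two sample log lines in this article.
ALARM_RE = re.compile(
    r"Alarm for BGP (?P<peer>[^,\s]+), peer_uuid: .*?state=(?P<state>BGP_(?:DOWN|UP))"
)


def bgp_state_changes(lines):
    """Return (peer, state) tuples for each rcpm BGP state message, in order."""
    changes = []
    for line in lines:
        # Only consider messages from the rcpm subcomponent.
        if 'subcomp="rcpm"' not in line:
            continue
        match = ALARM_RE.search(line)
        if match:
            changes.append((match.group("peer"), match.group("state")))
    return changes


# Placeholder lines shaped like the samples above (values are made up).
sample = [
    '[nsx@6876 comp="nsx-edge" subcomp="rcpm" level="INFO"] '
    "Alarm for BGP 192.0.2.1, peer_uuid: 1111 in SR: 2222, state=BGP_DOWN",
    '[nsx@6876 comp="nsx-edge" subcomp="rcpm" level="INFO"] '
    "Alarm for BGP 192.0.2.1, peer_uuid: 1111 in SR: 2222, state=BGP_UP",
]
print(bgp_state_changes(sample))
# [('192.0.2.1', 'BGP_DOWN'), ('192.0.2.1', 'BGP_UP')]
```

A BGP_DOWN entry followed shortly by a BGP_UP entry for the same peer is the flap pattern described in this article.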

Environment

VMware NSX

VMware NSX-T Data Center

Cause

Because BGP reconverged in under 60 seconds after this flap, a formal system alarm was not generated.
The sampling interval for the BGP alarm is 60 seconds. If the status changes from UP -> DOWN -> UP in less than 60 seconds, the alarm may not be raised. The Alarm Framework treats this as a flipping case.

Resolution

By design, NSX uses 'Context Reports' to report a flipping status without flooding the dashboard with frequent alarm notifications. A formal system alarm is guaranteed only when the BGP down time exceeds the 60-second sampling interval. A shorter outage raises an alarm only if the peer is still down at the moment of a sampling check.

Additional Information

The sampling interval is 60 seconds, so an alarm can be reported (raised or resolved) in NSX Manager roughly every 60 seconds.
 
About the 60-second interval:
If the BGP down time is < 60 sec, the alarm may or may not be raised.
If the BGP down time is > 60 sec, the alarm will definitely be raised.
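
The two rules above follow from simple interval arithmetic. The sketch below is a simplified model, not NSX code: it assumes sampling checks occur at exactly first_check + k * 60 seconds and that a check sees the peer as down whenever it falls inside the outage window. The function name outage_is_sampled and its parameters are hypothetical.

```python
import math

SAMPLING_INTERVAL = 60  # seconds, as stated in this article


def outage_is_sampled(down_at, up_at, first_check=0, interval=SAMPLING_INTERVAL):
    """Return True if a sampling check falls inside the outage [down_at, up_at).

    Checks are assumed to occur at first_check + k * interval for k = 0, 1, 2, ...
    """
    # Index of the first check at or after the moment BGP went down.
    k = max(0, math.ceil((down_at - first_check) / interval))
    next_check = first_check + k * interval
    # The alarm is raised only if BGP is still down at that check.
    return next_check < up_at


print(outage_is_sampled(15, 20, first_check=10))  # False: 5 s outage between checks
print(outage_is_sampled(55, 65))                  # True: 10 s outage spans the 60 s check
print(outage_is_sampled(100, 161))                # True: 61 s outage always spans a check
```

In this model a 10-second outage is caught or missed depending only on where it falls relative to the checks, while any outage longer than 60 seconds must overlap at least one check.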
 
 
 
Scenario 1: Alarm is Raised
BGP goes Down.
At the sampling interval check, BGP is still Down.
Manager raises the alarm.
 
If the issue is present during a sampling check, the alarm is generated.
 
 
Scenario 2: No Alarm Raised
BGP goes Down briefly (less than 60 seconds).
BGP recovers before the next sampling check.
At the sampling interval checks, the status is Healthy.
Manager does not raise an alarm.
 
If the issue is fixed before the sampling check, the alarm is not raised.
 
 
Example in which the alarm is not raised:
In the example below, the sampling interval is also 60 seconds, with sampling checks occurring at 10s and 70s:
 
0s Edge reported "alarm resolved condition" (healthy status)
10s sampling interval reached, no alarm reported. <---Sampling check
{
   15s Edge reported "alarm raised condition"
   20s Edge reported "alarm resolved condition" (healthy status)
}
70s sampling interval reached, no alarm reported because the latest status is still healthy. <---Sampling check
 
In this scenario, even though BGP was down for 5 seconds, the alarm is not raised.
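
The timeline above can be replayed with a small script. This is an illustrative sketch under the assumption that each sampling check looks only at the latest status reported by the Edge. The function name replay and the event format are hypothetical and do not reflect the actual Alarm Framework implementation.

```python
def replay(events, checks):
    """Return (check_time, alarm_raised) for each sampling check.

    events: list of (time_s, is_down) status reports from the Edge.
    checks: list of sampling check times in seconds.
    """
    ordered = sorted(events)
    results = []
    for check in sorted(checks):
        is_down = False  # assume healthy until a report says otherwise
        for reported_at, state in ordered:
            if reported_at <= check:
                is_down = state  # keep the latest report seen so far
        results.append((check, is_down))
    return results


# Timeline from the example: healthy at 0s, down at 15s, healthy again at 20s.
events = [(0, False), (15, True), (20, False)]
print(replay(events, [10, 70]))  # [(10, False), (70, False)]
```

Had the recovery been reported at 75s instead of 20s, the 70s check would see the peer as down and the alarm would be raised.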