Missing Alarms for Transient Edge Failover or BGP Flap Events

Article ID: 425005

Updated On:

Products

VMware NSX

Issue/Introduction

NSX-T Edge node events, including failovers and Border Gateway Protocol (BGP) flaps, may not trigger visible alarms in the NSX Manager dashboard. Although these events are recorded in the Edge syslogs, the NSX Manager UI may show no logs or alarms related to them. Typical symptoms:

  • Edge VMs fail over to another node in the same cluster, but the event is missing from the NSX Manager alarm log
  • BGP neighbor status transitions from Established to Down and back to Established within a short duration
  • Edge syslogs contain the following status reports, but the NSX Manager dashboard remains empty:
    • Alarm for BGP [IP_ADDRESS], peer_uuid: [UUID] ... state=BGP_DOWN
    • Alarm for BGP [IP_ADDRESS], peer_uuid: [UUID] ... state=BGP_UP

Environment

VMware NSX

VMware NSX-T Data Center

Cause

The NSX Alarm Framework uses a sampling interval of approximately 60 seconds for BGP and Edge health monitoring.

  • Sampling Interval: If a status change (e.g., UP -> DOWN -> UP) occurs and resolves within a single 60-second sampling window, the manager does not raise a formal alarm because the status appears healthy during the next scheduled polling check.
  • Context Reports: NSX uses 'Context Reports' to track rapid status transitions (flapping) internally without flooding the dashboard UI with frequent, short-lived notifications.
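
The effect of the sampling interval can be illustrated with a small Python sketch. This is a simplified model for illustration only, not the NSX Alarm Framework's actual implementation: it assumes polls at fixed 60-second instants starting at t=0, and the function names and event lists are hypothetical.

```python
from bisect import bisect_right

SAMPLING_INTERVAL_S = 60  # approximate interval stated in this article


def state_at(transitions, t, initial="UP"):
    """Return the state in effect at time t, given sorted (time, state) transitions."""
    times = [ts for ts, _ in transitions]
    idx = bisect_right(times, t)
    return initial if idx == 0 else transitions[idx - 1][1]


def alarm_raised(transitions, horizon_s, interval_s=SAMPLING_INTERVAL_S):
    """Return True if any periodic poll in [0, horizon_s] observes DOWN."""
    polls = range(0, horizon_s + 1, interval_s)
    return any(state_at(transitions, t) == "DOWN" for t in polls)


# Hypothetical 20-second flap that falls between the polls at t=60 and t=120.
short_flap = [(70, "DOWN"), (90, "UP")]
# Hypothetical 90-second outage that is still in progress at the t=120 poll.
long_outage = [(70, "DOWN"), (160, "UP")]

print(alarm_raised(short_flap, horizon_s=300))   # False
print(alarm_raised(long_outage, horizon_s=300))  # True
```

In this model, only the state observed at a poll instant matters, which is why a flap that starts and ends between two polls leaves no trace in the alarm list.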

Resolution

This is a condition that may occur in a VMware NSX environment.

This behavior is by design and prevents alarm fatigue and dashboard flooding. To investigate transient events that do not appear in the UI, correlate the timestamps of the BGP_DOWN and BGP_UP entries in the Edge syslogs with external infrastructure events such as maintenance windows.
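
As a starting point for that correlation, the following Python sketch pairs BGP_DOWN and BGP_UP syslog entries per peer and reports flaps shorter than the sampling interval. It matches only the message fragments quoted in this article. The assumption that each line carries an ISO 8601 timestamp, the function name, and the sample lines are all hypothetical and should be adjusted to the actual log format.

```python
import re
from datetime import datetime

# Assumption: each line contains an ISO 8601 timestamp; fraction and zone are ignored.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
# Matches the "Alarm for BGP <peer>, ... state=BGP_DOWN|BGP_UP" fragments.
BGP_RE = re.compile(r"Alarm for BGP (?P<peer>[^,\s]+),.*?state=BGP_(?P<state>DOWN|UP)")


def find_short_flaps(lines, threshold_s=60):
    """Yield (peer, down_time, up_time, seconds) for DOWN->UP pairs under threshold_s."""
    down_since = {}
    for line in lines:
        ts, ev = TS_RE.search(line), BGP_RE.search(line)
        if not (ts and ev):
            continue  # not a BGP alarm line, or no timestamp found
        when = datetime.strptime(ts.group(1), "%Y-%m-%dT%H:%M:%S")
        peer = ev.group("peer")
        if ev.group("state") == "DOWN":
            down_since.setdefault(peer, when)  # keep the first DOWN of a run
        elif peer in down_since:
            started = down_since.pop(peer)
            seconds = (when - started).total_seconds()
            if seconds < threshold_s:
                yield peer, started, when, seconds


# Synthetic lines shaped after the fragments in this article (not real NSX output).
sample = [
    "2024-01-01T10:00:05Z Alarm for BGP 192.0.2.1, peer_uuid: example-1 state=BGP_DOWN",
    "2024-01-01T10:00:25Z Alarm for BGP 192.0.2.1, peer_uuid: example-1 state=BGP_UP",
    "2024-01-01T10:01:00Z Alarm for BGP 198.51.100.1, peer_uuid: example-2 state=BGP_DOWN",
    "2024-01-01T10:03:00Z Alarm for BGP 198.51.100.1, peer_uuid: example-2 state=BGP_UP",
]

for peer, down, up, seconds in find_short_flaps(sample):
    print(f"{peer}: down {down} -> up {up} ({seconds:.0f}s)")
```

The `lines` argument can be any iterable of strings, such as an open log file. The reported down and up times can then be compared against change records or maintenance windows for the upstream network.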

Operational Scenarios:

  • Alarm Raised: The Edge service goes Down and remains Down through the next sampling interval check. The Manager raises the alarm in the UI.
  • No Alarm Raised: The Edge service goes Down briefly (for less than 60 seconds) and recovers before the next sampling check. The status is recorded as Healthy, and no UI alarm is generated.