No alarm raised for Tier-0 or Tier-1 gateway failover.

Products

VMware NSX

Issue/Introduction

Following a vMotion or any failover of an NSX Edge node, a network flap may occur, causing the active Tier-0 and Tier-1 gateways to fail over.

While reviewing the system state, you observe that no alarms were raised by the NSX Manager for this failover event.

In /var/log/syslog on the Edge node, you observe tunnel instability: tunnels go down and the gateways are marked as unreachable (NodeDown).

<date> edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########843f event NodeDown [Active,Unreachable] reason 'Tunnels Down'
<date> edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########004d event NodeDown [Active,Unreachable] reason 'Tunnels Down'
This is immediately followed by "context report" messages indicating a T0/T1 gateway failover.

<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########aedc" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########aedc","service_router_id":"########-####-####-####-########843f","failover_reason":"Tunnels Down"}
<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########63ac" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########63ac","service_router_id":"########-####-####-####-########004d","failover_reason":"Tunnels Down"}
The tunnels recover within a very short duration, and the gateways on the previously active edge transition back to NodeUp.

<date> edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########843f event NodeUp [Down,Unknown] reason 'Tunnels Up'
<date> edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########004d event NodeUp [Sync,Unknown] reason 'Tunnels Up'
Despite the confirmed failover in the logs, the NSX Manager UI shows no active or historical alarms for "Tier-0 Gateway Failover" or "Tier-1 Gateway Failover."
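
Because the Manager UI shows nothing, the Edge syslog is the only place in this scenario where the failover is visible. The following is a minimal sketch, not a supported tool, of how the "Context report" lines could be pulled out of a copy of the Edge syslog. The function name parse_failover_events and the returned dictionary keys are hypothetical, and the regular expression assumes the field order shown in the excerpts above.

```python
import json
import re

# Assumption: fields appear in the same order as in the excerpts above
# (entId ... eventState ... eventType ... "Context report: {json}").
EVENT_RE = re.compile(
    r'entId="(?P<ent_id>[^"]+)".*?'
    r'eventState="(?P<state>On|Off)".*?'
    r'eventType="(?P<event_type>tier[01]_gateway_failover)"\]\s*'
    r'Context report:\s*(?P<context>\{.*\})'
)


def parse_failover_events(lines):
    """Return one dict per Tier-0/Tier-1 gateway failover report line."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not a gateway failover report
        try:
            context = json.loads(match.group("context"))
        except json.JSONDecodeError:
            context = {}  # keep the event even if the JSON payload is damaged
        events.append({
            # The first whitespace-delimited token is kept as an opaque string.
            "timestamp": line.split(" ", 1)[0],
            "entity_id": match.group("ent_id"),
            "event_type": match.group("event_type"),
            "event_state": match.group("state"),
            "failover_reason": context.get("failover_reason"),
        })
    return events


# One line copied from the excerpt above, with its masking left in place.
SAMPLE = (
    '<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" '
    'subcomp="nsx-edge-agent" s2comp="nsx-monitoring" '
    'entId="########-####-####-####-########aedc" tid="1" level="ERROR" '
    'eventState="On" eventFeatureName="high_availability" eventSev="error" '
    'eventType="tier1_gateway_failover"] Context report: '
    '{"previous_gateway_state":"Active","current_gateway_state":"Down",'
    '"entity_id":"########-####-####-####-########aedc",'
    '"service_router_id":"########-####-####-####-########843f",'
    '"failover_reason":"Tunnels Down"}'
)

for event in parse_failover_events([SAMPLE, "unrelated line"]):
    print(event["event_type"], event["event_state"], event["failover_reason"])
```

To scan a real file, pass an open file object instead of the sample list, for example parse_failover_events(open("syslog")) against a copy of the Edge log.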

Environment

VMware NSX

Cause

The root cause is the timing discrepancy between the duration of the transient failover event and the alarm collection sampling interval.

The NSX alarm framework uses a polling mechanism to check the state of specific events. For gateway failover events (tier0_gateway_failover and tier1_gateway_failover), the standard sampling_interval is 60 seconds.

In this scenario:

The network flap caused by vMotion is transient. The gateway enters a failed state (eventState="On") but recovers almost immediately (eventState="Off").

<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########aedc" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########aedc","service_router_id":"########-####-####-####-########843f","failover_reason":"Tunnels Down"}
<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########63ac" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########63ac","service_router_id":"########-####-####-####-########004d","failover_reason":"Tunnels Down"}
[....]
[....]
<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########aedc" tid="1" level="ERROR" eventState="Off" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Down","current_gateway_state":"Active","entity_id":"########-####-####-####-########aedc","service_router_id":"########-####-####-####-########843f","failover_reason":"Remote state changed to Active"}
<date> edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########63ac" tid="1" level="ERROR" eventState="Off" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Down","current_gateway_state":"Active","entity_id":"########-####-####-####-########63ac","service_router_id":"########-####-####-####-########004d","failover_reason":"Remote state changed to Active"}
For example, assume the event above occurred and recovered within only 4 seconds.
If the NSX Manager alarm collector runs its check outside this 4-second window, it reads the event state as Off.

Because the alarm framework is currently not designed to latch onto or aggregate high-frequency "flapping" events that resolve faster than the polling cycle, the alarm is not triggered.
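
The timing argument above can be illustrated with a simplified model. This sketch assumes a collector that samples the event state once every 60 seconds at a fixed offset and raises an alarm only if a sample lands while the state is On. It is not the actual NSX implementation; the function names and the whole-second granularity are assumptions made for illustration.

```python
SAMPLING_INTERVAL_S = 60  # sampling_interval quoted in this article


def event_observed(event_start_s, event_duration_s, poll_phase_s,
                   interval_s=SAMPLING_INTERVAL_S):
    """True if a periodic poll lands while the event is still On.

    Polls are modelled at poll_phase_s + k * interval_s for every integer k.
    """
    # Seconds from the start of the event until the next poll
    # (0 when a poll coincides with the start).
    wait_s = (poll_phase_s - event_start_s) % interval_s
    return wait_s < event_duration_s


def detection_fraction(event_duration_s, interval_s=SAMPLING_INTERVAL_S):
    """Fraction of whole-second poll offsets that would catch the event."""
    hits = sum(
        event_observed(100, event_duration_s, phase, interval_s)
        for phase in range(interval_s)
    )
    return hits / interval_s


print(detection_fraction(4))   # the 4-second flap from the example
print(detection_fraction(60))  # an event lasting a full interval
```

In this model, a 4-second flap is caught by only 4 of the 60 possible poll offsets, roughly 7%, while any event that lasts a full sampling interval is always caught.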

Resolution

This behavior is a known limitation of the current alarm framework regarding transient states caused by rapid network flapping.
The Engineering team is aware of this limitation and may implement more aggressive alarm collection in future releases.
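
Since the Edge syslog still records both the On and the Off report for each gateway, the length of a failover can be measured from the log and compared with the 60-second sampling interval. The sketch below is illustrative only. It assumes the syslog timestamps are ISO 8601 strings, and the helper names and the sample timestamps and entity name are hypothetical.

```python
from datetime import datetime

SAMPLING_INTERVAL_S = 60  # sampling_interval quoted in this article


def parse_ts(timestamp):
    # Assumption: ISO 8601 timestamps; a trailing "Z" is rewritten so that
    # datetime.fromisoformat also accepts it on older Python versions.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def failover_windows(events):
    """Pair On/Off reports per gateway and return (entity_id, seconds).

    events: iterable of (timestamp, entity_id, event_state) in log order.
    """
    open_since = {}
    windows = []
    for timestamp, entity_id, state in events:
        when = parse_ts(timestamp)
        if state == "On":
            # Keep the earliest On if it is reported more than once.
            open_since.setdefault(entity_id, when)
        elif state == "Off" and entity_id in open_since:
            started = open_since.pop(entity_id)
            windows.append((entity_id, (when - started).total_seconds()))
    return windows


def shorter_than_interval(windows, interval_s=SAMPLING_INTERVAL_S):
    """Windows short enough to fall between two alarm samples."""
    return [(entity, secs) for entity, secs in windows if secs < interval_s]


# Hypothetical sample values that mirror the 4-second example above.
sample_events = [
    ("2024-01-01T10:00:00.000Z", "gateway-a", "On"),
    ("2024-01-01T10:00:04.000Z", "gateway-a", "Off"),
]
for entity, seconds in shorter_than_interval(failover_windows(sample_events)):
    print(f"{entity}: On for {seconds:.0f}s, shorter than the sampling interval")
```

A window shorter than the sampling interval is consistent with the missing alarm described in this article. A longer window with no alarm would point to a different cause.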