No alarm raised for Tier-0 or Tier-1 gateway failover after Edge vMotion.

Article ID: 424813

Updated On:

Products

VMware NSX

Issue/Introduction

Following the vMotion of an NSX Edge node, a network flap may occur, causing the active Tier-0 and Tier-1 gateways to fail over.

While reviewing the system state, you observe that no alarms were raised by the NSX Manager for this failover event.

  • The Edge /var/log/syslog shows tunnel instability: the tunnels go down and the gateways are marked as unreachable (NodeDown).

    2025-12-16T11:26:22.276Z edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########843f event NodeDown [Active,Unreachable] reason 'Tunnels Down'
    2025-12-16T11:26:22.276Z edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########004d event NodeDown [Active,Unreachable] reason 'Tunnels Down'

  • This is immediately followed by "context report" messages indicating a T0/T1 gateway failover.

    2025-12-16T11:26:22.276Z edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########aedc" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########aedc","service_router_id":"########-####-####-####-########843f","failover_reason":"Tunnels Down"}
    2025-12-16T11:26:22.277Z edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########63ac" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########63ac","service_router_id":"########-####-####-####-########004d","failover_reason":"Tunnels Down"}

  • The tunnels recover almost immediately (approximately 500 ms later in this excerpt), and the gateways on the previously active Edge transition back to NodeUp.

    2025-12-16T11:26:22.734Z edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########843f event NodeUp [Down,Unknown] reason 'Tunnels Up'
    2025-12-16T11:26:22.734Z edge_hostname NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="svcrt-fsm" level="INFO" org="default" proj="####"] ########-####-####-####-########004d event NodeUp [Sync,Unknown] reason 'Tunnels Up'

  • Despite the confirmed failover in the logs, the NSX Manager UI shows no active or historical alarms for "Tier-0 Gateway Failover" or "Tier-1 Gateway Failover."
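To measure how long each service router was marked down, the NodeDown/NodeUp pairs can be extracted from the Edge syslog. The following is a minimal Python sketch, not an NSX tool. The regular expression is derived only from the log excerpts above and assumes that line layout, and the name tunnel_outages is hypothetical.

```python
import re
from datetime import datetime

# Pattern derived from the svcrt-fsm excerpts above (assumed layout):
# timestamp, service router ID, NodeDown/NodeUp, and the quoted reason.
FSM_LINE = re.compile(
    r'(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z)\s.*?s2comp="svcrt-fsm".*?\]\s+'
    r"(?P<sr>\S+)\s+event\s+(?P<event>NodeDown|NodeUp)\b.*?"
    r"reason\s+'(?P<reason>[^']*)'"
)


def parse_ts(ts):
    """Parse a syslog timestamp such as 2025-12-16T11:26:22.276Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def tunnel_outages(lines):
    """Return (service_router_id, down_time, seconds_down) for each
    NodeDown that is later followed by a NodeUp for the same ID."""
    down_since = {}
    outages = []
    for line in lines:
        m = FSM_LINE.search(line)
        if not m:
            continue
        sr, when = m["sr"], parse_ts(m["ts"])
        if m["event"] == "NodeDown":
            down_since[sr] = when
        elif sr in down_since:
            start = down_since.pop(sr)
            outages.append((sr, start, (when - start).total_seconds()))
    return outages
```

Called with the lines of a syslog file (for example, tunnel_outages(open(path)) with a path of your choice), it returns one entry per completed down/up cycle. For the excerpts above, each service router is down for 0.458 seconds.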

Environment

VMware NSX

Cause

The root cause is the timing discrepancy between the duration of the transient failover event and the alarm collection sampling interval.

The NSX alarm framework uses a polling mechanism to check the state of specific events. For gateway failover events (tier0_gateway_failover and tier1_gateway_failover), the standard sampling_interval is 60 seconds.

In this scenario:

  1. The network flap caused by vMotion is transient. The gateway enters a failed state (eventState="On") but recovers almost immediately (eventState="Off").

    2025-12-16T11:26:22.276Z edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########aedc" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########aedc","service_router_id":"########-####-####-####-########843f","failover_reason":"Tunnels Down"}
    2025-12-16T11:26:22.277Z edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########63ac" tid="1" level="ERROR" eventState="On" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Active","current_gateway_state":"Down","entity_id":"########-####-####-####-########63ac","service_router_id":"########-####-####-####-########004d","failover_reason":"Tunnels Down"}
    [....]
    [....]
    2025-12-16T11:26:23.472Z edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########aedc" tid="1" level="ERROR" eventState="Off" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Down","current_gateway_state":"Active","entity_id":"########-####-####-####-########aedc","service_router_id":"########-####-####-####-########843f","failover_reason":"Remote state changed to Active"}
    2025-12-16T11:26:26.837Z edge_hostname NSX 1 - [nsx@6876 comp="nsx-edge" subcomp="nsx-edge-agent" s2comp="nsx-monitoring" entId="########-####-####-####-########63ac" tid="1" level="ERROR" eventState="Off" eventFeatureName="high_availability" eventSev="error" eventType="tier1_gateway_failover"] Context report: {"previous_gateway_state":"Down","current_gateway_state":"Active","entity_id":"########-####-####-####-########63ac","service_router_id":"########-####-####-####-########004d","failover_reason":"Remote state changed to Active"}

  2. In the example above, the event was "On" at 11:26:22 and had reverted to "Off" by 11:26:26 (a duration of only about 4 seconds).

  3. If the NSX Manager alarm collector executes its check outside of this specific 4-second window, it reads the status as Off.

Because the alarm framework is currently not designed to latch onto or aggregate high-frequency "flapping" events that resolve faster than the polling cycle, the alarm is not triggered.
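The effect of the 60-second sampling interval on a 4-second event can be estimated with a short calculation. The sketch below is illustrative only and rests on two assumptions that this article does not state: each poll sees only the event state at that instant, and the poll timing is uniformly random relative to the start of the event. The function names are hypothetical.

```python
import random


def catch_probability(event_seconds, sampling_interval=60.0):
    """Chance that a single periodic poll lands inside the "On" window,
    assuming a uniformly random offset between event start and next poll."""
    return min(event_seconds / sampling_interval, 1.0)


def simulate(event_seconds, sampling_interval=60.0, trials=100_000, seed=0):
    """Monte Carlo check of catch_probability under the same assumptions."""
    rng = random.Random(seed)
    caught = 0
    for _ in range(trials):
        # Delay from the start of the event until the next poll.
        next_poll = rng.uniform(0.0, sampling_interval)
        if next_poll < event_seconds:
            caught += 1
    return caught / trials


if __name__ == "__main__":
    print(f"{catch_probability(4.0):.1%}")  # prints 6.7%
```

Under these assumptions, a 4-second "On" window is observed by roughly one poll in fifteen, and a sub-second flap by fewer than one in a hundred, which is consistent with no alarm appearing for the failover described here.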

Resolution

This behavior is a known limitation of the current alarm framework regarding transient states caused by rapid network flapping.

The Engineering team is aware of this limitation and may implement more aggressive alarm collection in future releases.
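Until then, failovers that were too short to raise an alarm can still be identified from the Edge syslog by pairing the eventState="On" and eventState="Off" messages. The following Python sketch is not an official workaround. It assumes the field order shown in the excerpts above and takes the 60-second sampling interval as a default parameter, and the name transient_failovers is hypothetical.

```python
import re
from datetime import datetime

# Pattern derived from the nsx-monitoring excerpts above (assumed field order:
# entId, then eventState, then eventType).
EVENT = re.compile(
    r'(?P<ts>\d{4}-\d{2}-\d{2}T[\d:.]+Z)\s.*?'
    r'entId="(?P<ent>[^"]+)".*?'
    r'eventState="(?P<state>On|Off)".*?'
    r'eventType="(?P<type>tier[01]_gateway_failover)"'
)


def transient_failovers(lines, sampling_interval=60.0):
    """Return (event_type, entity_id, seconds_on) for every failover event
    that went On and back Off in less than sampling_interval seconds."""
    on_since = {}
    found = []
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        key = (m["type"], m["ent"])
        when = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
        if m["state"] == "On":
            # Keep the first "On" if the same event is reported repeatedly.
            on_since.setdefault(key, when)
        elif key in on_since:
            seconds_on = (when - on_since.pop(key)).total_seconds()
            if seconds_on < sampling_interval:
                found.append((m["type"], m["ent"], seconds_on))
    return found
```

Run against the excerpts in the Cause section, this reports two tier1_gateway_failover events, lasting 1.196 and 4.56 seconds, both far shorter than the sampling interval.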