Application Outage During NSX Edge Upgrade Failover Due to Missed GARP


Article ID: 440381


Updated On:

Products

VMware NSX

Issue/Introduction

Users may experience a network outage lasting approximately 10 minutes for applications running on NSX-managed workloads during an Edge Node upgrade or scheduled maintenance failover.

*   Workload traffic is impacted specifically during the transition of the Active Edge role to the Standby Edge node.
*   Connectivity is automatically restored after approximately 10 minutes without manual intervention.

Environment

VMware NSX

Cause

When the Active Edge enters maintenance mode or reboots, the Standby Edge successfully takes over the Active role and triggers Gratuitous ARP (GARP) updates to the transport nodes (ESXi hosts) to update their MAC-to-TEP binding tables.

In this scenario, the ESXi hosts fail to receive or process these GARP packets. Because the hosts do not update their mapping, they continue to forward encapsulated traffic to the VTEP (Virtual Tunnel Endpoint) of the now-offline edge node. 

Traffic remains blackholed until the ARP entry for the backplane/internal IP (e.g., `169.###.#.2`) expires on the host. Once the entry expires (typically ~10 minutes), the host initiates a new ARP request, receives a reply from the now-active edge node, and updates its tables, restoring traffic flow.
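The timing described above can be illustrated with a small model. This is a minimal sketch, not NSX code: the 600-second timeout is an assumption taken from the "~10 minutes" figure in this article, and `ArpEntry` and `outage_seconds` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Assumption: the host ages out the backplane ARP entry after roughly
# 10 minutes, matching the outage duration described in this article.
ARP_TIMEOUT_S = 600


@dataclass
class ArpEntry:
    mac: str         # MAC the host currently associates with the backplane IP
    learned_at: int  # time in seconds when the entry was learned or refreshed


def outage_seconds(failover_at: int, entry: ArpEntry, garp_processed: bool) -> int:
    """Estimate how long traffic is blackholed after a failover at `failover_at`."""
    if garp_processed:
        # The host updates its binding as soon as the GARP is processed.
        return 0
    # Otherwise the stale entry keeps pointing at the old Edge until it expires.
    expires_at = entry.learned_at + ARP_TIMEOUT_S
    return max(0, expires_at - failover_at)


stale = ArpEntry(mac="old-edge-mac", learned_at=0)
print(outage_seconds(failover_at=0, entry=stale, garp_processed=False))
print(outage_seconds(failover_at=0, entry=stale, garp_processed=True))
```

In this model, the worst case occurs when the entry was refreshed just before the failover, which leaves the full timeout to run before the host sends a new ARP request.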

Resolution

To verify whether this issue is occurring, check the Edge logs and the host-side GARP counters:

1. Verify Edge State Transition

Log event in the Edge syslog confirming failover initiation:

[nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="ha-cluster" level="INFO" eventId="vmwNSXClusterFailoverStatus"] {"event_state":0,"event_external_reason":"Service router switches over from Standby to Active. ","event_src_comp_id":"########-####-####-####-############","event_sources":{"id":"########-####-####-####-############","router_id":"########-####-####-####-############"}}

2. Check the syslog of the new Active Edge to confirm GARP requests and replies.

NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="neigh" tname="dp-learning3" level="INFO"] retry #10, announcing (########-####-####-####-############, 169.###.#.2)
NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="neigh" tname="dp-learning3" level="INFO"] retry #10, announcing (########-####-####-####-############, 169.###.#.2)

NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="arp" level="INFO"] GARP reply received for 169.###.#.2 from ##:##:##:##:##:## on lrouter port ########-####-####-####-############

3. After some time, ARP replies for the backplane IP address are sent from the new Active Edge node.

NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="arp" level="INFO"] ARP reply sent to ##:##:##:##:##:## for 169.###.#.2 from ##:##:##:##:##:## on lrouter port ########-####-####-####-############
NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="arp" level="INFO"] ARP reply sent to ##:##:##:##:##:## for 169.###.#.2 from ##:##:##:##:##:## on lrouter port ########-####-####-####-############
NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="arp" level="INFO"] ARP reply sent to ##:##:##:##:##:## for 169.###.#.2 from ##:##:##:##:##:## on lrouter port ########-####-####-####-############
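When reviewing a large Edge syslog, the message types shown above can be counted with a short script. This is a minimal sketch that matches only the message text appearing in the samples in this article. The `PATTERNS` table and `summarize_edge_log` function are hypothetical helpers, not part of any NSX tooling, and the exact log wording may differ between NSX versions.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines in this article (assumption:
# the message text is unchanged in the environment being analyzed).
PATTERNS = {
    "garp_announce": re.compile(r'subcomp="datapathd".*announcing \('),
    "garp_reply_received": re.compile(r"\bGARP reply received for "),
    "arp_reply_sent": re.compile(r"\bARP reply sent to "),
}


def summarize_edge_log(lines):
    """Count GARP announcements, GARP replies received, and ARP replies sent."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    'NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" '
    's2comp="neigh" tname="dp-learning3" level="INFO"] retry #10, '
    'announcing (########-####-####-####-############, 169.###.#.2)',
    'NSX 12236 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" '
    's2comp="arp" level="INFO"] ARP reply sent to ##:##:##:##:##:## for '
    '169.###.#.2 from ##:##:##:##:##:## on lrouter port ####',
]
print(summarize_edge_log(sample))
```

A log that shows GARP announcements at failover time followed, minutes later, by a burst of ARP replies sent for the backplane IP is consistent with hosts relearning the binding only after their ARP entries expired.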

4. Check Host Statistics
Check the Logical Interface (LIF) statistics on the impacted ESXi hosts and confirm whether GARP packets were received.

commands/dump-vdr-info.sh.txt

LIF IPv4 Net Statistics (approx.):

        IP & ARP packets RX:                     753
        IP & ARP packets TX:                     217424
        IP packets Forwarded to Lif:             216673
        ARP Request RX:                          0
        ARP Request TX:                          1
        ARP Response RX:                         753
        GARP RX:                                 0
        GARP TX:                                 1
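To check these counters across several hosts, the statistics block can be parsed into a dictionary. This is a minimal sketch that assumes the `name: value` layout shown above. `parse_lif_stats` and `garp_rx_missing` are hypothetical helper names. A zero `GARP RX` counter is an indicator to correlate with the Edge logs, not proof of the issue on its own.

```python
import re
from typing import Dict

# Matches lines such as "        GARP RX:        0" (assumption: the output
# keeps the "name: integer" layout shown in this article).
STAT_LINE = re.compile(r"^\s*(.+?):\s+(\d+)\s*$")


def parse_lif_stats(text: str) -> Dict[str, int]:
    """Parse 'name: value' counter lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def garp_rx_missing(stats: Dict[str, int]) -> bool:
    """Return True when the LIF reports that no GARP packets were received."""
    return stats.get("GARP RX") == 0


sample = """
LIF IPv4 Net Statistics (approx.):

        ARP Response RX:                         753
        GARP RX:                                 0
        GARP TX:                                 1
"""
stats = parse_lif_stats(sample)
print(stats, garp_rx_missing(stats))
```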

Recommendations

Additional Information

For more information about backplane IP movement during Edge failover, refer to "Traffic failover when NSX Edge is placed in NSX Maintenance Mode (MM)".