Traffic interruption during Edge upgrade due to physical switch GARP processing failure

Article ID: 432419

Updated On:

Products

VMware NSX

Issue/Introduction

  • During an NSX upgrade, application traffic may be interrupted for an extended period after an Edge High Availability (HA) failover. Although the NSX upgrade completes successfully, ingress traffic fails to reach the newly active Edge node, resulting in "upstream timed out" errors and application downtime. This occurs because the upstream physical infrastructure continues to forward traffic to the previously active (now standby) Edge node.

Environment

  • NSX Edge (T0/T1)

Cause

  • The upstream physical switch fails to update its MAC address table or ARP table after the failover. When an NSX Edge assumes the "Active" role, it broadcasts Gratuitous ARP (GARP) packets so that upstream devices learn which MAC address now owns its IP addresses. If the physical switch ignores or drops these GARP packets, often because of Spanning Tree Protocol (STP) port states or security configurations, it retains the stale MAC-to-IP mapping for the virtual IP (VIP). Traffic is "blackholed" until the stale ARP entry on the physical switch ages out and a new ARP request is sent.
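
To make the mechanism concrete, the following is a minimal Python sketch of what a GARP frame looks like on the wire. It is an illustration only: the function name `build_garp`, the MAC address, and the IP address are hypothetical, and the article does not state whether the NSX Edge sends the ARP request form (opcode 1, used here) or the reply form (opcode 2). The defining property is that the sender and target IP fields carry the same address, which tells every listener to refresh its entry for that IP.

```python
import socket
import struct


def build_garp(mac: str, ip: str) -> bytes:
    """Build a broadcast gratuitous ARP frame (request form) for mac/ip.

    Assumption: opcode 1 (request) with an all-zero target MAC. Some
    implementations send opcode 2 (reply) instead.
    """
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, sender MAC, EtherType ARP.
    ethernet = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), lengths 6 and 4, opcode 1.
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC (zeros) and target IP == sender IP.
    arp += mac_bytes + ip_bytes + b"\x00" * 6 + ip_bytes

    return ethernet + arp


# Hypothetical values for illustration only.
frame = build_garp("00:50:56:aa:bb:cc", "192.0.2.10")
print(len(frame), frame.hex())
```

A switch or router that processes this frame overwrites its entry for 192.0.2.10 with the new MAC address. One that drops it keeps forwarding to the old one.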

Resolution

  • Ensure the physical switch is configured to accept and process Gratuitous ARP broadcasts.
  • Configure the physical switch ports connected to NSX Edges as STP edge ports (PortFast on Cisco switches). This ensures the port transitions to the forwarding state immediately, so the switch does not miss the initial GARP broadcasts sent by the Edge during failover.
  • Review the ARP timeout setting on the upstream router. The higher the timeout, the longer traffic remains blackholed after a missed GARP.
  • Monitor the NSX Edge syslogs to confirm that the Edge sends GARP packets during failover.
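
When checking the physical side, it can also help to confirm from a packet capture taken on the switch uplink that the GARP frames actually arrive. The sketch below is one possible way to classify a raw Ethernet frame as a gratuitous ARP; it is not an NSX or switch-vendor tool. The function name `is_gratuitous_arp` and the sample frame are hypothetical, and obtaining the raw frame bytes from a capture file is left to whatever capture tooling is in use.

```python
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_VLAN = 0x8100  # 802.1Q tag


def is_gratuitous_arp(frame: bytes) -> bool:
    """Return True if a raw Ethernet frame is an IPv4 gratuitous ARP.

    A frame is treated as gratuitous when the ARP sender IP equals the
    target IP. Both the request (1) and reply (2) forms are accepted.
    """
    if len(frame) < 14:
        return False
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14

    # Skip a single 802.1Q VLAN tag if present.
    if ethertype == ETHERTYPE_VLAN:
        if len(frame) < 18:
            return False
        ethertype = struct.unpack("!H", frame[16:18])[0]
        offset = 18

    if ethertype != ETHERTYPE_ARP or len(frame) < offset + 28:
        return False

    htype, ptype, hlen, plen, oper = struct.unpack(
        "!HHBBH", frame[offset:offset + 8]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        return False

    sender_ip = frame[offset + 14:offset + 18]
    target_ip = frame[offset + 24:offset + 28]
    return oper in (1, 2) and sender_ip == target_ip


# Hypothetical sample: broadcast GARP for 192.0.2.10 from 00:50:56:aa:bb:cc.
sample = bytes.fromhex(
    "ffffffffffff" "005056aabbcc" "0806"
    "0001" "0800" "06" "04" "0001"
    "005056aabbcc" "c000020a"
    "000000000000" "c000020a"
)
print(is_gratuitous_arp(sample))
```

If the capture shows GARP frames reaching the switch port while the ARP table still holds the old MAC address, the switch is receiving the frames but not acting on them. That points to its ARP or security configuration rather than to the Edge.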