NSX one edge takes most of the traffic for over 10 minutes in T0 stateless A/A setup

Article ID: 382020

Products

VMware NSX

Issue/Introduction

  • All T0-SRs on all edges are Active, but the T0-SR on one edge takes most of the traffic.
  • The edge taking most of the traffic experienced BFD flaps with the other T0-SR edges.
  • The load returns to an even distribution across all T0-SR edges after a few minutes (anywhere from 1 to 10 minutes, depending on when the next broadcast ARP request is triggered).

If the traffic load is higher than one edge can handle, packets are dropped until the load is redistributed across all T0-SR edges.

Environment

NSX 

Cause

When one edge experiences BFD flaps with the other edges, it considers them unhealthy, takes over the backplane IPs of their T0-SRs, and sends GARP for those IPs. This attracts most ECMP traffic to that edge. The T0-SRs on the other edges do not know that their IPs have been taken over, so they do not perform split-brain healing (that is, they do not send GARP to reclaim the IPs). As a result, traffic continues to flow to that edge even after the BFD sessions come back up. ECMP traffic is redistributed to the other edges only when the other transport nodes (TN, i.e. ESX hosts) broadcast an ARP request for the backplane IPs, which happens every 10 minutes.
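The sequence above can be illustrated with a small toy model. This is only a sketch, not NSX code: it assumes one transport node with an ARP table for the backplane IPs, an ECMP hash that spreads flows evenly across those IPs, and a broadcast ARP refresh on a fixed 10-minute schedule. The names `simulate`, `edge_share`, and `bp-ip-N`, and the default values of 4 edges and a flap at 120 seconds, are hypothetical and chosen for illustration.

```python
def edge_share(arp_table, edge):
    """Fraction of backplane IPs whose ARP entry currently resolves to `edge`."""
    hits = sum(1 for owner in arp_table.values() if owner == edge)
    return hits / len(arp_table)


def simulate(num_edges=4, flapping_edge=0, flap_at_s=120,
             arp_refresh_s=600, horizon_s=900, step_s=60):
    """Return {time_s: share of ECMP next hops resolving to the flapping edge}."""
    # Healthy state: each backplane IP is owned by its own edge.
    healthy = {f"bp-ip-{i}": i for i in range(num_edges)}
    arp_table = dict(healthy)
    timeline = {}
    for t in range(0, horizon_s + 1, step_s):
        if t == flap_at_s:
            # BFD flap: the edge claims every backplane IP and sends GARP,
            # so the transport node's ARP entries all point at it.
            arp_table = {ip: flapping_edge for ip in healthy}
        elif t > flap_at_s and t % arp_refresh_s == 0:
            # Next broadcast ARP request (assumed fixed schedule): the real
            # owners reply and the entries return to the healthy mapping.
            arp_table = dict(healthy)
        timeline[t] = edge_share(arp_table, flapping_edge)
    return timeline


if __name__ == "__main__":
    for t, share in simulate().items():
        print(f"t={t:4d}s  share on flapping edge = {share:.2f}")
```

In this model the skew lasts from the flap until the next scheduled ARP broadcast, which is why the observed recovery time varies with when the flap happens relative to the 10-minute refresh. Note that BFD recovery does not appear in the loop at all, matching the point that BFD coming back up does not by itself restore the distribution.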

Resolution

None, and there is no workaround. The problem recovers automatically within 10 minutes.