After exiting maintenance mode, Edge node went up before tunnels are realized on remote TNs.

Article ID: 318330

Products

VMware NSX

Issue/Introduction

Symptoms:
VM north-south (N-S) traffic was dropped during the NSX-T Edge cluster upgrade while upgrading VCF 4.0.2 to 4.2.1.
Traffic continued to be blackholed until the ARP timeout (10 minutes) expired or edge2 finished its upgrade.

Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Cause

As part of an edge upgrade, the edge enters maintenance mode (MM), reboots, and then exits MM. On the edge appliance, nestDB persists the configuration, and as soon as the edge comes up it tries to establish connectivity using that existing configuration.

Based on the existing configuration in nestDB, the routing stack came up successfully and triggered the SR backplane IP to move back to edge1, the edge that originally owned the IP. To inform the other TNs, 10 GARPs were generated. These GARPs did not reach the ESXi TNs because the tunnels to all ESXi TNs were still down, so traffic continued to be forwarded to edge2. At this point there was still no traffic impact.

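Because the edge1 upgrade had completed, the Upgrade Coordinator (UC) picked edge2, the last edge in the cluster, to upgrade next. This is the same edge that hosted edge1's SR backplane IP during the edge1 upgrade. The traffic impact started when edge2 rebooted as part of its upgrade, because the ESXi TNs had never received the GARPs and were still forwarding traffic to edge2.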
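
The sequence above can be illustrated with a toy model. This is only a sketch for reasoning about the failure, not NSX code: the function name, the single ARP entry, and the edge labels are illustrative assumptions. It shows that whether the GARPs reach the ESXi TN decides whether traffic survives the edge2 reboot.

```python
# Toy model only: names and structure are illustrative assumptions, not NSX code.


def traffic_forwarded(garps_reached_esxi: bool) -> bool:
    """Return True if N-S traffic is still forwarded once edge2 reboots."""
    # While edge1 was being upgraded, edge2 hosted its SR backplane IP,
    # so the ESXi TN resolves that IP to edge2.
    arp_entry = "edge2"

    # edge1 exits maintenance mode, takes the IP back and sends GARPs.
    # The ESXi TN only updates its entry if the GARPs arrive, which
    # requires the tunnels to be up.
    if garps_reached_esxi:
        arp_entry = "edge1"

    # edge2 now reboots for its own upgrade, so only edge1 is forwarding.
    edges_up = {"edge1"}
    return arp_entry in edges_up


print(traffic_forwarded(garps_reached_esxi=False))  # False: traffic is blackholed
print(traffic_forwarded(garps_reached_esxi=True))   # True: traffic follows edge1
```

In the failing case the stale entry is only corrected when the ARP timeout expires or edge2 comes back, which matches the symptoms described earlier.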

Resolution

This issue is resolved in VMware NSX-T Data Center 3.1.3.3.

Workaround:
Pause the upgrade after each edge node in the cluster is upgraded, and resume it after 11 minutes, which is longer than the 10-minute ARP timeout.
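
If the upgrade is driven by a script, the pause-and-resume pattern could be sketched as follows. This is a minimal sketch under stated assumptions: `upgrade_one` is a hypothetical callable standing in for whatever mechanism actually upgrades a single edge node (UI step, API call, or other tooling), and `upgrade_edges_with_pause` is an illustrative name, not part of any NSX tool.

```python
import time
from typing import Callable, Iterable

# 11 minutes, as recommended in the workaround.
WAIT_AFTER_EACH_EDGE_S = 11 * 60


def upgrade_edges_with_pause(
    edges: Iterable[str],
    upgrade_one: Callable[[str], None],  # hypothetical: upgrades one edge node
    sleep: Callable[[float], None] = time.sleep,
    wait_s: float = WAIT_AFTER_EACH_EDGE_S,
) -> None:
    """Upgrade edge nodes one at a time, waiting after every node."""
    for edge in edges:
        upgrade_one(edge)
        # Hold the upgrade before moving on, so stale ARP entries on the
        # remote TNs can expire before the next edge reboots.
        sleep(wait_s)
```

Passing `sleep` as a parameter keeps the wait testable without actually sleeping for 11 minutes.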