After exiting maintenance mode, the Edge node comes up before tunnels are realized on remote TNs
Article ID: 318330
Products
VMware NSX
Issue/Introduction
Symptoms: VM north-south (N-S) traffic is dropped during the NSX-T Edge cluster upgrade when upgrading VCF 4.0.2 to 4.2.1. Traffic continues to be blackholed until the ARP timeout (10 minutes) expires or edge2 finishes its upgrade.
Environment
VMware NSX-T Data Center 3.x
VMware NSX-T Data Center
Cause
As part of the edge upgrade, the edge enters maintenance mode (MM), reboots, and then exits MM. On the edge appliance, nestDB persists the configuration, so as soon as the edge comes up it tries to establish connectivity using the existing configuration.
Based on the existing configuration in nestDB, the routing stack came up successfully and triggered the SR backplane IP to move back to edge1, the edge that originally owned the IP. To inform the other TNs, 10 GARPs were generated. These GARPs did not reach the ESXi TNs because the tunnels to all ESXi TNs were still down, so the ESXi TNs continued to forward traffic to edge2. At this point there was still no traffic impact and everything worked.
Because the edge1 upgrade had completed, the Upgrade Coordinator (UC) picked the last edge in the cluster, edge2, to upgrade. This is the same edge that hosted the SR backplane IP of edge1 during the edge1 upgrade. The traffic impact started when edge2 rebooted as part of its upgrade.
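The sequence above can be illustrated with a toy model of the stale ARP entry on an ESXi TN. This is only a sketch for reasoning about the timeline, not NSX code: the function names, the single cached entry, and the way entry age is tracked are all assumptions made for illustration. Only the 10 minute ARP timeout and the edge1/edge2 roles come from the article.

```python
# Toy model (hypothetical, for illustration only): an ESXi TN caches which
# edge owns the SR backplane IP. The GARPs from edge1 were lost, so the
# cached entry still points at edge2 until it ages out.

ARP_TIMEOUT_S = 600  # the 10 minute ARP timeout mentioned in the symptoms


def next_hop(cached_owner, cache_age_s, current_owner):
    """Return the edge the ESXi TN forwards to for the SR backplane IP."""
    if cache_age_s >= ARP_TIMEOUT_S:
        # The entry aged out, so a fresh ARP resolves the real owner.
        return current_owner
    # The entry is still valid from the TN's point of view, even if stale.
    return cached_owner


def traffic_ok(cached_owner, cache_age_s, current_owner, edges_up):
    """Traffic flows only if the chosen next hop is an edge that is up."""
    return next_hop(cached_owner, cache_age_s, current_owner) in edges_up


if __name__ == "__main__":
    # edge1 is back and owns the IP again, but the TN still points at edge2.
    print(traffic_ok("edge2", 0, "edge1", {"edge1", "edge2"}))  # edge2 still up
    print(traffic_ok("edge2", 300, "edge1", {"edge1"}))         # edge2 rebooting
    print(traffic_ok("edge2", 600, "edge1", {"edge1"}))         # entry aged out
```

In this model, traffic survives while edge2 is still up, is dropped once edge2 reboots, and recovers either when the stale entry ages out or when edge2 returns, which matches the two recovery conditions listed in the symptoms.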
Resolution
This issue is resolved in VMware NSX-T Data Center 3.1.3.3.
Workaround: Pause the upgrade after each edge node in the cluster is upgraded, and resume the upgrade after 11 minutes.
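The pacing of the workaround can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not an NSX tool: `upgrade_one` is a hypothetical callable standing in for whatever step upgrades a single edge node (for example, an operator action in the upgrade UI), and the `sleep` parameter exists only so the wait can be replaced in a dry run. The 11 minute wait is the value given in the workaround.

```python
import time

SETTLE_SECONDS = 11 * 60  # the 11 minute wait from the workaround


def upgrade_edges_with_pauses(edges, upgrade_one, sleep=time.sleep,
                              settle_seconds=SETTLE_SECONDS):
    """Upgrade edges one at a time, waiting after each before continuing.

    `upgrade_one` is a hypothetical callable that upgrades a single edge
    node and returns once that node's upgrade is complete.
    """
    done = []
    for edge in edges:
        upgrade_one(edge)       # upgrade exactly one edge node
        done.append(edge)
        sleep(settle_seconds)   # keep the upgrade paused for 11 minutes
    return done


if __name__ == "__main__":
    # Dry run: print instead of upgrading, and skip the real wait.
    upgrade_edges_with_pauses(
        ["edge1", "edge2"],
        upgrade_one=lambda edge: print(f"upgrading {edge}"),
        sleep=lambda seconds: print(f"paused for {seconds} s"),
    )
```

The wait is longer than the 10 minute ARP timeout, so stale entries on the ESXi TNs have time to expire before the next edge node reboots.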