VM network connectivity issues after NSX Manager upgrade

Products

VMware NSX

Issue/Introduction

After upgrading NSX Manager nodes, hosts in one or more clusters may show a degraded status, VMs may lose network connectivity, and NSX overlay tunnels may go down.
Unexpected changes in VTEP (Virtual Tunnel End Point) labels are observed on Transport Nodes after the NSX Manager upgrade. This may lead to network disruptions and loss of VM connectivity.
Logs on the ESXi host (/var/run/log/nsx-syslog.log) show updates to the TN_TRANSPORT_SWITCHES configuration, indicating a change in the vtep_label for vmknic interfaces.
Log lines similar to the following appear on the NSX-managed ESXi host:

YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[XXXXXXX]: NSX XXXXXXX - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="XXXXXXX" level="info"] ConfigCache: Update TN_TRANSPORT_SWITCHES old config transport_switch { transport_switch_name: "XX XX XX XX XX XX XX XX-XX XX XX XX XX XX XX XX" ... vtep_label { dev_name: "vmkXX" label: XXXXX } ... } new config transport_switch { transport_switch_name: "YY YY YY YY YY YY YY YY-YY YY YY YY YY YY YY YY" ... vtep_label { dev_name: "vmkXX" label: YYYY } ... }

Note: The preceding log excerpt is only an example. Date, time, and environment-specific values will vary. The vtep_label value differs between the "old config" and "new config" entries: the transport switch name is updated and new labels are created for the vmknic interfaces.
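To check many hosts for this signature, the log line can be scanned programmatically. The following is a minimal Python sketch, assuming the line layout matches the example above. The function name vtep_label_changes and the sample values are hypothetical and are not part of NSX.

```python
import re

# Matches one vtep_label entry as it appears in the example log line.
VTEP_LABEL_RE = re.compile(
    r'vtep_label\s*\{\s*dev_name:\s*"([^"]+)"\s*label:\s*(\d+)\s*\}'
)

def vtep_label_changes(log_line):
    """Return {dev_name: (old_label, new_label)} for labels that changed.

    Assumes the "old config" block precedes the "new config" block,
    as in the example log line. Labels are returned as strings.
    """
    if "Update TN_TRANSPORT_SWITCHES" not in log_line:
        return {}
    old_part, sep, new_part = log_line.partition("new config")
    if not sep:
        return {}
    old = dict(VTEP_LABEL_RE.findall(old_part))
    new = dict(VTEP_LABEL_RE.findall(new_part))
    return {
        dev: (old[dev], new[dev])
        for dev in old
        if dev in new and old[dev] != new[dev]
    }

# Made-up sample line for illustration only.
sample = (
    'ConfigCache: Update TN_TRANSPORT_SWITCHES old config transport_switch '
    '{ vtep_label { dev_name: "vmk10" label: 12345 } } new config '
    'transport_switch { vtep_label { dev_name: "vmk10" label: 6789 } }'
)
print(vtep_label_changes(sample))
```

An empty result means the line either is not a TN_TRANSPORT_SWITCHES update or carries the same labels in both configurations.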

Environment

VMware NSX

Cause

This issue is caused by the NSX Manager's internal logic for managing Virtual Tunnel End Point (VTEP) configurations on transport nodes, specifically when a vSphere Distributed Switch (VDS) has been renamed in vCenter prior to the NSX Manager upgrade.
During the NSX Manager upgrade process, a Transport Node update operation is triggered. NSX relies on the Distributed Switch name (in addition to its unique ID) to determine if it should re-use existing VTEP configurations.
If the VDS name has been changed in vCenter, NSX Manager misinterprets the switch as a new host switch, even though the underlying VDS UUID remains the same. This forces NSX to de-provision the old VTEP labels and allocate entirely new ones.
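The effect of including the switch name in the comparison can be illustrated with a toy model. This is not NSX code: the function reconcile_vtep_labels, its data layout, and the label numbers are all hypothetical, and the UUID-only comparison is shown purely for contrast, not as a description of how the fix is implemented.

```python
from itertools import count

def reconcile_vtep_labels(stored, uuid, name, vmknics, next_label, match_on_name):
    """Toy model: reuse stored labels only if the switch is seen as unchanged.

    stored is a dict with keys 'uuid', 'name' and 'labels' (dev -> label).
    """
    same_switch = stored["uuid"] == uuid and (
        not match_on_name or stored["name"] == name
    )
    if same_switch:
        # Existing VTEP configuration is reused; labels stay stable.
        return dict(stored["labels"])
    # Switch treated as new: every vmknic receives a fresh label.
    return {dev: next(next_label) for dev in vmknics}

stored = {"uuid": "uuid-1", "name": "VDS-old",
          "labels": {"vmk10": 100, "vmk11": 101}}

# Renamed switch, name included in the comparison: labels are reallocated.
print(reconcile_vtep_labels(stored, "uuid-1", "VDS-new",
                            ["vmk10", "vmk11"], count(2000), True))

# Same rename, hypothetical UUID-only comparison: labels are kept.
print(reconcile_vtep_labels(stored, "uuid-1", "VDS-new",
                            ["vmk10", "vmk11"], count(2000), False))
```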

Resolution

The issue has been resolved in NSX versions 4.2.3.2, 4.2.4, 9.0.2 and 9.1.

Workaround:

For each ESXi host, identify the VTEP vmknic interfaces (e.g., vmk10, vmk11).
Retrieve the current gateway IP of each VTEP interface on every impacted host using the following commands:

[root@esx-01:~] esxcli network ip interface ipv4 get -i vmk10

[root@esx-01:~] esxcli network ip interface ipv4 get -i vmk11
Explicitly set the gateway IP address, even if it appears correct, using net-vdl2 for each VTEP. This command populates an internal cache that helps restore the gateway IP even if VTEP labels are updated.

[root@esx-01:~] net-vdl2 -G gwIP -s <dvs> -k vmk10 -x <gwIP>

[root@esx-01:~] net-vdl2 -G gwIP -s <dvs> -k vmk11 -x <gwIP>

Note: Upgrading to a version that does not contain the fix, after the VDS has been renamed, will trigger the issue. To prevent downtime, explicitly set the gateway using the net-vdl2 commands after the upgrade.
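When many hosts are affected, the net-vdl2 command lines can be generated from the esxcli output rather than typed by hand. The sketch below only builds the command strings and does not run them. It assumes the esxcli output is a table whose columns are separated by two or more spaces and include "Name" and "Gateway". The sample output, IP addresses, switch name, and helper names are hypothetical, so verify the layout against your own hosts first.

```python
import re
import shlex

def parse_esxcli_table(text):
    """Parse a whitespace-aligned esxcli table into a list of dicts.

    Assumes columns are separated by two or more spaces and that the
    divider row consists only of dashes and spaces.
    """
    rows = [ln for ln in text.splitlines()
            if ln.strip() and not set(ln) <= set("- ")]
    header = re.split(r"\s{2,}", rows[0].strip())
    parsed = []
    for ln in rows[1:]:
        fields = re.split(r"\s{2,}", ln.strip())
        if len(fields) != len(header):
            raise ValueError("unexpected row layout: %r" % ln)
        parsed.append(dict(zip(header, fields)))
    return parsed

def build_gateway_commands(esxcli_output, dvs_name):
    """Build one net-vdl2 command per interface, in the form used above."""
    return [
        "net-vdl2 -G gwIP -s %s -k %s -x %s"
        % (shlex.quote(dvs_name), row["Name"], row["Gateway"])
        for row in parse_esxcli_table(esxcli_output)
    ]

# Hypothetical sample of `esxcli network ip interface ipv4 get -i vmk10` output.
SAMPLE = """\
Name   IPv4 Address  IPv4 Netmask   IPv4 Broadcast  Address Type  Gateway      DHCP DNS
-----  ------------  -------------  --------------  ------------  -----------  --------
vmk10  172.16.10.11  255.255.255.0  172.16.10.255   STATIC        172.16.10.1  false
"""

for cmd in build_gateway_commands(SAMPLE, "VDS-01"):
    print(cmd)
```

Review the generated commands before running them on the host, in particular the gateway value for each vmknic.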

Additional Information

Rebooting the ESXi hosts also resolves the issue.