After a planned maintenance window to replace or upgrade physical core switches connected to an NSX environment, access devices are unable to reach production VLANs. Symptoms include intermittent connectivity on the management network, unreachable production VLANs, and NSX tunnel failures.
These symptoms may initially appear to be NSX-related, but the underlying cause is often in the physical switching infrastructure rather than NSX itself. This article provides a methodology to isolate whether the issue resides in NSX or the physical network.
The physical switching infrastructure has an underlying connectivity issue introduced during or after the maintenance window. Common causes include trunk misconfiguration, VPC or port-channel inconsistencies, unexpected HSRP/VRRP failovers, and STP topology changes on the new switches.
NSX tunnel failures are a downstream symptom of this underlying physical connectivity instability, not the root cause.
Use the following process of elimination to determine whether the issue is in NSX or the physical switching infrastructure:
Step 1: Test management network connectivity from ESXi hosts
SSH to an affected ESXi host and ping the local gateway for the management network:
ping <management_gateway_IP>
If the management network is on a regular vSphere Distributed Switch (VDS) or standard vSwitch that is not NSX-managed, intermittent ping failures indicate a physical network issue rather than an NSX issue. Management traffic does not traverse the NSX overlay, so failures here point upstream.
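The Step 1 check can be scripted so intermittent loss shows up as a number rather than eyeballed output. This is a minimal sketch, assuming the standard ping summary line ("X packets transmitted, Y packets received"); the gateway address is a placeholder.

```shell
#!/bin/sh
# Sketch: quantify packet loss toward the management gateway (Step 1).
# GW_IP is a placeholder for <management_gateway_IP>.

loss_pct() {
  # Parse the standard ping summary line
  # ("20 packets transmitted, 18 packets received, ...") and print the
  # percentage of packets lost.
  awk -F'[ ,]+' '/packets transmitted/ {
    tx = $1; rx = $4
    printf "%d\n", (tx - rx) * 100 / tx
  }'
}

# Example usage on the ESXi host (GW_IP is hypothetical):
#   ping -c 20 "$GW_IP" | loss_pct
# Repeated non-zero values on a non-NSX management interface point upstream,
# at the physical network.
```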
Step 2: Test connectivity from the physical switches
Access the core switches and test connectivity to the same gateway or upstream devices. If connectivity is intermittent or failing from the switch itself, the problem is in the physical switching infrastructure or further upstream.
Step 3: Check for HSRP/VRRP state changes
Verify whether the HSRP active/standby or VRRP master/backup roles have changed unexpectedly after the maintenance. An unplanned failover can indicate underlying issues with the switching configuration or with connectivity between the switch peers.
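One way to act on Step 3 is to scan the switch's HSRP status output for groups stuck in a transitional state. This sketch assumes Cisco-style "show standby brief" output (the column layout is an assumption; state names for VRRP's "show vrrp brief" differ slightly):

```shell
#!/bin/sh
# Sketch: flag HSRP groups not settled in Active or Standby (Step 3).
# Feed it the captured text of "show standby brief" from the switch.

unsettled_groups() {
  # Init/Learn/Listen/Speak are transitional HSRP states; a group lingering
  # in one of them after maintenance suggests the peers are still negotiating
  # or flapping. Prints: interface, group, state.
  awk '{
    for (i = 3; i <= NF; i++)
      if ($i ~ /^(Init|Learn|Listen|Speak)$/) { print $1, $2, $i; next }
  }'
}
```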
Step 4: Verify the issue persists regardless of the switches in place
If the same configuration was applied to the new switches and the issue persists even after reverting to the original switches, this points to a deeper infrastructure issue rather than a simple misconfiguration introduced during the migration.
Step 5: Isolate NIC teaming as a variable
Temporarily disable NIC teaming on both the ESXi hosts and the physical switches to rule out teaming-related issues. For guidance, see: How to configure NIC teaming in ESXi and ESX
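Before disabling teaming, record the current policy so it can be restored. This sketch assumes a standard vSwitch named vSwitch0 queried with esxcli; a VDS teaming policy is viewed and edited in the vSphere Client instead, and the helper name is hypothetical:

```shell
#!/bin/sh
# Sketch: record the current teaming policy before changing it (Step 5).
# Assumes a standard vSwitch named vSwitch0 (placeholder):
#
#   esxcli network vswitch standard policy failover get -v vSwitch0

active_uplinks() {
  # Extract the adapter list from the "Active Adapters:" line of that output.
  sed -n 's/^ *Active Adapters: *//p'
}

# More than one active adapter means teaming is still a variable; to isolate
# it, leave a single NIC active and retest.
```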
Step 6: Capture packets at the ESXi uplink layer
Run packet captures on the ESXi host using the pktcap-uw tool to observe traffic at the vmnic (uplink) level. This helps determine whether traffic is leaving the host but not returning, which would indicate a physical network issue. For guidance, see: Using the pktcap-uw tool in ESXi
Example command to capture on vmnic0:
pktcap-uw --uplink vmnic0 -o /tmp/vmnic0_capture.pcap
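To answer Step 6's "leaving but not returning" question directly, capture each direction separately and compare counts. The --dir values, file paths, and the verdict helper below are illustrative assumptions; confirm the flags against pktcap-uw -h on your ESXi build.

```shell
#!/bin/sh
# Sketch: capture transmit and receive separately on the uplink (Step 6),
# then compare counts. Flag values and paths are illustrative:
#
#   pktcap-uw --uplink vmnic0 --dir 1 -c 1000 -o /tmp/vmnic0_tx.pcap   # tx
#   pktcap-uw --uplink vmnic0 --dir 0 -c 1000 -o /tmp/vmnic0_rx.pcap   # rx
#
# Count the relevant packets in each file, e.g.:
#   tcpdump-uw -r /tmp/vmnic0_rx.pcap icmp | wc -l

verdict() {
  # $1 = packets sent from the host, $2 = replies seen on the uplink
  tx=$1; rx=$2
  if [ "$tx" -gt 0 ] && [ "$rx" -eq 0 ]; then
    echo "traffic leaves the host but never returns: suspect physical network"
  else
    echo "replies are reaching the uplink: look further inside the host"
  fi
}
```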
Step 7: Engage the switch vendor if physical infrastructure is identified as the cause
If the above steps indicate the issue is in the physical switching infrastructure, engage the switch vendor support team to troubleshoot VPC, HSRP/VRRP, STP, or trunk configurations. Once physical connectivity is stabilized, NSX tunnels recover automatically.
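Once the vendor confirms the physical fix, tunnel recovery can be verified from the NSX side. The API path below is an assumption based on the NSX-T Manager API and should be checked against your version's API guide; NSX_MGR, NODE_ID, and the down_tunnels helper are hypothetical:

```shell
#!/bin/sh
# Sketch: confirm tunnel recovery after the physical fix (Step 7).
# The API path is an assumption; NSX_MGR and NODE_ID are placeholders:
#
#   curl -sk -u admin "https://$NSX_MGR/api/v1/transport-nodes/$NODE_ID/tunnels"

down_tunnels() {
  # Count tunnels whose status is reported as DOWN in the JSON response.
  # (grep -c exits non-zero when the count is 0, i.e. all tunnels are up.)
  grep -c '"status" *: *"DOWN"'
}
```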