After a planned maintenance window to replace or upgrade physical core switches connected to an NSX environment, access devices are unable to reach production VLANs. Symptoms include intermittent connectivity on the management network, unreachable production VLANs, and NSX tunnel failures.
These symptoms may initially appear to be NSX-related, but the underlying cause is often in the physical switching infrastructure rather than NSX itself. This article provides a methodology to isolate whether the issue resides in NSX or the physical network.
The physical switching infrastructure has an underlying connectivity issue introduced during or after the maintenance window. Common causes include trunk misconfiguration, VPC or port-channel inconsistencies, unexpected HSRP/VRRP failovers, and STP topology changes on the new switches.
NSX tunnel failures are a downstream symptom of this underlying physical connectivity instability, not the root cause.
Use the following process of elimination to determine whether the issue is in NSX or the physical switching infrastructure:
Step 1: Test management network connectivity from ESXi hosts
SSH to an affected ESXi host and ping the local gateway for the management network:
ping <management_gateway_IP>
If the management network is on a regular vSphere Distributed Switch (VDS) or standard vSwitch that is not NSX-managed, intermittent ping failures indicate a physical network issue rather than an NSX issue. Management traffic does not traverse the NSX overlay, so failures here point upstream.
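The Step 1 check can be scripted so intermittent loss shows up as a number rather than eyeballed output. This is a minimal sketch, assuming the standard ping summary line ("X packets transmitted, Y packets received"); the gateway address is a placeholder.

```shell
#!/bin/sh
# Sketch: quantify packet loss toward the management gateway (Step 1).
# GW_IP is a placeholder for <management_gateway_IP>.

loss_pct() {
  # Parse the standard ping summary line
  # ("20 packets transmitted, 18 packets received, ...") and print the
  # percentage of packets lost.
  awk -F'[ ,]+' '/packets transmitted/ {
    tx = $1; rx = $4
    printf "%d\n", (tx - rx) * 100 / tx
  }'
}

# Example usage on the ESXi host (GW_IP is hypothetical):
#   ping -c 20 "$GW_IP" | loss_pct
# Repeated non-zero values on a non-NSX management interface point upstream,
# at the physical network.
```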
Step 2: Test connectivity from the physical switches
Access the core switches and test connectivity to the same gateway or upstream devices. If connectivity is intermittent or failing from the switch itself, the problem is in the physical switching infrastructure or further upstream.
Step 3: Check for HSRP/VRRP state changes
Verify whether the HSRP active/standby or VRRP master/backup roles have changed unexpectedly after the maintenance. An unplanned failover can indicate underlying issues with the switching configuration or with connectivity between the switch peers.
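One way to act on Step 3 is to scan the switch's HSRP status output for groups stuck in a transitional state. This sketch assumes Cisco-style "show standby brief" output (the column layout is an assumption; state names for VRRP's "show vrrp brief" differ slightly):

```shell
#!/bin/sh
# Sketch: flag HSRP groups not settled in Active or Standby (Step 3).
# Feed it the captured text of "show standby brief" from the switch.

unsettled_groups() {
  # Init/Learn/Listen/Speak are transitional HSRP states; a group lingering
  # in one of them after maintenance suggests the peers are still negotiating
  # or flapping. Prints: interface, group, state.
  awk '{
    for (i = 3; i <= NF; i++)
      if ($i ~ /^(Init|Learn|Listen|Speak)$/) { print $1, $2, $i; next }
  }'
}
```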
Step 4: Verify the issue persists regardless of the switches in place
If the same configuration was applied to the new switches and the issue persists even after reverting to the original switches, this points to a deeper infrastructure issue rather than a simple misconfiguration introduced during the migration.
Step 5: Isolate NIC teaming as a variable
Temporarily disable NIC teaming on both the ESXi hosts and the physical switches to rule out teaming-related issues. For guidance, see: How to configure NIC teaming in ESXi and ESX
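Before disabling teaming, record the current policy so it can be restored. This sketch assumes a standard vSwitch named vSwitch0 queried with esxcli; a VDS teaming policy is viewed and edited in the vSphere Client instead, and the helper name is hypothetical:

```shell
#!/bin/sh
# Sketch: record the current teaming policy before changing it (Step 5).
# Assumes a standard vSwitch named vSwitch0 (placeholder):
#
#   esxcli network vswitch standard policy failover get -v vSwitch0

active_uplinks() {
  # Extract the adapter list from the "Active Adapters:" line of that output.
  sed -n 's/^ *Active Adapters: *//p'
}

# More than one active adapter means teaming is still a variable; to isolate
# it, leave a single NIC active and retest.
```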
Step 6: Capture packets at the ESXi uplink layer
Run packet captures on the ESXi host using the pktcap-uw tool to observe traffic at the vmnic (uplink) level. This helps determine whether traffic is leaving the host but not returning, which would indicate a physical network issue. For guidance, see: Using the pktcap-uw tool in ESXi
Example command to capture on vmnic0:
pktcap-uw --uplink vmnic0 -o /tmp/vmnic0_capture.pcap
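To answer Step 6's "leaving but not returning" question directly, capture each direction separately and compare counts. The --dir values, file paths, and the verdict helper below are illustrative assumptions; confirm the flags against pktcap-uw -h on your ESXi build.

```shell
#!/bin/sh
# Sketch: capture transmit and receive separately on the uplink (Step 6),
# then compare counts. Flag values and paths are illustrative:
#
#   pktcap-uw --uplink vmnic0 --dir 1 -c 1000 -o /tmp/vmnic0_tx.pcap   # tx
#   pktcap-uw --uplink vmnic0 --dir 0 -c 1000 -o /tmp/vmnic0_rx.pcap   # rx
#
# Count the relevant packets in each file, e.g.:
#   tcpdump-uw -r /tmp/vmnic0_rx.pcap icmp | wc -l

verdict() {
  # $1 = packets sent from the host, $2 = replies seen on the uplink
  tx=$1; rx=$2
  if [ "$tx" -gt 0 ] && [ "$rx" -eq 0 ]; then
    echo "traffic leaves the host but never returns: suspect physical network"
  else
    echo "replies are reaching the uplink: look further inside the host"
  fi
}
```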
Step 7: Engage the switch vendor if physical infrastructure is identified as the cause
If the above steps indicate the issue is in the physical switching infrastructure, engage the switch vendor support team to troubleshoot VPC, HSRP/VRRP, STP, or trunk configurations. Once physical connectivity is stabilized, NSX tunnels recover automatically.
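Once the vendor confirms the physical fix, tunnel recovery can be verified from the NSX side. The API path below is an assumption based on the NSX-T Manager API and should be checked against your version's API guide; NSX_MGR, NODE_ID, and the down_tunnels helper are hypothetical:

```shell
#!/bin/sh
# Sketch: confirm tunnel recovery after the physical fix (Step 7).
# The API path is an assumption; NSX_MGR and NODE_ID are placeholders:
#
#   curl -sk -u admin "https://$NSX_MGR/api/v1/transport-nodes/$NODE_ID/tunnels"

down_tunnels() {
  # Count tunnels whose status is reported as DOWN in the JSON response.
  # (grep -c exits non-zero when the count is 0, i.e. all tunnels are up.)
  grep -c '"status" *: *"DOWN"'
}
```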