NSX tunnel failures and connectivity loss after physical switch replacement or maintenance

Article ID: 421785

Updated On:

Products

VMware NSX

Issue/Introduction

After a planned maintenance window to replace or upgrade physical core switches connected to an NSX environment, access devices are unable to reach production VLANs. Symptoms include:

  • NSX tunnel failures or tunnel status showing as down
  • ARP requests going unanswered in the virtual environment
  • Intermittent or complete loss of connectivity to VMs on NSX segments
  • ESXi hosts showing disconnected or not responding in vCenter

These symptoms may initially appear to be NSX-related, but the underlying cause is often in the physical switching infrastructure rather than NSX itself. This article provides a methodology to isolate whether the issue resides in NSX or the physical network.

Environment

  • VMware NSX
  • VMware vSphere ESXi
  • Physical switch replacement, upgrade, or configuration change (any vendor)

Cause

The physical switching infrastructure has an underlying connectivity issue introduced during or after the maintenance window. Common causes include:

  • Virtual Port Channel (vPC) or Multi-Chassis Link Aggregation (MLAG) misconfiguration
  • Hot Standby Router Protocol (HSRP) or Virtual Router Redundancy Protocol (VRRP) failover issues, where the standby (HSRP) or backup (VRRP) router has unexpectedly taken over the active role
  • Spanning Tree Protocol (STP) reconvergence issues
  • Trunk or VLAN configuration discrepancies on the new switches
  • Physical cabling or port channel membership issues

NSX tunnel failures are a downstream symptom of this underlying physical connectivity instability, not the root cause.

Resolution

Use the following process of elimination to determine whether the issue is in NSX or the physical switching infrastructure:

Step 1: Test management network connectivity from ESXi hosts

SSH to an affected ESXi host and ping the local gateway for the management network:

ping <management_gateway_IP>

If the management VMkernel interface is on a regular vSphere Distributed Switch (VDS) port group rather than an NSX-managed segment, management traffic does not traverse NSX. Intermittent ping failures here therefore point to the physical network upstream, not to NSX.
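Because the failures are often intermittent, a single ping can be misleading. The following is a minimal sketch for sampling packet loss over a longer run, assuming the ESXi shell's ping prints the usual "N% packet loss" summary line. The function names ping_loss_pct and sample_gateway and the default count of 20 are illustrative and are not part of any VMware tool.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not VMware tooling).

# Read ping output on stdin and print the packet-loss percentage
# taken from the summary line, e.g. "30" for "30% packet loss".
ping_loss_pct() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Send COUNT pings to GATEWAY and report the loss percentage.
# Usage: sample_gateway <management_gateway_IP> [count]
sample_gateway() {
    gateway="$1"
    count="${2:-20}"
    loss=$(ping -c "$count" "$gateway" 2>&1 | ping_loss_pct)
    echo "gateway=$gateway sent=$count loss=${loss:-unknown}%"
}
```

Running sample_gateway <management_gateway_IP> 100 several times during the problem window gives a rough picture of whether loss is constant or comes in bursts.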

Step 2: Test connectivity from the physical switches

Access the core switches and test connectivity to the same gateway or upstream devices. If connectivity is intermittent or failing from the switch itself, the problem is in the physical switching infrastructure or further upstream.

Step 3: Check for HSRP/VRRP state changes

Check whether the HSRP active/standby or VRRP master/backup roles have changed unexpectedly after the maintenance. An unplanned failover can indicate underlying issues with the switching configuration or with connectivity between the switch peers.

Step 4: Check whether the issue persists regardless of the switches in place

If the same configuration was applied to the new switches and the issue persists even after reverting to the original switches, the cause is likely a deeper infrastructure or capability issue rather than a simple misconfiguration introduced during the migration.

Step 5: Isolate NIC teaming as a variable

Temporarily disable NIC teaming on both the ESXi hosts and the physical switches to rule out teaming-related issues. For guidance, see: How to configure NIC teaming in ESXi and ESX
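Before and after changing the teaming configuration, it can help to confirm which uplinks the host actually sees as up. The sketch below filters the output of esxcli network nic list. It assumes the column order Name, PCI Device, Driver, Admin Status, Link Status, which may differ between releases, and the function name down_uplinks is hypothetical.

```shell
#!/bin/sh
# Print the name of every vmnic whose Admin or Link status is not "Up".
# Reads `esxcli network nic list` output on stdin.
# Assumption: columns are Name, PCI Device, Driver, Admin Status, Link Status, ...
down_uplinks() {
    awk '$1 ~ /^vmnic[0-9]+$/ && ($4 != "Up" || $5 != "Up") { print $1 }'
}

# Usage on an ESXi host:
#   esxcli network nic list | down_uplinks
```

An uplink that appears here after the maintenance but was healthy before it is a reasonable first candidate for a cabling or port channel membership check.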

Step 6: Capture packets at the ESXi uplink layer

Run packet captures on the ESXi host using the pktcap-uw tool to observe traffic at the vmnic (uplink) level. This helps determine whether traffic is leaving the host but not returning, which would indicate a physical network issue. For guidance, see: Using the pktcap-uw tool in ESXi

Example command to capture on vmnic0:

pktcap-uw --uplink vmnic0 -o /tmp/vmnic0_capture.pcap
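To see whether traffic leaves the host but never returns, the capture needs to cover both directions. The wrapper below is a sketch that adds the --dir option (0 for receive, 1 for transmit, 2 for both) and a packet limit with -c. It assumes an ESXi release where --dir 2 is accepted (understood to be 6.7 and later). The function name capture_uplink, the 1000-packet limit, and the output file naming are illustrative choices.

```shell
#!/bin/sh
# Build (and optionally run) a pktcap-uw capture on one uplink.
# dir: 0 = receive, 1 = transmit, 2 = both (assumed to need ESXi 6.7 or later).
# Set DRY_RUN=1 to print the command instead of running it.
# Usage: capture_uplink <vmnic> [dir]
capture_uplink() {
    nic="$1"
    dir="${2:-2}"
    out="/tmp/${nic}_dir${dir}.pcap"
    # Replace the function's positional parameters with the full command line.
    set -- pktcap-uw --uplink "$nic" --dir "$dir" -c 1000 -o "$out"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

The resulting file can be read on the host with tcpdump-uw -enr /tmp/vmnic0_dir2.pcap or copied off and opened in Wireshark. ARP requests leaving the uplink with no replies coming back are consistent with a physical network problem.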

Step 7: Engage the switch vendor if physical infrastructure is identified as the cause

If the above steps indicate that the issue is in the physical switching infrastructure, engage the switch vendor support team to troubleshoot the vPC/MLAG, HSRP/VRRP, STP, or trunk configurations. Once physical connectivity is stable, NSX tunnels recover automatically.
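To confirm that tunnels have in fact recovered, tunnel endpoint (TEP) reachability can be tested directly from an ESXi host. The sketch below only prints a vmkping command for review. It assumes the TEP VMkernel interfaces live in the netstack named vxlan and that the transport network uses an MTU of 1600; both should be checked against the actual environment. The function names are hypothetical.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print a vmkping command that tests TEP-to-TEP reachability with the
# don't-fragment bit set (-d) and a payload sized to the transport MTU.
# Assumptions: netstack name "vxlan", default MTU 1600.
# Usage: tep_ping_cmd <remote_TEP_IP> [mtu]
tep_ping_cmd() {
    remote_tep="$1"
    mtu="${2:-1600}"
    echo "vmkping ++netstack=vxlan -d -s $(icmp_payload_for_mtu "$mtu") $remote_tep"
}
```

For example, tep_ping_cmd <remote_TEP_IP> prints a command with -s 1572. If that ping succeeds between hosts behind different switches, the underlay is carrying full-size tunnel traffic again.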