Performance Degradation for VMs on Newly Added Leaf Switches with NSX Edge Nodes on Existing Infrastructure

Products

VMware NSX

Issue/Introduction

After adding new leaf switches to expand VMware infrastructure with NSX, virtual machines running on ESXi hosts connected to the new leaf switches experience severe performance degradation. Symptoms include:

VMs on new hosts achieve only a fraction of the throughput that VMs on existing hosts achieve when sending traffic out to the WAN
Performance degradation affects only north-south (WAN) traffic
East-west traffic between VMs performs normally with expected throughput
The issue manifests when Edge nodes remain on ESXi hosts connected to the original leaf switches
The same VM image performs differently depending on which host it runs on

Steps to verify the issue:

Run iperf tests between VMs on the same segment - Performance should be normal
Run iperf tests between VMs on different segments (through Tier-1) - Performance should be normal
Run throughput tests to external destinations from VMs on new hosts - Performance will be significantly degraded
Run the same throughput tests from VMs on original hosts - Performance will be normal
Check Edge node placement - Edge nodes will be on hosts connected to the original leaf switches
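
The comparison in the steps above can be scripted. The following is a minimal sketch, assuming the tests are run with iperf3 and its JSON output (-J). The function names, the result keys, and the 0.5 ratio used to decide "significantly degraded" are illustrative assumptions, not values from VMware.

```python
import json


def gbps_from_iperf3_json(text):
    """Return receiver-side TCP throughput in Gbit/s from `iperf3 -J` output."""
    report = json.loads(text)
    return report["end"]["sum_received"]["bits_per_second"] / 1e9


def matches_symptom_pattern(results, ratio=0.5):
    """Check measured Gbit/s values against the symptom pattern in this article.

    `results` keys (hypothetical names chosen for this sketch):
      baseline          expected east-west throughput
      ew_same_segment   VM to VM on the same segment, from a new host
      ew_cross_segment  VM to VM across segments (through Tier-1)
      ns_new_host       external destination, VM on a new host
      ns_original_host  external destination, VM on an original host
    `ratio` is an assumed cutoff, not a documented threshold.
    """
    east_west_ok = (
        results["ew_same_segment"] >= ratio * results["baseline"]
        and results["ew_cross_segment"] >= ratio * results["baseline"]
    )
    north_south_degraded = (
        results["ns_new_host"] < ratio * results["ns_original_host"]
    )
    return east_west_ok and north_south_degraded
```

If the function returns True and the Edge nodes are on hosts attached to the original leaf switches, the environment matches the pattern described here. If east-west traffic is also slow, a different cause should be investigated.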

Environment

VMware NSX
VMware vSphere ESXi
Spine-leaf network architecture
Edge nodes deployed on ESXi hosts connected to original leaf switches
New ESXi hosts connected to newly added leaf switches

Cause

The performance degradation is most likely caused by a fault in the physical network path, such as a duplex mismatch, an Ethernet configuration problem, or a vPC (virtual port channel) issue within the switching infrastructure. When Edge nodes remain on ESXi hosts connected to the original leaf switches while compute workloads run on hosts connected to the new leaf switches, north-south traffic takes a longer path through the fabric, which makes these faults more apparent.

Traffic path for VMs on original hosts (normal performance):

VM → Original Leaf → Edge Host → Tier-0/Tier-1 → Original Leaf → Spine → Border Leaf → External

Traffic path for VMs on new hosts (degraded performance):

VM → New Leaf → Spine → Original Leaf → Edge Host → Original Leaf → Spine → Border Leaf → External
Return traffic: External → Border Leaf → Spine → Original Leaf → Edge Host → Original Leaf → Spine → New Leaf → VM

The additional traversal of the spine layer and the resulting higher hop count amplify any existing network configuration issues. The problem lies in the physical network infrastructure between the new and existing leaf switches, not in NSX functionality.
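
To make the difference between the two forward paths concrete, the sketch below counts physical switch hops and spine traversals directly from the path strings listed above. The helper name and the rule for what counts as a switch (any element ending in "Leaf", or "Spine") are assumptions made for illustration.

```python
ORIGINAL_HOST_PATH = (
    "VM → Original Leaf → Edge Host → Tier-0/Tier-1 → "
    "Original Leaf → Spine → Border Leaf → External"
)
NEW_HOST_PATH = (
    "VM → New Leaf → Spine → Original Leaf → Edge Host → "
    "Original Leaf → Spine → Border Leaf → External"
)


def path_stats(path):
    """Count physical switch hops and spine traversals in a path string."""
    hops = [hop.strip() for hop in path.split("→")]
    # Assumption for this sketch: leaf, border leaf, and spine are switches.
    switches = [h for h in hops if h.endswith("Leaf") or h == "Spine"]
    return {
        "switch_hops": len(switches),
        "spine_traversals": hops.count("Spine"),
    }


for label, path in (("original", ORIGINAL_HOST_PATH), ("new", NEW_HOST_PATH)):
    print(label, path_stats(path))
```

For the paths as written, this gives four switch hops and one spine traversal for VMs on original hosts, against six switch hops and two spine traversals for VMs on new hosts. Each extra link is another place where a duplex or port channel fault can affect the flow.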

Resolution

To isolate and resolve the issue:

1. Confirm NSX is functioning correctly:

Deploy a test Edge cluster with both Tier-0 and Tier-1 on an ESXi host connected to the new leaf switches
Connect a test VM to a segment attached to this new Tier-1
Perform throughput tests from this configuration
If performance matches expectations, NSX is functioning correctly and the inter-leaf switching path is the likely root cause

2. Address Edge node placement:

Option A: Deploy additional Edge nodes on hosts connected to new leaf switches for symmetric traffic flow
Option B: Migrate existing Edge nodes to hosts that provide optimal traffic paths for the majority of workloads

3. Review MTU configuration:

Ensure consistent MTU settings across all network components
Verify jumbo frames are properly configured end-to-end if in use
Refer to VMware guidance for proper MTU configuration
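
The MTU review can be supported with a small calculation. The sketch below assumes IPv4 and that MTU values have already been collected by hand from each component. The component names in the usage note are hypothetical. The 1600-byte floor reflects the minimum underlay MTU that NSX documentation calls for with Geneve overlay traffic; confirm it against the guidance for the NSX version in use.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header
OVERLAY_MIN_MTU = 1600  # assumed floor for Geneve overlay transport


def df_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def review_mtus(path_mtus, minimum=OVERLAY_MIN_MTU):
    """Summarize MTU values recorded for each component on the path.

    Differing values are not always a fault (switches are often set
    higher than hosts); the smallest value is what limits the path.
    """
    effective = min(path_mtus.values())
    return {
        "effective_mtu": effective,
        "consistent": len(set(path_mtus.values())) == 1,
        "below_minimum": sorted(
            name for name, mtu in path_mtus.items() if mtu < minimum
        ),
        "df_ping_payload": df_ping_payload(effective),
    }
```

For example, review_mtus({"vds": 9000, "new_leaf": 1500, "spine": 9216}) reports an effective MTU of 1500 and flags new_leaf as below the minimum. The df_ping_payload value is the size to use in a don't-fragment ping between TEP interfaces, for instance 1572 bytes for a 1600-byte MTU with a command of the form vmkping ++netstack=vxlan -d -s 1572 followed by the remote TEP address.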

4. Address any Edge node resource constraints:

Check CPU ready time (%RDY in esxtop) for the Edge node VMs - elevated values indicate CPU contention on the host
Redistribute Edge workloads if necessary to improve quality of service
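
When CPU ready is read from vCenter performance charts rather than esxtop, it is reported as a summation in milliseconds and has to be converted to a percentage. The sketch below applies the commonly published conversion, assuming the 20-second real-time chart interval. The function names are hypothetical, and the 5% per-vCPU threshold is a rule-of-thumb assumption, not a limit stated in this article.

```python
REALTIME_INTERVAL_S = 20  # vCenter real-time chart sample interval


def cpu_ready_percent(ready_ms, interval_s=REALTIME_INTERVAL_S, vcpus=1):
    """Convert a CPU ready summation (ms) to a per-vCPU percentage."""
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


def is_contended(ready_ms, vcpus, threshold_pct=5.0,
                 interval_s=REALTIME_INTERVAL_S):
    """Flag an Edge VM whose per-vCPU ready time exceeds the assumed threshold."""
    return cpu_ready_percent(ready_ms, interval_s, vcpus) > threshold_pct
```

As a worked example, a ready summation of 1000 ms in a 20-second sample on a single vCPU is 5%. The same 1000 ms spread across a 4-vCPU Edge VM is 1.25% per vCPU.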

5. Work with the network team to optimize the leaf-spine configuration:

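Check for duplex mismatches, Ethernet configuration problems, and vPC issues on the path between the new and existing leaf switches
Review routing configuration between leaf and spine switches
Verify proper traffic flow for communication between new and existing leaf switches
Validate that asymmetric paths are not causing bottlenecks in the switching infrastructure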