Performance Degradation for VMs on Newly Added Leaf Switches with NSX Edge Nodes on Existing Infrastructure

Article ID: 417634

Updated On:

Products

VMware NSX

Issue/Introduction

After adding new leaf switches to expand VMware infrastructure with NSX, virtual machines running on ESXi hosts connected to the new leaf switches experience severe performance degradation. Symptoms include:

  • VMs on new hosts achieve only a fraction of the throughput that VMs on existing hosts achieve when communicating with destinations across the WAN
  • Performance degradation affects only north-south (WAN) traffic
  • East-west traffic between VMs performs normally with expected throughput
  • Issue manifests when Edge nodes remain on ESXi hosts connected to original leaf switches
  • Same VM image performs differently based on which host it runs on

Steps to verify the issue:

  1. Run iperf tests between VMs on the same segment - Performance should be normal
  2. Run iperf tests between VMs on different segments (routed through Tier-1) - Performance should be normal
  3. Run throughput tests to external destinations from VMs on the new hosts - Performance will be significantly degraded
  4. Run the same throughput tests from VMs on the original hosts - Performance will be normal
  5. Check Edge node placement - Edge nodes will be on hosts connected to the original leaf switches
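The throughput comparison in steps 3 and 4 can be scripted. The following is a minimal sketch, assuming iperf3 is installed on the test VMs and that an iperf3 server is reachable at an external address you supply. The function names are hypothetical, and the article itself only says "iperf", so the use of iperf3 and its JSON report is an assumption.

```python
import json
import subprocess


def parse_iperf3_bps(report_json: str) -> float:
    """Return the received TCP throughput (bits per second) from an iperf3 -J report."""
    report = json.loads(report_json)
    return float(report["end"]["sum_received"]["bits_per_second"])


def run_iperf3(server: str, seconds: int = 10) -> float:
    """Run a TCP iperf3 client test against `server` and return bits per second.

    `server` is whatever external iperf3 server you have available (assumption).
    """
    result = subprocess.run(
        ["iperf3", "-c", server, "-t", str(seconds), "-J"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_iperf3_bps(result.stdout)


def throughput_ratio(new_host_bps: float, original_host_bps: float) -> float:
    """Ratio of new-host to original-host throughput; 1.0 means parity."""
    return new_host_bps / original_host_bps
```

Run `run_iperf3` once from a VM on a new host and once from a VM on an original host against the same external server, then compare the two results with `throughput_ratio`. A ratio well below 1.0 for north-south tests, combined with normal east-west results, matches the symptom described above.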

Environment

  • VMware NSX
  • VMware vSphere ESXi
  • Spine-leaf network architecture
  • Edge nodes deployed on ESXi hosts connected to original leaf switches
  • New ESXi hosts connected to newly added leaf switches

Cause

The performance degradation is likely caused by an underlying issue along the physical network path, such as a duplex mismatch, an Ethernet configuration problem, or a vPC (virtual port channel) issue within the switching infrastructure. When Edge nodes remain on ESXi hosts connected to the original leaf switches while compute workloads run on hosts connected to the new leaf switches, the asymmetric traffic paths and increased hop count make these issues more apparent.

Traffic path for VMs on original hosts (normal performance):

  • VM → Original Leaf → Edge Host → Tier-0/Tier-1 → Original Leaf → Spine → Border Leaf → External

Traffic path for VMs on new hosts (degraded performance):

  • VM → New Leaf → Spine → Original Leaf → Edge Host → Original Leaf → Spine → Border Leaf → External
  • Return traffic: External → Border Leaf → Spine → Original Leaf → Edge Host → Original Leaf → Spine → New Leaf → VM

The additional traversal of the spine layer and the higher hop count in each direction amplify any existing network configuration issues. The problem lies in the physical network infrastructure between the new and existing leaf switches, not in NSX functionality.
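To make the path difference concrete, here is a small sketch that counts physical switch hops and spine traversals for the two outbound paths listed above. The lists copy the nodes from the article, except that the logical Tier-0/Tier-1 step is left out because it is not a physical hop. The function names are hypothetical.

```python
# Outbound paths as listed in the article (logical Tier-0/Tier-1 step omitted).
ORIGINAL_HOST_PATH = [
    "VM", "Original Leaf", "Edge Host", "Original Leaf",
    "Spine", "Border Leaf", "External",
]
NEW_HOST_PATH = [
    "VM", "New Leaf", "Spine", "Original Leaf", "Edge Host",
    "Original Leaf", "Spine", "Border Leaf", "External",
]

# Node names ending in one of these roles are treated as physical switches.
SWITCH_ROLES = ("Leaf", "Spine")


def switch_hops(path: list[str]) -> int:
    """Count how many physical switches the path passes through."""
    return sum(1 for node in path if node.endswith(SWITCH_ROLES))


def spine_traversals(path: list[str]) -> int:
    """Count how many times the path crosses the spine layer."""
    return path.count("Spine")


if __name__ == "__main__":
    for name, path in (("original", ORIGINAL_HOST_PATH), ("new", NEW_HOST_PATH)):
        print(name, switch_hops(path), spine_traversals(path))
```

With these lists, the path from an original host crosses four switches and the spine once, while the path from a new host crosses six switches and the spine twice. Every extra link is another place where a duplex or port-channel fault can affect the flow.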

Resolution

To isolate and resolve the issue:

1. Confirm NSX is functioning correctly:

  • Deploy a test Edge cluster with both Tier-0 and Tier-1 on an ESXi host connected to the new leaf switches
  • Connect a test VM to a segment attached to this new Tier-1
  • Perform throughput tests from this configuration
  • If performance matches expectations, NSX is functioning correctly and the inter-leaf switching path is the likely root cause

2. Address Edge node placement:

  • Option A: Deploy additional Edge nodes on hosts connected to new leaf switches for symmetric traffic flow
  • Option B: Migrate existing Edge nodes to hosts that provide optimal traffic paths for the majority of workloads

3. Review MTU configuration:

  • Ensure consistent MTU settings across all network components
  • Verify jumbo frames are properly configured end-to-end if in use
  • Refer to VMware guidance for proper MTU configuration
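One way to check MTU end-to-end is to send a ping with the don't-fragment bit set and a payload sized to fill the MTU exactly. The sketch below computes that payload size for IPv4 and flags components whose MTU is below a required value. The component names and MTU values in the usage note are hypothetical examples, not values from this article.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo")
    return mtu - overhead


def components_below(mtus: dict[str, int], required_mtu: int) -> list[str]:
    """Return the names of components whose MTU is below `required_mtu`, sorted."""
    return sorted(name for name, mtu in mtus.items() if mtu < required_mtu)
```

For example, `max_ping_payload(9000)` gives 8972, which is the size to pass to a don't-fragment ping (on Linux, `ping -M do -s 8972 <destination>`). Calling `components_below({"new-leaf": 1500, "spine": 9216, "original-leaf": 9216}, 9000)` would single out the hypothetical `new-leaf` entry as the inconsistent component.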

4. Address any Edge node resource constraints:

  • Check CPU ready time (%RDY) on Edge nodes - elevated values indicate resource contention
  • Redistribute Edge workloads if necessary to improve quality of service
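esxtop reports %RDY directly as a percentage, but vCenter performance charts report CPU ready as a summation in milliseconds per sampling interval. The following minimal sketch converts a summation value to a percentage so the two views can be compared. The 20-second default assumes the real-time chart interval. The function name is hypothetical, and per-vCPU normalization is left out.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a CPU ready summation (milliseconds) to a percentage of the interval.

    interval_s is the chart sampling interval in seconds; 20 s is assumed here
    for real-time charts. Use the interval that matches the chart you read from.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0)
```

For example, a summation of 1000 ms over a 20-second interval corresponds to 5% ready time.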

5. Work with the network team to optimize the leaf-spine configuration:

  • Review routing configuration between leaf and spine switches
  • Verify proper traffic flow for communication between new and existing leaf switches
  • Validate that asymmetric paths are not causing bottlenecks in the switching infrastructure