Intermittent East-West Connectivity Drops and HA Failovers due to ESXi CPU Ready Spikes

Article ID: 437894

Products

VMware NSX

Issue/Introduction

  • Intermittent east-west communication loss between two FortiGate VMs residing on the same ESXi host.
  • Repeated High Availability (HA) failovers between the firewall nodes.
  • Packet drops even though the traffic stays local to the host and never traverses the NSX TEP overlay.

Environment

VMware NSX

Cause

The issue is caused by CPU oversubscription on the ESXi host. High CPU Ready time means that the FortiGate vCPUs are ready to run but are waiting for the ESXi scheduler to give them a physical CPU. While a vCPU is starved of cycles, the guest cannot drain incoming packets from the virtual switch fast enough. Packets are dropped, HA heartbeats are missed, and the cluster fails over.
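
vCenter performance charts report CPU Ready as a summation in milliseconds per sample interval (20 seconds for real-time charts), which is easier to judge once converted to a percentage. The following is a minimal sketch of that conversion, assuming the summation covers all of the VM's vCPUs and should be averaged evenly across them. The helper name cpu_ready_pct is hypothetical and is not a VMware tool.

```shell
#!/bin/sh
# cpu_ready_pct (hypothetical helper): convert a vCenter CPU Ready summation
# value in milliseconds into a Ready percentage.
#   $1 = CPU Ready summation in ms for one sample
#   $2 = sample interval in seconds (20 for vCenter real-time charts)
#   $3 = number of vCPUs to average across (optional, defaults to 1)
cpu_ready_pct() {
  awk -v ms="$1" -v secs="$2" -v vcpus="${3:-1}" 'BEGIN {
    if (secs <= 0 || vcpus <= 0) {
      print "interval and vCPU count must be > 0" > "/dev/stderr"
      exit 1
    }
    # Ready % = ready time / interval length * 100, then averaged per vCPU
    printf "%.2f\n", (ms / (secs * 1000)) * 100 / vcpus
  }'
}

# Example: 4000 ms of Ready time in one 20 s real-time sample on a 4-vCPU VM
cpu_ready_pct 4000 20 4    # prints 5.00 (per-vCPU average)
```

For historical chart rollups, pass the rollup interval instead of 20 (for example 300 for a 5-minute sample).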

Checks run with vsish from the ESXi host shell confirmed that the VMXNET3 receive ring buffers were not exhausted. This ruled out buffer overflows and isolated the drops to CPU scheduling delays:

vsish -e get /net/portsets/<Portset_ID>/ports/<Port_ID>/vmxnet3/rxSummary | grep "1st ring"
vsish -e get /net/portsets/<Portset_ID>/ports/<Port_ID>/vmxnet3/rxSummary | grep "running out of buffers"
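
When many vNIC ports need checking, the two greps can be folded into one filter that flags only non-zero counters. The sketch below assumes that rxSummary prints one "label:value" counter per line and that the ring-full counter's label contains the words "ring is full". Both are assumptions about the output layout, so verify them against your own vsish output. The helper name check_rx_counters is hypothetical.

```shell
#!/bin/sh
# check_rx_counters (hypothetical helper): read vsish rxSummary text on stdin,
# assuming one "label:value" counter per line, and print every
# buffer-exhaustion or ring-full counter whose value is non-zero.
# Exit status: 0 if all matched counters are zero, 1 otherwise.
check_rx_counters() {
  awk -F: '
    /running out of buffers|ring is full/ {
      label = $1
      sub(/^[ \t]+/, "", label)      # strip leading indentation
      value = $NF
      gsub(/[ \t]/, "", value)       # strip padding around the number
      if (value + 0 > 0) { print label ": " value; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

On the host, pipe the article's command into it, for example: vsish -e get /net/portsets/<Portset_ID>/ports/<Port_ID>/vmxnet3/rxSummary | check_rx_counters. Empty output with exit status 0 matches the finding above: the ring buffers were not exhausted.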

Resolution

To remediate host-level resource contention and prevent false-positive HA failovers, perform the following steps:

  1. Configure Resource Reservations:
    • In vCenter, navigate to the FortiGate VMs.
    • Select Edit Settings > Virtual Hardware, then expand CPU and Memory.
    • Configure CPU and Memory Reservations to guarantee physical resource availability for the security appliances.
  2. Optimize VM Placement:
    • Identify other high-packet-processing VMs (such as NSX Edge nodes) residing on the same host.
    • Migrate these workloads to different ESXi hosts to reduce localized CPU and network scheduling contention.
  3. Scale Cluster Capacity:
    • Evaluate the overall cluster utilization.
    • If oversubscription persists, introduce additional ESXi hosts to the cluster to distribute the load.
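
The arithmetic behind steps 1 and 3 can be sketched as two small helpers. The sketch assumes that a "full" CPU reservation means every vCPU reserved at the host's physical core clock, and that oversubscription is measured as total vCPUs per physical core with hyperthreads excluded. These are common sizing conventions, not values taken from this article. A full memory reservation equals the VM's configured memory, which the vSphere Client exposes as "Reserve all guest memory (All locked)". Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sizing helpers; names and conventions are illustrative only.

# full_cpu_reservation_mhz: reserve every vCPU at the physical core clock.
#   $1 = vCPUs configured on the FortiGate VM
#   $2 = physical core clock in MHz
full_cpu_reservation_mhz() {
  echo $(( $1 * $2 ))
}

# vcpu_pcore_ratio: vCPU-to-physical-core ratio for a host or cluster,
# used here as a rough oversubscription indicator.
#   $1 = total vCPUs of powered-on VMs
#   $2 = physical cores (hyperthreads excluded)
vcpu_pcore_ratio() {
  awk -v v="$1" -v p="$2" 'BEGIN {
    if (p <= 0) { print "physical core count must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", v / p
  }'
}

full_cpu_reservation_mhz 4 2600   # prints 10400
vcpu_pcore_ratio 96 32            # prints 3.00
```

Reserving the full amount is the most conservative choice. A smaller reservation still guarantees part of the CPU time, but it may not remove Ready spikes entirely.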

Additional Information

Troubleshooting VM performance