NSX Edge Performance Issues and High CPU on Datapath and BGP

Article ID: 440105

Updated On:

Products

VMware NSX

Issue/Introduction

An NSX Edge node experiences significant performance degradation, characterized by spikes in CPU utilization. This behavior is typically accompanied by the following symptoms:

  • The datapathd process consumes excessive CPU (e.g., >200%), impacting the fast-path processing.
  • The bgpd process shows high CPU usage (e.g., >190%), affecting routing stability.
  • Management agents fail to communicate with the datapath, resulting in the following error in syslog:
    unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl.
  • The Edge Health monitor reports Event ID: mega_flow_cache_hit_rate_low, indicating a decrease in Flow Cache hit rates while CPU usage remains high.
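
As a rough aid for confirming the third symptom, the sketch below counts occurrences of the dpd.ctl connection warning in a set of log lines. It is a minimal illustration only: the function name and the sample lines are hypothetical, and real syslog entries carry additional fields (timestamps, host names) that this pattern simply ignores.

```python
import re

# Pattern for the datapath control-socket warning quoted in this article.
DPD_CTL_FAILURE = re.compile(
    r"unixctl\|WARN\|failed to connect to /var/run/vmware/edge/dpd\.ctl"
)


def count_dpd_ctl_failures(lines):
    """Return how many log lines contain the dpd.ctl connection warning."""
    return sum(1 for line in lines if DPD_CTL_FAILURE.search(line))


# Hypothetical sample lines, not taken from a real Edge node.
sample = [
    "unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl",
    "some unrelated log line",
    "prefix text unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl",
]
print(count_dpd_ctl_failures(sample))
```

To use it against an exported log file, pass the open file object directly, since iterating over a file yields one line at a time.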

Environment

VMware NSX

Cause

The issue is caused by high Datapath CPU utilization, which prevents timely processing of flow-cache entries and of management communication. The high utilization is often driven by:

  1. High Flow Creation: A continuous stream of new flows forcing the first packet of each flow into the "slow path" for cache setup.

  2. Resource Contention: Physical CPU contention or spikes on the underlying ESXi host where the Edge VM resides.
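
To illustrate why a high flow creation rate lowers the Flow Cache hit rate, the toy model below replays a sequence of flow keys through a simple cache in which the first packet of each flow is a miss (slow path) and later packets of the same flow are hits. This is a simplified sketch, not the Edge datapath implementation: the cache here is unbounded, never ages out entries, and the function name and traffic patterns are hypothetical.

```python
def flow_cache_hit_rate(packets):
    """Replay flow keys through a toy flow cache and return the hit rate.

    Each element of `packets` is the flow key of one packet. The first
    packet of a flow misses and installs a cache entry; subsequent
    packets of that flow hit the cache.
    """
    cache = set()
    hits = 0
    total = 0
    for flow_key in packets:
        total += 1
        if flow_key in cache:
            hits += 1
        else:
            cache.add(flow_key)  # slow path: install the entry
    return hits / total if total else 0.0


# 10 long-lived flows, 100 packets each: only 10 of 1000 packets miss.
steady = [i % 10 for i in range(1000)]
# 500 short-lived flows, 2 packets each: 500 of 1000 packets miss.
churn = [i // 2 for i in range(1000)]

print(flow_cache_hit_rate(steady))  # 0.99
print(flow_cache_hit_rate(churn))   # 0.5
```

With the same packet count, the short-lived pattern sends fifty times as many packets through the slow path, which is the kind of load that keeps datapath CPU high while the hit rate falls.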

Resolution

To resolve this issue, follow these steps to scale the environment or address host-level constraints:

  1. Monitor Datapath Memory: Log in as the root user on the Edge CLI and check DPDK memory and mempool usage:

  2. Evaluate Underlying Host: Check the ESXi host for CPU ready time or physical resource exhaustion. If the host is experiencing spikes, migrate the Edge VM to a host with more available resources.

  3. Scale Edge Resources: If CPU usage remains consistently high because of traffic volume:

    • Increase the Edge appliance size (e.g., from Large to Extra Large).

    • Increase the number of Edge nodes in an Active/Active Gateway cluster to better distribute the load.
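
When evaluating the underlying host (step 2), vSphere performance charts report CPU ready as a summation in milliseconds per sampling interval, which is easier to judge as a percentage. The helper below is a minimal sketch of that conversion, assuming the 20-second interval used by real-time charts; the function and parameter names are hypothetical, and dividing by the vCPU count is appropriate only when the value is the VM-level aggregate rather than a per-vCPU counter.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) to an average percentage per vCPU.

    ready_ms   -- CPU ready summation value for one sampling interval
    interval_s -- chart sampling interval in seconds (20 for real-time charts)
    vcpus      -- number of vCPUs covered by the summation value
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms * 100 / (interval_s * 1000 * vcpus)


print(cpu_ready_percent(1000))           # 5.0
print(cpu_ready_percent(4000, vcpus=8))  # 2.5
```

Comparing the result before and after migrating the Edge VM gives a simple way to confirm whether host contention was contributing to the datapath CPU spikes.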

Additional Information

For persistent host-level CPU contention, please engage your ESXi host management team.