NSX Edge Node High CPU Utilization on Datapath and BGP


Article ID: 440105


Products

VMware NSX

Issue/Introduction

An NSX Edge node experiences significant performance degradation characterized by high CPU utilization spikes.

The datapathd process consumes excessive CPU (for example, more than 200%), which impacts fast-path packet processing. The bgpd process also shows high CPU usage (for example, more than 190%), which affects routing stability. Management agents fail to communicate with the datapath, and the following warning is logged in the syslog.log file:

unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl

The Edge Health monitor reports Event ID mega_flow_cache_hit_rate_low, indicating a decrease in Mega Flow Cache hit rates while CPU usage remains high.
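
One way to confirm the syslog symptom described above is to count how often the warning appears in a copy of the log. The following Python sketch is illustrative only: the function name and the sample lines are hypothetical, and only the warning text itself is taken from this article.

```python
# Warning text as quoted in this article. The rest of each log line
# (timestamp, host, process fields) varies, so only this substring is matched.
DPD_CTL_WARNING = "unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl"


def count_dpd_ctl_warnings(lines):
    """Return how many of the given log lines contain the dpd.ctl warning."""
    return sum(1 for line in lines if DPD_CTL_WARNING in line)


if __name__ == "__main__":
    # Hypothetical sample lines; real syslog.log entries carry additional fields.
    sample = [
        "<prefix> " + DPD_CTL_WARNING,
        "<prefix> unrelated message",
        "<prefix> " + DPD_CTL_WARNING,
    ]
    print(count_dpd_ctl_warnings(sample))
```

To run the same check against an exported log, pass an open file object (for example, `open("syslog.log", errors="replace")`) instead of the sample list.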

Environment

VMware NSX

Cause

High datapath CPU utilization prevents the timely processing of flow-cache entries and of management communication.

The high utilization is driven by one or both of the following: a high flow-creation rate, where a continuous stream of new flows forces the first packet of each flow into the slow path for cache setup; or physical CPU contention and spikes on the underlying ESXi host where the Edge VM resides.
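
The effect of new-flow rate on cache hit rate can be illustrated with a deliberately simplified model. The sketch below assumes that exactly the first packet of every flow misses the cache and every later packet hits. It is not how the Edge computes its Mega Flow Cache metric, and the function name is hypothetical.

```python
def flow_cache_hit_rate(packets_per_flow):
    """Hit rate under a simplified model: the first packet of each flow
    takes the slow path (miss); all remaining packets are cache hits."""
    total_packets = sum(packets_per_flow)
    if total_packets == 0:
        return 0.0
    # One miss per flow that carried at least one packet.
    misses = sum(1 for packets in packets_per_flow if packets > 0)
    return (total_packets - misses) / total_packets


if __name__ == "__main__":
    # Same total packet count (10,000), very different flow counts.
    few_long_flows = [1000] * 10     # 10 flows, 1000 packets each
    many_short_flows = [2] * 5000    # 5000 flows, 2 packets each
    print(flow_cache_hit_rate(few_long_flows))    # 0.999
    print(flow_cache_hit_rate(many_short_flows))  # 0.5
```

With the same packet volume, a workload made of many short-lived flows spends far more packets on the slow path, which is consistent with a low hit rate accompanied by high datapath CPU.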

Resolution

To resolve this issue, perform the following steps to isolate and mitigate the performance constraints:

Perform Host-Level Evaluation: Check the underlying ESXi host for high CPU ready time or physical resource exhaustion. If the host is experiencing high resource contention, use the vSphere Client to migrate the Edge VM to a host with dedicated, unconstrained resources. To prevent North-South connectivity disruption, gracefully place the node into NSX maintenance mode before migrating it, and avoid unnecessary live Edge vMotion operations for troubleshooting.
Check DPDK Metrics: Log in to the Edge CLI as the root user and check DPDK memory and mempool usage by running:

get datapath memory

Validate that the memory pools are not exhausted.
Then clear the flow cache on the Edge node by running:

clear edge-datapath flowcache

Scale Edge Resources: If high CPU usage is consistently driven by traffic volume and new connection rates, scale the infrastructure:

Increase the Edge appliance form factor (for example, from Large to Extra Large).
Deploy additional Edge nodes and expand the Edge cluster that backs the Active/Active gateway to better distribute the datapath load.
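
For the host-level evaluation step, vSphere performance charts report CPU ready as a summation in milliseconds per sampling interval, which is easier to judge as a percentage. The sketch below applies the commonly documented conversion (summation divided by the interval length in milliseconds, times 100). It assumes the 20-second real-time chart interval by default, and the function name is hypothetical.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU ready summation (milliseconds) into a percentage
    of the sampling interval. Real-time vSphere charts sample every 20 s;
    pass a different interval_s for historical rollups."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return ready_ms * 100 / (interval_s * 1000)


if __name__ == "__main__":
    # 1000 ms of ready time within a 20 s real-time sample.
    print(cpu_ready_percent(1000))  # 5.0
```

A VM-level summation aggregates all vCPUs, so dividing the result by the vCPU count gives an approximate per-vCPU figure for comparison between Edge sizes.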

Additional Information

For detailed safety procedures regarding Edge placement and maintenance, refer to the NSX Edge vMotion Best Practices guide.
