An NSX Edge node experiences significant performance degradation, characterized by spikes in CPU utilization. This behavior is typically accompanied by the following symptoms:
Log message: unixctl|WARN|failed to connect to /var/run/vmware/edge/dpd.ctl
Alarm: mega_flow_cache_hit_rate_low, indicating a decrease in Flow Cache hit rates while CPU usage remains high.
The issue is caused by high Datapath CPU utilization, which prevents timely processing of flow-cache entries and management communication. This is often driven by:
High Flow Creation: A continuous stream of new flows forcing the first packet of each flow into the "slow path" for cache setup.
Resource Contention: Physical CPU contention or spikes on the underlying ESXi host where the Edge VM resides.
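The effect of high flow creation on the hit rate can be illustrated with a toy model. The sketch below is not the NSX datapath implementation: it assumes a simple LRU cache keyed by a flow identifier, and the function names, the capacity of 64, and the two traffic patterns are hypothetical. It only shows why a stream of brand-new flows drives the hit rate down, since the first packet of every flow misses the cache and takes the slow path.

```python
from collections import OrderedDict


def simulate_flow_cache(packets, capacity):
    """Count cache hits and misses for a sequence of flow keys.

    Toy model (assumption): an LRU cache where a miss stands for the
    slow-path setup done for the first packet of a flow.
    """
    cache = OrderedDict()
    hits = misses = 0
    for flow in packets:
        if flow in cache:
            hits += 1
            cache.move_to_end(flow)  # mark the flow as recently used
        else:
            misses += 1  # slow path: install a cache entry for the new flow
            cache[flow] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used flow
    return hits, misses


def hit_rate(hits, misses):
    """Fraction of packets served from the cache (0.0 when no packets were seen)."""
    total = hits + misses
    return hits / total if total else 0.0


# Hypothetical traffic: 10 long-lived flows versus 1000 single-packet flows.
long_lived = [i % 10 for i in range(1000)]
churn = list(range(1000))

print(hit_rate(*simulate_flow_cache(long_lived, capacity=64)))
print(hit_rate(*simulate_flow_cache(churn, capacity=64)))
```

In this model the long-lived pattern misses only once per flow, while the churn pattern misses on every packet, which mirrors the combination of a low hit rate and high CPU usage described above.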
To resolve this issue, follow these steps to scale the environment or address host-level constraints:
Monitor Datapath Memory: Log in as the root user on the Edge CLI and check DPDK memory and mempool usage:
Evaluate Underlying Host: Check the ESXi host for CPU ready time or physical resource exhaustion. If the host is experiencing spikes, migrate the Edge VM to a host with more available resources.
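When CPU ready time is read from a vCenter performance chart, it is reported as a summation in milliseconds rather than a percentage. A minimal sketch of the usual conversion follows, assuming the real-time chart interval of 20 seconds; the function name and the sample values are hypothetical. Dividing by the vCPU count gives an average per vCPU, because the VM-level summation aggregates all vCPUs.

```python
def cpu_ready_percent(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) into an average ready % per vCPU.

    ready_ms   -- ready summation reported for one chart sample
    interval_s -- chart sample interval in seconds (assumed 20 for real-time)
    vcpus      -- number of vCPUs the summation was aggregated over
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)


# Hypothetical samples: 1000 ms of ready time on one vCPU in a 20 s interval,
# and 4000 ms aggregated over an 8-vCPU Edge VM.
print(cpu_ready_percent(1000))
print(cpu_ready_percent(4000, vcpus=8))
```

What counts as an acceptable value depends on the workload. A few percent per vCPU is a commonly quoted rule of thumb rather than a limit stated in this article, and a packet-processing VM such as an Edge is likely to be more sensitive to ready time than a general-purpose VM.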
Scale Edge Resources: If the high CPU usage is consistent due to traffic volume:
Increase the Edge appliance size (e.g., from Large to Extra Large).
Increase the number of Edge nodes in an Active/Active Gateway cluster to better distribute the load.
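Whether scaling is the right response depends on whether the high CPU usage is consistent or only occasional. The sketch below shows one way to tell the two apart from a series of CPU utilization samples collected by any monitoring tool. The 80% threshold, the 0.8 fraction, and the function name are illustrative assumptions, not values defined by NSX.

```python
def classify_cpu_load(samples, threshold=80.0, sustained_fraction=0.8):
    """Label a series of CPU utilization samples (in percent).

    Returns "sustained" when most samples are at or above the threshold,
    "spiky" when only some are, and "normal" when none are.
    Both threshold and sustained_fraction are hypothetical defaults.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    above = sum(1 for s in samples if s >= threshold)
    if above / len(samples) >= sustained_fraction:
        return "sustained"
    if above > 0:
        return "spiky"
    return "normal"


# Hypothetical sample series.
print(classify_cpu_load([85, 90, 88, 92, 87]))
print(classify_cpu_load([30, 35, 95, 32, 31]))
```

A "sustained" result points toward the scaling options above, while a "spiky" result is a reason to look at host-level contention first.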
For persistent host-level CPU contention, please engage your ESXi host management team.