DNS Resolution Failures on NSX Edge due to SNAT Port Exhaustion

VMware NSX

Virtual machines within an NSX environment experience intermittent DNS resolution failures (timeouts).
UDP/53 traffic is dropped, while TCP and ICMP traffic are unaffected.
Edge node CPU utilization remains nominal/low.
Running get firewall <LR-UUID> stats shows reason-state-limit and reason-connection-limit are 0, indicating the Edge appliance has ample hardware resources.
Significant RX Drops are recorded on the Tier-0 or Tier-1 Service Router (SR) uplink interfaces.
Troubleshooting reveals that traffic is funneled through a single internal IP (e.g., an Internal DNS Forwarder or Load Balancer VIP).
The issue often correlates with the deployment or activity of telemetry, security, or EDR agents performing high-frequency DNS lookups across a large number of VMs.
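The UDP-versus-TCP symptom above can be checked from an affected VM with a small probe. The following is a minimal sketch using only the Python standard library, not an official diagnostic tool. The function names, the 2-second timeout, and the attempt count are illustrative assumptions. The server address and query name passed in are placeholders to be replaced with values from the affected environment.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS A-record query in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1).
    return header + qname + struct.pack("!HH", 1, 1)


def probe_udp(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if any UDP/53 reply arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            sock.recvfrom(4096)
            return True
        except OSError:  # includes timeouts
            return False


def probe_tcp(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP/53 reply (length prefix) arrives before the timeout."""
    query = build_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            # DNS over TCP prefixes each message with a 2-byte length.
            sock.sendall(struct.pack("!H", len(query)) + query)
            return len(sock.recv(2)) == 2
    except OSError:
        return False


def failure_rate(probe, server: str, name: str, attempts: int = 20) -> float:
    """Fraction of attempts for which the given probe got no reply."""
    failures = sum(1 for _ in range(attempts) if not probe(server, name))
    return failures / attempts
```

Calling `failure_rate(probe_udp, server, name)` and `failure_rate(probe_tcp, server, name)` against the same resolver gives two numbers to compare. A noticeably higher UDP failure rate with a near-zero TCP rate is consistent with the pattern described above, although it does not by itself prove port exhaustion.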

The issue is caused by reaching the protocol-level port limit of a single Source NAT (SNAT) IP address. The following factors contribute:

Every Source NAT (SNAT) IP is limited to 64,512 usable ephemeral ports (the 65,536-port range minus the 1,024 well-known ports).
When thousands of VMs use a single internal DNS forwarder, the forwarder aggregates those queries.
Automated agents often generate synchronized bursts of DNS queries. If these bursts exceed the 64,512 available ports for that specific SNAT IP, the Edge must drop the overflow.
The datapath fastpath (DPDK) silently discards these overflow packets to protect system resources.
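The factors above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes that each outstanding DNS query occupies one SNAT port for the lifetime of its translation entry, so that concurrent demand is roughly the query rate multiplied by the entry lifetime. The VM count, per-VM query rate, and 30-second lifetime in the usage note are illustrative assumptions, not NSX defaults. The 64,512 figure is the per-IP limit stated in this article.

```python
USABLE_PORTS_PER_SNAT_IP = 64_512  # per-IP limit quoted in this article


def concurrent_translations(vm_count: int,
                            queries_per_vm_per_sec: int,
                            entry_lifetime_sec: int) -> int:
    """Estimate translations held at once: arrival rate x entry lifetime.

    Assumes every query consumes a distinct SNAT port until its
    translation entry expires (a simplification).
    """
    return vm_count * queries_per_vm_per_sec * entry_lifetime_sec


def exceeds_single_ip(demand: int,
                      ports_per_ip: int = USABLE_PORTS_PER_SNAT_IP) -> bool:
    """True if the estimated demand cannot fit behind one SNAT IP."""
    return demand > ports_per_ip
```

For example, under these assumptions 5,000 VMs each issuing one query per second with a 30-second entry lifetime give `concurrent_translations(5000, 1, 30)`, which is 150,000 and well above the single-IP limit.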

This is a condition that may occur in a VMware NSX environment.

Transition the SNAT configuration from a single IP address to an IP Pool.

In NSX Manager, navigate to Networking > NAT.
Edit the affected SNAT rule.
In the Translated IP field, enter an IP Range or CIDR block (e.g., a /29 subnet).
Each additional IP in the pool provides another 64,512 ephemeral ports, effectively multiplying the translation capacity of the Edge.
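To size the pool, the per-IP figure can be multiplied out. The following sketch assumes that every address in the entered CIDR block is usable for translation; this article does not state whether NSX excludes any addresses from the block, so treat the result as an upper bound. The `192.0.2.0/29` block is a documentation-range placeholder, and the function names are illustrative.

```python
import ipaddress

PORTS_PER_SNAT_IP = 64_512  # per-IP limit quoted in this article


def pool_capacity(cidr: str, ports_per_ip: int = PORTS_PER_SNAT_IP) -> int:
    """Upper-bound port capacity of a SNAT pool given as a CIDR block.

    Assumes all addresses in the block are used for translation.
    """
    network = ipaddress.ip_network(cidr, strict=True)
    return network.num_addresses * ports_per_ip


def ips_needed(demand: int, ports_per_ip: int = PORTS_PER_SNAT_IP) -> int:
    """Smallest number of SNAT IPs whose combined ports cover the demand."""
    return -(-demand // ports_per_ip)  # integer ceiling division
```

Under these assumptions, `pool_capacity("192.0.2.0/29")` covers 8 addresses, or 516,096 ports, and an estimated demand of 150,000 concurrent translations would need `ips_needed(150000)`, that is 3 addresses.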
