High Datapath CPU Utilization and SNAT Port Exhaustion on NSX Edge Nodes

Article ID: 414560

Products

VMware NSX

Issue/Introduction

  • An NSX Edge node is experiencing critical performance issues characterized by high datapath CPU utilization and an "SNAT Port Usage On Gateway Is High" alarm.
  • The alarm indicates that SNAT ports for a specific SNAT IP on a logical router have reached a high threshold (80%), leading to potential SNAT port exhaustion.

The alarm details are shown below:

Description
SNAT ports usage on logical router #######-####-####-####### for SNAT IP x.x.x.x has reached the high threshold value of 80%. 
New flows will not be SNATed when usage reaches the maximum limit. This condition may also lead to high CPU utilization for dataplane CPUs.

Recommended Action
Log in as the admin user on Edge node and invoke the NSX CLI command get firewall <LR_INT_UUID> connection state by using the right interface uuid and check various SNAT mappings for the SNAT IP x.x.x.x
Check traffic flows going through the gateway is not a denial-of-service attack or anomalous burst. If the traffic appears to be within the normal load but the alarm threshold is hit, consider adding more SNAT IP addresses to distribute the load or route new traffic to another Edge node.
  • When logged in to the NSX Edge node as the admin user, the command get dataplane cpu stats shows approximately 100% CPU usage on all cores.

CPU Usage
Core      : 0
Usage     : 100%

Core      : 1
Usage     : 100%

Core      : 2
Usage     : 100%

Core      : 3
Usage     : 100%

Core      : 4
Usage     : 100%

Core      : 5
Usage     : 100%
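
When this output is captured to a file for several Edge nodes, a small script can flag saturated cores. The following is a minimal sketch, assuming the output contains only the "Core" and "Usage" fields shown above (real output may include more fields, which this parser ignores). The function names and the 95% threshold are hypothetical and are not part of NSX.

```python
import re


def parse_core_usage(text):
    """Return {core_number: usage_percent} from 'Core : N' / 'Usage : P%' pairs."""
    usage = {}
    current_core = None
    for line in text.splitlines():
        core = re.match(r"\s*Core\s*:\s*(\d+)", line)
        if core:
            current_core = int(core.group(1))
            continue
        pct = re.match(r"\s*Usage\s*:\s*(\d+(?:\.\d+)?)%", line)
        if pct and current_core is not None:
            usage[current_core] = float(pct.group(1))
            current_core = None
    return usage


def saturated_cores(usage, threshold=95.0):
    """List cores at or above the threshold (95% is an arbitrary illustrative value)."""
    return sorted(core for core, pct in usage.items() if pct >= threshold)


# Excerpt in the same layout as the sample output above.
sample = """\
CPU Usage
Core      : 0
Usage     : 100%

Core      : 1
Usage     : 100%
"""

usage = parse_core_usage(sample)
print("Saturated cores:", saturated_cores(usage))
```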

 

Environment

VMware NSX

Cause

Heavy use of one or more SNAT rules, each configured with one or more SNAT IPs and ports, leads to SNAT port exhaustion for the corresponding SNAT IP, which in turn results in high Edge node CPU usage.

Resolution

  • On the source side, identify the initiator IPs that are hitting the SNAT rule and determine why they are opening so many connections.
  • For SNAT rules that are expected to handle a high number of simultaneous connections, configure (or update) the rule to use multiple IP addresses instead of just one.
  • As a temporary workaround, disable the specific NAT rule to prevent excessive SNAT rule evaluations and reduce the NSX Edge node CPU utilization.
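
To estimate how many SNAT IPs a rule might need, the peak connection count can be divided by the number of ports that should be in use per IP. The sketch below is illustrative arithmetic only: the ports-per-IP figure (64512, that is ports 1024 to 65535) is an assumption, not a documented NSX value, and the function name is hypothetical. The 80% target mirrors the alarm threshold, and 96618 is the active connection count from the sample output in Additional Information.

```python
import math


def snat_ips_needed(peak_connections, ports_per_ip, target_usage=0.8):
    """Estimate how many SNAT IPs keep per-IP port usage at or below target_usage."""
    if peak_connections < 0 or ports_per_ip <= 0 or not 0 < target_usage <= 1:
        raise ValueError("invalid sizing inputs")
    usable_ports = ports_per_ip * target_usage
    return max(1, math.ceil(peak_connections / usable_ports))


# Assumption: 64512 usable ports per SNAT IP (1024-65535); verify for your release.
ASSUMED_PORTS_PER_IP = 64512

print(snat_ips_needed(96618, ASSUMED_PORTS_PER_IP))
```

Treat the result as a lower bound for planning; actual port consumption depends on how the Edge allocates ports per protocol and destination.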

Additional Information

The alarm provides the SNAT IP, which can be used to identify the corresponding SNAT rule.

Log in to the Edge node as the admin user and run the commands below.

SNAT rule:
To find the SNAT rule in the connection table using the SNAT IP:

> get firewall <uuid> connection count | find <SNAT-IP>
(<uuid> is the interface ID of the T1/T0 SR uplink where SNAT is configured.)

Sample Output

<source-ip>:<port> -> --> <snat-ip>:<port> dir out protocol tcp state SYN_SENT:CLOSED fn 0:<SNAT-RULE-ID>
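
To see which initiator IPs account for most of the entries, the matching connection lines can be saved to a file and tallied. This is a rough sketch, assuming each line starts with the source IPv4 address and port and ends with "fn 0:<rule-id>" as in the sample above. The regular expression, function name, and example addresses are hypothetical, and the real CLI output may contain additional fields.

```python
import re
from collections import Counter

# First IPv4:port on the line is treated as the source; "fn N:<id>" carries the rule ID.
ENTRY = re.compile(
    r"(?P<src>\d{1,3}(?:\.\d{1,3}){3}):\d+\b.*?\bfn\s+\d+:(?P<rule>\d+)"
)


def top_sources(lines, limit=5):
    """Count connection entries per (source IP, SNAT rule ID), most frequent first."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[(match.group("src"), match.group("rule"))] += 1
    return counts.most_common(limit)


# Made-up lines in the layout of the sample output (documentation addresses only).
example_lines = [
    "10.0.0.5:40001 -> --> 192.0.2.10:5001 dir out protocol tcp state SYN_SENT:CLOSED fn 0:1234",
    "10.0.0.5:40002 -> --> 192.0.2.10:5002 dir out protocol tcp state SYN_SENT:CLOSED fn 0:1234",
    "10.0.0.9:51000 -> --> 192.0.2.10:5003 dir out protocol tcp state SYN_SENT:CLOSED fn 0:1234",
]

for (source, rule_id), count in top_sources(example_lines):
    print(source, rule_id, count)
```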

To check the details of the SNAT rule:

> get firewall <UUID> ruleset type snat

To check the hits on the SNAT rule:

> get firewall <UUID> ruleset type snat stats | more

    Rule ID             : <SNAT-RULE-ID>
    Input bytes         : 0
    Output bytes        : 617425488
    Input packets       : 0
    Output packets      : 10290892
    Evaluations         : 26757132
    Hits                : 5198351 -----------> Number of hits on the specific SNAT rule ID.
    Active connections  : 96618 -----------> Number of active connections on the specific SNAT rule ID.
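
For comparing several rules, the statistics block can be converted into a dictionary. The sketch below assumes the "Key : value" layout shown above and strips the explanatory arrows added in this article. The function name is hypothetical, the rule ID 1234 is a made-up stand-in for the placeholder, and the hits-to-evaluations ratio is only an illustrative derived value, not an NSX metric.

```python
import re


def parse_rule_stats(text):
    """Parse 'Key : value' lines into a dict, converting numeric values to int."""
    stats = {}
    for line in text.splitlines():
        # Drop trailing "-----> note" annotations like those in the sample output.
        line = re.split(r"\s-{2,}>", line, maxsplit=1)[0]
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        stats[key] = int(value) if value.isdigit() else value
    return stats


sample_stats = """\
    Rule ID             : 1234
    Evaluations         : 26757132
    Hits                : 5198351 -----------> Number of hits on the rule.
    Active connections  : 96618 -----------> Number of active connections.
"""

stats = parse_rule_stats(sample_stats)
hit_ratio = stats["Hits"] / stats["Evaluations"]
print(stats["Active connections"], round(hit_ratio, 3))
```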