Loss of connectivity between VMs using overlay segment along with edge mempool usage crossing threshold at greater than 85%

Article ID: 423578

Products

VMware NSX

Issue/Introduction

  • Virtual machines that communicate over overlay segments experience intermittent connectivity loss; the impact is seen primarily in east-west traffic.
  • NSX Manager shows multiple warnings for the active NSX Edge node in the Edge cluster that hosts the gateway connected to the impacted overlay segments.
  • Reported alarm: “The datapath mempool usage for pfstatepl3 on Edge node <UUID> has reached 89%, which is at or above the high threshold value of 85%”. However, the Edge VM memory usage is under 70% and CPU usage is also low.
  • The Edge node form factor is already set to the maximum.
  • One or more NAT rules are configured on the Tier-1/Tier-0 gateway.
  • The Edge node details show a few datapath services with high memory usage, specifically pfstatepl3.
  • The issue may be intermittent; traffic may flow properly for short periods during the day.
  • One or more logical routers may show a high connection count in the output of the following command, run from the root shell of the affected NSX Edge node:
    root#: edge-appctl -t /var/run/vmware/edge/dpd.ctl fw/lr/show total-stats
    [
        {
            "uuid": "<UUID>",
            "vrf": 1,
            "pvi": 3,
            "config-loaded": true,
            "active": true,
            "name": "SR-<Gateway-Name>",
            "type": "SERVICE_ROUTER_TIER0",
            "mp-router-id": "<UUID>",
            "sync-enabled": true,
            "connection-count": 41####4,                <=========== High number of connections
    
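To spot the busiest routers quickly, the command output can be filtered with a short script. The following is a minimal sketch only. It assumes the complete output was saved as a valid JSON array with the "name", "type" and "connection-count" keys shown above. The function name, the 100,000 cutoff and the sample values are hypothetical and are not NSX limits.

```python
import json

# Hypothetical cutoff for illustration; pick a value that suits your environment.
CONNECTION_THRESHOLD = 100_000


def busy_routers(stats_json, threshold=CONNECTION_THRESHOLD):
    """Return (name, type, connection-count) tuples at or above threshold, busiest first."""
    routers = json.loads(stats_json)
    flagged = [
        (r.get("name"), r.get("type"), r.get("connection-count", 0))
        for r in routers
        if r.get("connection-count", 0) >= threshold
    ]
    return sorted(flagged, key=lambda item: item[2], reverse=True)


# Made-up sample data, not taken from a real Edge node.
sample = json.dumps([
    {"name": "SR-example-a", "type": "SERVICE_ROUTER_TIER0", "connection-count": 250000},
    {"name": "SR-example-b", "type": "SERVICE_ROUTER_TIER1", "connection-count": 1200},
])

for name, router_type, count in busy_routers(sample):
    print(f"{name} ({router_type}): {count} connections")
```

In practice, the saved output of the command would be read from a file and passed to busy_routers in place of the sample string.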

Environment

VMware NSX

Cause

  • One or more VMs on the segment establish a large number of connections, and the datapath services run out of memory for tracking these connections.
  • A common scenario is a virtual machine performing excessive network scanning, which exhausts the connection limits.
  • Traditional NAT rules add further connection table entries, which contributes to the issue.
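The effect of a scanning VM on a fixed-size connection pool can be illustrated with a toy model. This is a simplified sketch and not how the NSX datapath is implemented. It assumes each distinct flow consumes one pool entry. The pool capacity of 1000 entries, the addresses and the port range are hypothetical and chosen only so that the arithmetic is easy to follow. The 85% threshold comes from the alarm text.

```python
from itertools import product

HIGH_THRESHOLD_PERCENT = 85   # high threshold quoted in the alarm
POOL_CAPACITY = 1000          # hypothetical pool size, not an NSX value


def scan_flows(src_ip, dst_ips, dst_ports):
    """Model a scan: every (destination, port) pair probed becomes one tracked flow."""
    return {(src_ip, dst, port, "tcp") for dst, port in product(dst_ips, dst_ports)}


def usage_percent(entries_in_use, capacity=POOL_CAPACITY):
    """Share of the pool consumed by tracked flows, in percent."""
    return 100.0 * entries_in_use / capacity


# One VM probing 10 hosts on 89 ports each creates 890 distinct flows.
dst_ips = [f"10.0.0.{i}" for i in range(1, 11)]
dst_ports = range(1, 90)
flows = scan_flows("10.0.1.5", dst_ips, dst_ports)

usage = usage_percent(len(flows))
alarm = usage >= HIGH_THRESHOLD_PERCENT
print(f"{len(flows)} flows, pool usage {usage:.0f}%, alarm raised: {alarm}")
```

A single source can fill most of the pool on its own, which is consistent with the alarm firing while the overall memory and CPU usage of the Edge VM stay low.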

Resolution

Workaround: Implement reflexive NAT on a gateway

Reflexive NAT does not consume connection entries and therefore does not deplete these resources.
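A reflexive NAT rule can be created in the NSX Manager UI or through the Policy API. The sketch below only builds the request path and body and does not send anything. It assumes the Policy API NAT rule object with the REFLEXIVE action and the source_network and translated_network fields, so verify the path and fields against the API reference for your NSX version. The helper name, gateway ID, rule ID and networks are hypothetical.

```python
import json


def reflexive_nat_rule(gateway_id, rule_id, source_network, translated_network,
                       tier="tier-1s"):
    """Build the (path, body) pair for a reflexive NAT rule request.

    Assumed, not confirmed by this article: the Policy API path layout and
    the field names below. Use tier="tier-0s" for a Tier-0 gateway.
    """
    path = f"/policy/api/v1/infra/{tier}/{gateway_id}/nat/USER/nat-rules/{rule_id}"
    body = {
        "action": "REFLEXIVE",
        "source_network": source_network,
        "translated_network": translated_network,
        "enabled": True,
    }
    return path, body


# Hypothetical identifiers and documentation-range addresses.
path, body = reflexive_nat_rule(
    "t1-example", "reflexive-1", "10.10.10.0/24", "192.0.2.0/24"
)
print("PATCH", path)
print(json.dumps(body, indent=2))
```

The printed path and body could then be sent with an HTTP client of your choice, authenticated against NSX Manager.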

In addition to implementing reflexive NAT on the gateway, review the best practices below, which may help mitigate the issue.

  • Review the firewall rules and traffic patterns associated with the identified Service Router UUID.
  • Since the Edge node form factor is already set to the maximum, consider the following:
    • Add an additional Edge node to the Edge cluster.
    • Configure gateways in Active/Active mode instead of Active/Standby, if the design supports it.
  • Deploy NSX appliances and Edge nodes on a dedicated ESXi cluster to avoid resource contention.

Additional Information

Similar issue: Edge Datapath mempool usage high alarm