Troubleshooting TEP IP Pool Exhaustion and Duplicate Assignments


Article ID: 423142


Products

VMware NSX

Issue/Introduction

You may observe one of the following symptoms related to IP Pool management:

Tunnels Down (BFD Session Failure):

  • Overlay connectivity between Transport Nodes fails shortly after or during new host or transport node provisioning.
  • Running the command nsxdp-cli bfd sessions list on the host or Edge shows sessions as DOWN.
  • IP Pools are in use for TEPs.

Unable to Assign IP (Installation Failure):

  • New Transport Node (Host or Edge) installation or configuration fails.
  • You receive error messages indicating that the IP Pool is full or no free IP addresses are available, even though you have recently deleted nodes and believe there should be capacity.
  • The IP Pool usage count in the UI remains high despite node deletions.

 

Environment

VMware NSX-T Data Center
VMware NSX

Cause

In VMware NSX-T, IP Pools act as a centralised tracker for assigning Tunnel Endpoint (TEP) IP addresses to Transport Nodes (ESXi Hosts and Edge Nodes). When a node is configured, the NSX Manager allocates an IP from the pool, which ensures that every node has a unique IP. Issues arise when this mapping falls out of sync with the actual state of the network: either the pool records an IP as in use when it is not (Exhaustion), or it records an IP as free when it is actually in use (Duplication).

NSX IP Pools function as a passive database for IP tracking; they do not actively monitor network traffic or IP reachability. State changes rely entirely on successful triggers from provisioning workflows (such as node addition or removal). If a workflow fails or is bypassed before the allocation or release step completes, the IP Pool database is not updated. The system does not automatically reconcile these discrepancies, leaving the IP in an incorrect state.
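
The behaviour described above can be illustrated with a toy model. This is only a sketch of a passive tracker, not NSX code: the class ToyIpPool, its methods, and the addresses are hypothetical and exist purely to show how a skipped release step leaves a stale entry behind.

```python
class ToyIpPool:
    """Toy stand-in for a passive IP tracker (hypothetical, not NSX code)."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.allocated = set()

    def allocate(self):
        if not self.free:
            raise RuntimeError("pool exhausted")
        ip = self.free.pop(0)
        self.allocated.add(ip)
        return ip

    def release(self, ip):
        # Only the release step of a workflow ever returns an IP to the pool.
        if ip in self.allocated:
            self.allocated.remove(ip)
            self.free.append(ip)


pool = ToyIpPool(["192.168.141.11", "192.168.141.12"])
in_use_on_network = set()

# Normal provisioning: the workflow allocates, then the node configures the IP.
ip = pool.allocate()
in_use_on_network.add(ip)

# The node is removed, but the release step never runs
# (the workflow failed or was bypassed), so pool.release(ip) is not called.
in_use_on_network.discard(ip)

# The tracker has no view of the network, so nothing corrects the mismatch.
stale = pool.allocated - in_use_on_network      # tracked but unused
untracked = in_use_on_network - pool.allocated  # used but not tracked
print("stale:", stale, "untracked:", untracked)
```

In this model the stale entry stays in place until something explicitly calls release, which mirrors the manual re-alignment described in the Resolution section.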

 

Resolution

Important: Before proceeding with the steps below, review the known issues listed in the Additional Information section. If the underlying cause (such as a specific software issue or incorrect workflow) is not addressed, the pool synchronization issue may recur after you fix the immediate symptom. Conversely, if the issue was caused by a historical problem that has since been fixed, symptoms can persist until the IP Pools are manually re-aligned using the steps below.

IP Allocation Audit

Perform an IP Allocation Audit to verify consistency between the IP pool and the actual IP usage on the network. This comparison will confirm which resolution scenario applies to your environment.

  1. Retrieve Pool Allocations
    • View IP pool allocations in the NSX GUI (see KB: NSX-T UI / API doesn't show information about IP pool allocations).
    • Alternatively, execute the following API call to list all IPs currently allocated in the pool according to the NSX Manager: GET https://<nsx-mgr-ip>/api/v1/pools/ip-pools/<Pool-ID>/allocations
      • If you are unsure of the pool ID, list all pools with GET https://<nsx-mgr-ip>/api/v1/pools/ip-pools/
      • Note the list of IP addresses and their allocation_id.
        Sample Output:
        {
            "results": [
                {
                    "allocation_id": "192.168.141.11",
                    "_protection": "NOT_PROTECTED"
                }
            ],
            "result_count": 1
        }

  2. Retrieve Node TEPs: Identify the active TEP IP addresses actually configured on your Transport Nodes.
        • Via NSX GUI: Navigate to System > Fabric > Hosts (for ESXi/KVM) and System > Fabric > Nodes (for Edges) and note the TEP assignments of each node.
        • Via CLI (ESXi): Run esxcfg-vmknic -l and note the IP addresses of the vxlan-configured VMkernel interfaces (VMKs).

  3. Compare and Identify Discrepancies: Compare the list from Step 1 (Management Plane) with the list from Step 2 (Physical) to determine your scenario:
    • Scenario A: An IP exists in the API list (Step 1) but is NOT configured on any Transport Node (Step 2).
      This is a "Zombie Allocation" (IP Pool Exhaustion). Proceed to Resolution Scenario A.

    • Scenario B: An IP exists on a Transport Node (Step 2) but is missing from the API list (Step 1).
      This is a "Ghost Allocation" (the Management Plane thinks the IP is free, but it is physically taken). Proceed to Resolution Scenario B.

    • Scenario C: An IP appears in neither the API list nor the TEP list, but tunnels went down when a host was deployed with it.
      This is a physical allocation outside NSX: the IP has likely been assigned manually to something else in the environment rather than through IP Pools, so NSX has no awareness of its use. Proceed to Resolution Scenario C.
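
For larger environments, the comparison in Step 3 can be scripted. The following is a minimal sketch, assuming the allocations response has the shape shown in the sample output and that the node TEP list has already been collected by hand; the function names and the node TEP address are hypothetical.

```python
import ipaddress
import json


def allocated_ips(allocations_response):
    """Collect allocation_id values from the body of the allocations GET call."""
    return {item["allocation_id"] for item in allocations_response.get("results", [])}


def audit(pool_ips, node_tep_ips):
    """Split the difference between pool records and node TEPs into scenarios."""
    pool_ips, node_tep_ips = set(pool_ips), set(node_tep_ips)
    by_address = ipaddress.ip_address  # sort numerically rather than as text
    return {
        "scenario_a_stale": sorted(pool_ips - node_tep_ips, key=by_address),
        "scenario_b_untracked": sorted(node_tep_ips - pool_ips, key=by_address),
    }


# Response body copied from the sample output in Step 1.
sample = json.loads("""
{
    "results": [
        {"allocation_id": "192.168.141.11", "_protection": "NOT_PROTECTED"}
    ],
    "result_count": 1
}
""")

# Hypothetical TEP list gathered in Step 2 (GUI or esxcfg-vmknic -l).
node_teps = ["192.168.141.12"]

report = audit(allocated_ips(sample), node_teps)
print(report)
```

Scenario C cannot be detected this way, because the conflicting address appears in neither input list.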

Resolution

Apply the resolution matching the scenario identified during the Audit.

Scenario A: IP Pool Exhaustion (IPs Not Releasing)

You identified IPs in the API allocation list that do not exist on any physical node (Scenario A).

  1. Identify Stale Allocations: Using the API output from Step 1 of the audit, identify the allocation_id (IP address) that corresponds to the deleted or missing node.
  2. Manually Release IPs: Use the following API call to release the stale IP address from the IP Pool: POST https://<nsx-mgr-ip>/api/v1/pools/ip-pools/<Pool-ID>?action=RELEASE
    Request Body:

    {
       "allocation_id": "<ip-to-release>"
    }

  3. Verify Release: Re-run the GET API call from Step 1 of the audit and confirm that the allocation is no longer present in the list.
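
As an illustration of how the release call in Step 2 is put together, the sketch below builds the request object without sending it. The manager hostname, pool ID, and function name are hypothetical placeholders, and authentication and certificate handling are deliberately omitted because they depend on your environment.

```python
import json
import urllib.request


def build_release_request(manager, pool_id, ip):
    """Build (but do not send) the POST that releases one IP from a pool."""
    url = f"https://{manager}/api/v1/pools/ip-pools/{pool_id}?action=RELEASE"
    body = json.dumps({"allocation_id": ip}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values; replace with your manager address, pool ID and stale IP.
req = build_release_request("nsx-mgr.example.com", "example-pool-id", "192.168.141.11")

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# Sending is left out on purpose. With credentials and TLS settings added,
# the request could be submitted with urllib.request.urlopen(req).
```

Building the request separately from sending it makes it easy to review the exact URL and body for each stale IP before anything is changed on the NSX Manager.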

Scenario B: IP Pool / TEP duplication 

You identified IPs in use on nodes that are not marked as allocated in the API list (Scenario B).

  1. Manually Allocate IPs: Mark each IP as allocated in the pool to prevent NSX from handing it out to another node by running POST https://<nsx-mgr-ip>/api/v1/pools/ip-pools/<Pool-ID>?action=ALLOCATE
    Request Body:

    {
       "allocation_id": "<ip-physically-in-use>"
    }

  2. Verify Allocation: Re-run the GET API call from Step 1 of the audit and confirm that the allocation has been applied. This ensures the IP is now correctly tracked as "in use" by the system.
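
When several IPs need to be re-registered, the two steps above can be sketched as follows. This assumes one ALLOCATE request per IP, since the request body shown carries a single allocation_id; the function names and addresses are hypothetical, and the "after" response is a made-up example rather than real output.

```python
import json


def allocate_bodies(ips):
    """One request body per IP, matching the body shape shown in Step 1."""
    return [json.dumps({"allocation_id": ip}) for ip in ips]


def still_untracked(allocations_response, expected_ips):
    """Return the expected IPs that the allocations GET response does not list."""
    tracked = {item["allocation_id"] for item in allocations_response.get("results", [])}
    return sorted(set(expected_ips) - tracked)


# Hypothetical IPs found on nodes but missing from the pool.
in_use = ["192.168.141.21", "192.168.141.22"]
bodies = allocate_bodies(in_use)

# Made-up GET response after only the first ALLOCATE call succeeded.
after = {"results": [{"allocation_id": "192.168.141.21"}], "result_count": 1}
remaining = still_untracked(after, in_use)

print(bodies)
print("still untracked:", remaining)
```

An empty result from still_untracked indicates that every IP in the list is now recorded in the pool.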
 

Scenario C: IP Pool Aligned but duplication present on deployment

  1. If the IP pool audit showed alignment but you still see symptoms that point to IP duplication when hosts are deployed, a device outside NSX has likely been assigned the same IP. Because the source of the duplication is external to NSX, packet captures and traces will likely be needed to identify it.

 

Additional Information

Potential known issues that can cause this desynchronization include:

KB 322584: TEP IP Addresses Not Released After FORCE Deleting Host/Edge Transport Node
KB 390030: Resolving Duplicate IP Assignment Issues When adding a new Host to an NSX Prepared Cluster
KB 322041: Duplicated TEP IP assignment in the Edge node
KB 380147: BFD tunnel down between transport nodes due to NSX T assigned duplicate TEP IP



Admin Guide - Add an NSX IP Address Pool

Broadcom Developer Portal: NSX-T Manager API - IP Pools