Troubleshooting TEP IP Pool Exhaustion and Duplicate Assignments

Products

VMware NSX

Issue/Introduction

You may observe one of the following symptoms related to IP Pool management:

Tunnels Down (BFD Session Failure):

Overlay connectivity between Transport Nodes fails shortly after or during new host or transport node provisioning.
Running the command nsxdp-cli bfd sessions list on the host or Edge shows sessions as DOWN.
IP Pools are in use for TEPs.

Unable to Assign IP (Installation Failure):

New Transport Node (Host or Edge) installation or configuration fails.
You receive error messages indicating that the IP Pool is full or no free IP addresses are available, even though you have recently deleted nodes and believe there should be capacity.
The IP Pool usage count in the UI remains high despite node deletions.

Environment

VMware NSX-T Data Center
VMware NSX

Cause

In VMware NSX-T, IP Pools act as a centralised tracker for assigning Tunnel Endpoint (TEP) IP addresses to Transport Nodes (ESXi Hosts and Edge Nodes). When a node is configured, the NSX Manager allocates an IP from the pool. This ensures that every node has a unique IP. Issues arise when this mapping falls out of sync with the reality of the network: either NSX records an IP as in use when it is not (Exhaustion), or it records an IP as free when it is actually in use (Duplication).

NSX IP Pools function as a passive database for IP tracking; they do not actively monitor network traffic or IP reachability. State changes rely entirely on successful triggers from provisioning workflows (such as node addition or removal). If a workflow fails or is bypassed before the allocation or release step completes, the IP Pool database is not updated. The system does not automatically reconcile these discrepancies, leaving the IP in an incorrect state.
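The behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch and not NSX code: the ToyIpPool class and its method names are hypothetical, and they only mimic a table that changes when a workflow explicitly calls it.

```python
class ToyIpPool:
    """Illustrative stand-in for a passive allocation table (not NSX code)."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.allocated = set()

    def allocate(self):
        # Only the table is consulted; nothing checks the network.
        if not self.free:
            raise RuntimeError("no free IP addresses in pool")
        ip = self.free.pop(0)
        self.allocated.add(ip)
        return ip

    def release(self, ip):
        # Runs only if the removal workflow reaches this step.
        if ip in self.allocated:
            self.allocated.remove(ip)
            self.free.append(ip)


pool = ToyIpPool(["192.168.141.11"])
tep_ip = pool.allocate()

# The node is now removed outside the normal workflow, so release() is never
# called. The table still lists the address as allocated, and the next
# allocate() call fails even though nothing on the network uses the address.
```

Calling pool.release(tep_ip) by hand restores the entry, which is the same idea as the manual re-alignment described in the Resolution section.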

Resolution

Important: Before proceeding with the steps below, review the known issues listed in the Additional Information section. If the underlying cause (such as a specific software issue or incorrect workflow) is not addressed, the pool synchronization issue may recur after you fix the immediate symptom. Conversely, if the issue is historical, you may still see symptoms even though the original cause has been fixed, until the IP Pools are re-aligned manually using the steps below.

IP Allocation Audit

Perform an IP Allocation Audit to verify consistency between the IP pool and the actual IP usage on the network. This comparison will confirm which resolution scenario applies to your environment.

1. Retrieve Pool Allocations
- View the IP pool allocations in the NSX GUI (see KB: NSX-T UI / API doesn't show information about IP pool allocations).
- Alternatively, execute the following API call to list all IPs currently allocated in the pool according to the NSX Manager: GET https://<nsx-mgr-ip>/api/v1/pools/ip-pools/<Pool-ID>/allocations
  - If you are unsure of the pool ID, list all pools with GET https://<nsx-mgr-ip>/api/v1/pools/ip-pools/
  - Note the list of IP addresses (each is returned as an allocation_id).
    Sample Output:
    {
      "results": [
        {
          "allocation_id": "192.168.141.11",
          "_protection": "NOT_PROTECTED"
        }
      ],
      "result_count": 1
    }
2. Retrieve Node TEPs
Identify the active TEP IP addresses actually configured on your Transport Nodes.
- Via NSX GUI: Navigate to System > Fabric > Hosts (for ESXi/KVM) and System > Fabric > Nodes (for Edges) and note the TEP assignments of each node.
- Via CLI (ESXi): Run esxcfg-vmknic -l and note the IPs of the vxlan-configured VMKs.
3. Compare and Identify Discrepancies
Compare the list from Step 1 (Management Plane) with the list from Step 2 (Physical) to determine your scenario:

- Scenario A: An IP exists in the API list (Step 1) but is NOT configured on any Transport Node (Step 2).
  This is a "Zombie Allocation" (IP Pool Exhaustion). Proceed to Scenario A under Resolution.
- Scenario B: An IP exists on a Transport Node (Step 2) but is missing from the API list (Step 1).
  This is a "Ghost Allocation" (the Management Plane thinks the IP is free, but it is physically taken). Proceed to Scenario B under Resolution.
- Scenario C: An IP does not exist in the API list or the TEP list, but when a host was deployed with it, you hit tunnel down issues.
  This is a physical allocation: the IP has likely been assigned manually to something else in the environment, not via IP Pools, so NSX has no awareness of its use. Proceed to Scenario C under Resolution.
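The comparison in the audit amounts to two set differences. The following is a minimal sketch, assuming the allocations response has been saved as JSON and the node TEP IPs have been collected by hand into a list. The function names and the node_teps values are hypothetical, and the sketch assumes all addresses belong to one IP family.

```python
import ipaddress


def allocation_ids(response):
    """Extract the allocation_id values from a GET .../allocations response body."""
    return [entry["allocation_id"] for entry in response.get("results", [])]


def classify_teps(pool_allocations, node_teps):
    """Compare the Manager's view (Step 1) with the TEP IPs found on nodes (Step 2)."""
    allocated = {ipaddress.ip_address(ip) for ip in pool_allocations}
    in_use = {ipaddress.ip_address(ip) for ip in node_teps}
    return {
        # Allocated in the pool but not configured on any node (Scenario A).
        "scenario_a": [str(ip) for ip in sorted(allocated - in_use)],
        # Configured on a node but not allocated in the pool (Scenario B).
        "scenario_b": [str(ip) for ip in sorted(in_use - allocated)],
    }


# Sample response from the audit plus a hypothetical list of node TEP IPs.
response = {
    "results": [{"allocation_id": "192.168.141.11", "_protection": "NOT_PROTECTED"}],
    "result_count": 1,
}
node_teps = ["192.168.141.12"]
report = classify_teps(allocation_ids(response), node_teps)
print(report)
```

If both lists come back empty and tunnels still fail on deployment, the audit points to Scenario C, which this comparison cannot detect because the conflicting device is unknown to NSX.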

Resolution

Apply the resolution matching the scenario identified during the Audit.

Scenario A: IP Pool Exhaustion (IPs Not Releasing)

You identified IPs in the API allocation list that do not exist on any physical node (Scenario A).

Identify Stale Allocations: Using the API output from the IP Allocation Audit, identify the allocation_id (IP address) that corresponds to the deleted or missing node.
Manually Release IPs: Use the following API call to release the stale IP address from the IP Pool: POST https://<nsx-mgr-ip>/api/v1/pools/ip-pools/<Pool-ID>?action=RELEASE
Request Body:

{
"allocation_id": "<ip-to-release>"
}
Verify Release by re-running the GET API call from Step 1 of the audit and confirming that the allocation is no longer present in the list.
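The release call can be scripted when several stale IPs need clearing. The following is a minimal sketch using only the Python standard library. It assumes HTTP Basic authentication is enabled on the NSX Manager, and the hostname, pool ID, and credentials shown are placeholders rather than real values. The function only builds the request so that it can be inspected before anything is sent.

```python
import base64
import json
import urllib.request


def build_release_request(manager, pool_id, ip, username, password):
    """Build (but do not send) the POST that releases one IP from a pool."""
    url = f"https://{manager}/api/v1/pools/ip-pools/{pool_id}?action=RELEASE"
    body = json.dumps({"allocation_id": ip}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: HTTP Basic authentication is enabled on the Manager.
            "Authorization": f"Basic {token}",
        },
    )


# Placeholder values for illustration only.
request = build_release_request(
    "nsx-mgr.example.com", "pool-1234", "192.168.141.11", "admin", "changeme"
)
print(request.get_method(), request.full_url)
```

Once the URL and body look correct, the request can be sent with urllib.request.urlopen(request). Certificate verification settings depend on your environment and are left out of this sketch.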

Scenario B: IP Pool / TEP duplication

You identified IPs in use on nodes that are not marked as allocated in the API list (Scenario B).

Manually Allocate IPs: Allocate each in-use IP in the pool to prevent NSX from handing it out to another node, by running POST https://<nsx-mgr-ip>/api/v1/pools/ip-pools/<Pool-ID>?action=ALLOCATE
Request Body:

{
"allocation_id": "<ip-physically-in-use>"
}
Verify Allocation by re-running the GET API call from Step 1 of the audit and confirming that the allocation has been applied. This ensures the IP is now correctly tracked as "in use" by the system.
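When several node IPs are untracked, the request bodies and the final check can be prepared in one pass. The following is a minimal sketch. The helper names are hypothetical, and the allocations response is assumed to have the same shape as the sample output shown in the audit.

```python
import json


def allocate_payloads(ips):
    """Return one ALLOCATE request body per IP that is in use on a node."""
    return [json.dumps({"allocation_id": ip}) for ip in ips]


def missing_after_allocate(ips, allocations_response):
    """Return the IPs that still do not appear in the GET .../allocations output."""
    tracked = {entry["allocation_id"] for entry in allocations_response.get("results", [])}
    return [ip for ip in ips if ip not in tracked]


# Hypothetical IPs found on nodes but absent from the pool.
in_use_ips = ["192.168.141.12", "192.168.141.13"]
payloads = allocate_payloads(in_use_ips)

# Hypothetical GET output captured after only the first ALLOCATE call was sent.
after = {"results": [{"allocation_id": "192.168.141.12"}], "result_count": 1}
print(missing_after_allocate(in_use_ips, after))
```

An empty list from missing_after_allocate means every in-use IP is now tracked by the pool.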

Scenario C: IP Pool Aligned but duplication present on deployment

If the IP pool audit showed alignment but you are still seeing symptoms that point to duplicate IPs when hosts are deployed, it is likely that a physical device NSX is not aware of has been assigned the IP. Packet captures and traces will likely be needed to identify the source of the duplication, as it is external to NSX.