NSX-T Critical alarm indicates: LB Edge Capacity In Use High

Products

VMware NSX

Issue/Introduction

In an NSX-T deployment where Edge nodes (Small, Medium, Large, XLarge, or Bare Metal) are configured with Load Balancer services, a warning alarm may be triggered when the load balancer capacity usage becomes high.

Feature: LB Edge Capacity In Use High.
Description: The usage of the load balancer service in Edge node {entity_id} is high. The usage is {lb_capacity_usage}%. The threshold is {system_usage_threshold}%. Default threshold is 80%.

Purpose: This alarm is generated to proactively alert administrators when the load balancer capacity usage on an Edge node reaches or exceeds 80% of the node’s total load balancer capacity. It serves as an early warning mechanism to help prevent resource exhaustion and potential deployment failures. When usage reaches 80%, it indicates that the Edge node is approaching its maximum load balancer capacity, leaving limited headroom for additional load balancer instances or future scaling requirements.
Impact: If the load balancer capacity usage becomes too high, there may be insufficient resources available to deploy new load balancer services or to scale existing ones. This can result in failures when attempting to create additional load balancer instances, potentially affecting application availability and business operations. The alarm threshold is intentionally set at 80% to provide administrators with sufficient time to plan and implement remediation actions before the hard capacity limit is reached.

Note: {entity_id} is the UUID of the Edge node where the load balancer is deployed, and {lb_capacity_usage} is the current usage percentage. Both values are populated dynamically when the alarm is raised.

Environment

VMware NSX-T Data Center

VMware NSX

Cause

The alarm is triggered when the load balancer capacity usage on an Edge node reaches or exceeds the configured threshold of 80%. This occurs due to the capacity-based allocation model used by NSX-T for load balancer resource management.

Understanding Load Balancer Capacity Allocation:
Each Edge node has a finite load balancer capacity determined by its size (form factor), and each load balancer service consumes capacity according to its own size. NSX-T measures this capacity in internal units (called "credits" in the API): each Edge node size provides a fixed number of units, and each load balancer size requires a fixed number of units. The usage percentage is calculated as: (current capacity used / total capacity available) × 100%.

Edge node size and load balancer capacity:

Edge node size	Credits Available	Load balancer size	Credits required per LB size
Small	1	Small	1
Medium	10	Medium	10
Large	40	Large	40
XLarge	80	XLarge	80
BareMetal	750

Important Consideration - Same-Size Node and Load Balancer:
When a load balancer instance is deployed on an Edge node of the same size (e.g., a Medium load balancer on a Medium Edge node), the usage percentage will be 100%. This is because the load balancer consumes all available capacity on that node.

For example:

A Medium Edge node has 10 capacity units available.
A Medium load balancer requires 10 capacity units.
Usage percentage = (10 / 10) × 100% = 100%
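
The same calculation can be sketched for any mix of load balancers on a node. The following Python snippet is a minimal illustration that assumes the credit values from the table above. The names NODE_CREDITS, LB_CREDITS, ALARM_THRESHOLD, and usage_percentage are hypothetical and not part of any NSX SDK, and the sketch models credits only, not the per-node instance limits in the Configuration Maximums.

```python
# Capacity units (credits) per size, copied from the table above.
NODE_CREDITS = {"SMALL": 1, "MEDIUM": 10, "LARGE": 40, "XLARGE": 80, "BAREMETAL": 750}
LB_CREDITS = {"SMALL": 1, "MEDIUM": 10, "LARGE": 40, "XLARGE": 80}
ALARM_THRESHOLD = 80.0  # default alarm threshold, in percent


def usage_percentage(node_size, lb_sizes):
    """Return (credits used / credits available) x 100 for one Edge node."""
    used = sum(LB_CREDITS[size] for size in lb_sizes)
    return used / NODE_CREDITS[node_size] * 100.0


if __name__ == "__main__":
    examples = [("MEDIUM", ["MEDIUM"]), ("LARGE", ["MEDIUM", "MEDIUM"])]
    for node, lbs in examples:
        pct = usage_percentage(node, lbs)
        state = "alarm" if pct >= ALARM_THRESHOLD else "ok"
        print(f"{node} node hosting {lbs}: {pct:.0f}% ({state})")
```

With these values, two Medium load balancers on a Large Edge node use 20 of 40 units (50%), which stays below the default threshold.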

This configuration, while technically valid, leaves no capacity for additional load balancer instances and triggers the alarm, because 100% usage is already above the 80% threshold. It is recommended that Edge nodes be sized larger than the load balancer instances they host to provide headroom for future scaling and to avoid capacity constraints.
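
As a rough planning aid, the sizing recommendation can be expressed as a search for the smallest Edge node size that keeps usage below the threshold. This is only a sketch based on the credit values in the table above: the function name smallest_node_with_headroom is hypothetical, and the result still has to be checked against the size compatibility rules in the Configuration Maximums for your NSX version.

```python
# Capacity units (credits) per size, copied from the table above.
NODE_CREDITS = {"SMALL": 1, "MEDIUM": 10, "LARGE": 40, "XLARGE": 80, "BAREMETAL": 750}
LB_CREDITS = {"SMALL": 1, "MEDIUM": 10, "LARGE": 40, "XLARGE": 80}


def smallest_node_with_headroom(lb_sizes, threshold=80.0):
    """Return the smallest Edge node size whose usage stays below threshold.

    Returns None when no listed node size has enough capacity.
    """
    used = sum(LB_CREDITS[size] for size in lb_sizes)
    # Walk the node sizes from smallest to largest capacity.
    for node_size, capacity in sorted(NODE_CREDITS.items(), key=lambda kv: kv[1]):
        if used / capacity * 100.0 < threshold:
            return node_size
    return None


if __name__ == "__main__":
    for lbs in (["SMALL"], ["MEDIUM"], ["LARGE"], ["XLARGE"]):
        print(lbs, "->", smallest_node_with_headroom(lbs))
```

For a single Medium load balancer, the sketch selects a Large Edge node, where usage is 10 of 40 units (25%).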

An Edge node can host only load balancer instances of specific sizes, depending on the relationship between the load balancer size and the Edge node size. For detailed information on the number of load balancers allowed per Edge node and on size compatibility, refer to the VMware Configuration Maximums documentation for your specific NSX version.

Use the following API to check the load balancer usage per NSX-T Edge node:
GET https://<policy-mgr>/policy/api/v1/infra/lb-node-usage?node_path=<node-path>

The node_path parameter is required and should be in the format:
/infra/sites/default/enforcement-points/default/edge-clusters/<edge-cluster-id>/edge-nodes/<edge-node-id>
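
The request can also be scripted. The sketch below uses only the Python standard library to build the URL from the format above and to issue the GET. The helper names and the use of HTTP Basic authentication are assumptions for illustration; adapt authentication and certificate handling to your environment.

```python
import base64
import json
import urllib.parse
import urllib.request


def lb_node_usage_url(manager, edge_cluster_id, edge_node_id):
    """Build the lb-node-usage URL for one Edge node (hypothetical helper)."""
    node_path = (
        "/infra/sites/default/enforcement-points/default"
        f"/edge-clusters/{edge_cluster_id}/edge-nodes/{edge_node_id}"
    )
    # Keep "/" unescaped so the query string matches the documented format.
    query = urllib.parse.urlencode({"node_path": node_path}, safe="/")
    return f"https://{manager}/policy/api/v1/infra/lb-node-usage?{query}"


def fetch_lb_node_usage(url, username, password, context=None):
    """GET the usage document as a dict.

    Assumption: the manager accepts HTTP Basic authentication. Pass an
    ssl.SSLContext as `context` if a custom CA bundle is required.
    """
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.load(response)
```

Calling lb_node_usage_url("<policy-mgr>", "<edge-cluster-id>", "<edge-node-id>") produces the URL shown above, which can then be passed to fetch_lb_node_usage.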

Sample API response:

{
  "form_factor": "MEDIUM_VIRTUAL_MACHINE",
  "edge_cluster_path": "/infra/sites/default/enforcement-points/default/edge-clusters/########-####-####-############",
  "current_load_balancer_credits": 10,
  "load_balancer_credit_capacity": 10,
  "usage_percentage": 100.0,
  "severity": "RED",
  "current_pool_member_count": 0,
  "current_virtual_server_count": 1,
  "current_pool_count": 0,
  "pool_member_capacity": 20000,
  "current_small_load_balancer_count": 0,
  "current_medium_load_balancer_count": 1,
  "current_large_load_balancer_count": 0,
  "current_xlarge_load_balancer_count": 0,
  "remaining_small_load_balancer_count": 0,
  "remaining_medium_load_balancer_count": 0,
  "remaining_large_load_balancer_count": 0,
  "remaining_xlarge_load_balancer_count": 0,
  "resource_type": "LBEdgeNodeUsage",
  "node_path": "/infra/sites/default/enforcement-points/default/edge-clusters/########-####-####-############/edge-nodes/0"
}

Key fields in the response:

current_load_balancer_credits: The number of capacity units currently consumed by load balancers on the node. (The API uses "credits" as its internal term for capacity allocation units.)
load_balancer_credit_capacity: The maximum number of capacity units available for load balancers on the node.
usage_percentage: The percentage of load balancer capacity in use (the alarm triggers at ≥80%).
severity: The severity level (GREEN, ORANGE, or RED) based on the usage percentage.
remaining_*_load_balancer_count: The number of additional load balancer instances of each size that can still be deployed on the node.
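
To show how these fields fit together, the following sketch condenses a response into the values most relevant to the alarm. It assumes the response has already been parsed into a Python dict with the field names from the sample above; summarize_usage is a hypothetical helper, not an NSX API.

```python
def summarize_usage(usage, threshold=80.0):
    """Condense an lb-node-usage response into the alarm-relevant values."""
    remaining = {
        size: usage.get(f"remaining_{size}_load_balancer_count", 0)
        for size in ("small", "medium", "large", "xlarge")
    }
    return {
        "node_path": usage["node_path"],
        "credits_used": usage["current_load_balancer_credits"],
        "credits_total": usage["load_balancer_credit_capacity"],
        "usage_percentage": usage["usage_percentage"],
        "severity": usage["severity"],
        # True when the node is at or above the alarm threshold.
        "at_or_above_threshold": usage["usage_percentage"] >= threshold,
        "remaining": remaining,
    }


if __name__ == "__main__":
    # Subset of the sample response above (IDs left as placeholders).
    sample = {
        "current_load_balancer_credits": 10,
        "load_balancer_credit_capacity": 10,
        "usage_percentage": 100.0,
        "severity": "RED",
        "remaining_small_load_balancer_count": 0,
        "remaining_medium_load_balancer_count": 0,
        "remaining_large_load_balancer_count": 0,
        "remaining_xlarge_load_balancer_count": 0,
        "node_path": "<node-path>",
    }
    print(summarize_usage(sample))
```

For the sample response, the summary reports 10 of 10 credits used, the threshold reached, and no remaining capacity for any load balancer size.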

Note: The documented configuration maximums may differ depending on NSX versions. Please refer to the VMware Configuration Maximums documentation for your specific version.

Resolution

Recommended Actions:

Assess Current Capacity Usage:
- Use the API endpoint GET /policy/api/v1/infra/lb-node-usage?node_path=<node-path> to retrieve detailed capacity information for the affected Edge node.
- Review the usage_percentage, current capacity consumption, and remaining capacity values.
- Identify which load balancer instances are deployed on the node.

For Multiple Load Balancer Instances:
If multiple load balancer instances have been configured on this Edge node:
- Deploy a new Edge node in the same Edge cluster.
- Relocate some load balancer instances to the newly deployed Edge node to distribute the load.
- For Tier-1 gateway load balancers, use the LB Scale Runbook by invoking /infra/sha/runbook-invocations/<invoke_id> with a unique invoke_id (e.g., lb_scale_4b17dca6) to get a detailed relocation action plan.
  Alternatively, relocate load balancers manually by invoking PATCH /policy/api/v1/infra/tier-1s/<tier-1-id>/locale-services/<locale-services-id> with the target Edge node ID in the request body.
- For VPC load balancers in version 9.1, relocate load balancers manually by invoking POST /policy/api/v1/infra/gateways/action/reallocate with the VPC path and the target Edge node or VNA node path. In earlier versions, manual relocation is not supported; consider relocating the load balancers to another Edge or VNA cluster.

For Single Load Balancer Instance on Same-Size Node:
If only a single load balancer instance has been configured on an Edge node of the same size (for example, a Medium load balancer on a Medium Edge node):
- This configuration results in 100% capacity usage, leaving no room for additional instances.
- Deploy a new Edge node of a larger size in the Edge cluster.
- Move the load balancer instance to the newly deployed larger Edge node.
- This provides additional capacity headroom for future load balancer deployments and prevents capacity constraints.
- Best Practice: Always size Edge nodes larger than the load balancer instances they host to ensure adequate capacity for scaling.

Verify Resolution:
- After relocating load balancers or deploying new nodes, verify the alarm clears.
- Re-check the usage percentage using the API to confirm it is below the 80% threshold.
- Monitor the Edge cluster to ensure balanced load balancer distribution across nodes.
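
The verification steps above can be sketched as a small polling loop. This is an illustration under stated assumptions: fetch_usage is a caller-supplied function (hypothetical) that returns the parsed lb-node-usage response for a node path, and the retry count and delay are arbitrary example values.

```python
import time


def wait_until_below_threshold(fetch_usage, node_paths, threshold=80.0,
                               attempts=5, delay_seconds=60, sleep=time.sleep):
    """Poll until every node reports usage below threshold.

    Returns the node paths that are still at or above the threshold after
    the last attempt (an empty list means every node has headroom again).
    """
    pending = list(node_paths)
    for attempt in range(attempts):
        # Keep only the nodes whose usage has not yet dropped below threshold.
        pending = [
            path for path in pending
            if fetch_usage(path)["usage_percentage"] >= threshold
        ]
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return pending
```

An empty return value only confirms the usage figures; the alarm state itself should still be checked in NSX Manager.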

Note: In the case of XLarge load balancers, only Bare-Metal Edge Nodes support multiple XLarge load balancers. Therefore, this workaround requires Bare-Metal Edge Nodes when deploying multiple XLarge load balancers.

Maintenance window required for remediation?
Yes, relocating load balancers or deploying new Edge nodes may require a maintenance window depending on your deployment configuration and redundancy requirements.

Additional Information

For additional information regarding the "Read load balancer usage for the given node" API, please refer to the NSX REST API Guide.

If you are contacting Broadcom support about this issue, please provide the following:

NSX Manager support bundles.
NSX Edge support bundles.
Text of any error messages seen in NSX GUI or command lines pertinent to the investigation.
Output from the load balancer usage API for the affected Edge node(s).
Edge cluster configuration details.

Handling Log Bundles for offline review with Broadcom support