vSphere Supervisor with NSX-T NCP Load Balancer Service Created but External-IP <pending> due to Load Balancer Member Limit Reached - LBS exceeded limit



Article ID: 394546


Products

VMware vSphere Kubernetes Service
Tanzu Kubernetes Runtime
VMware vSphere 7.0 with Tanzu
VMware NSX for vSphere
vSphere with Tanzu

Issue/Introduction

In a vSphere Supervisor cluster environment that uses NSX-T and the NSX-NCP pod for load balancing, a new service of type LoadBalancer shows the following symptoms:

  • The new LoadBalancer service does not receive an external IP address and shows as <pending>:
    • kubectl get svc -n <namespace> | grep pending

      NAMESPACE     NAME                          TYPE           CLUSTER-IP      EXTERNAL-IP
      <namespace>   <loadbalancer service name>   LoadBalancer   <internal IP>   <pending>
  • When describing the new LoadBalancer service stuck in <pending>, error messages similar to the following are present, where values enclosed in angle brackets <> vary by environment:
    • In the error below, <LB limit> is determined by the size of the load balancer associated with the namespace.
    • kubectl describe svc -n <namespace> <loadbalancer service name>

      nsx-container-ncp  LB Service <load balancer> limit exceeded: Unable to attach new resource <new member> to lbs <load balancer>: LBS exceeded limit of <LB limit>.

    • This service will also have the following annotation:
      • Annotations:     ncp/error.loadbalancer: LBS_LIMIT_EXCEEDED
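
Because NCP stamps affected services with this annotation, every impacted service can be listed at once. The following is a minimal sketch, assuming jq is available on the machine running kubectl:

  • kubectl get svc -A -o json | jq -r '.items[] | select(.metadata.annotations["ncp/error.loadbalancer"] == "LBS_LIMIT_EXCEEDED") | "\(.metadata.namespace)/\(.metadata.name)"'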

While connected to the Supervisor cluster context, the NSX-NCP pod log may show errors similar to the following:

  • kubectl get pods -n nsx-ncp
  • kubectl logs -n nsx-ncp <nsx ncp pod>

    The maximum size of pool members for <load balancer SIZE> load balancer service form factor is <load balancer size limit>, current size of pool members is <greater than or equal to the load balancer size limit>
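
To narrow the log output to the relevant message, the same log command can be piped through grep, for example:

  • kubectl logs -n nsx-ncp <nsx ncp pod> | grep -i "pool members"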

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

NSX 4.X

NCP (NSX-T Container Plugin) 4.X

Cause

The noted load balancer has reached its pool member limit, or the new service would push the member count beyond the load balancer's limit.

 

  • Load balancer autoscaling applies only to the number of virtual servers, not to the pool member count.

  • This is expected behavior by design, as relocating services can require downtime.

  • The default load balancer size is SMALL, which has a default limit of 300 pool members in NSX-T. However, NSX-NCP overrides this limit in vSphere Supervisor environments that use NSX-T.

  • NSX-NCP overrides the default load balancer pool member limit:
    • Small Pool Member Limit: 2,000
    • Medium Pool Member Limit: 2,000
    • Large Pool Member Limit: 6,000

 

In an NSX-T load balancer, pool members are created to distribute traffic among the backends. Each pool member is an object representing a unique pool-member IP + port combination; for example, a service that exposes 3 ports across 10 endpoint addresses consumes 30 pool members.
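
To estimate how many pool members each service consumes, the address-and-port combinations can be counted from the cluster's Endpoints objects. The following is a minimal sketch, assuming jq is installed and kubectl is connected to the affected cluster's context; it approximates the member count as (addresses x ports) per Endpoints object and lists the ten largest consumers. Because it counts the endpoints of every service, treat the result as an upper-bound approximation:

  • kubectl get endpoints -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \([.subsets[]? | ((.addresses // []) | length) * ((.ports // []) | length)] | add // 0)"' | sort -t: -k2 -rn | head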

Resolution

vSphere Supervisor currently does not support changing the size of an existing load balancer. The only solutions are the following:

  • Modify the existing workloads for the namespace and/or cluster to reduce the member count on the affected load balancer (the endpoint-count sketch in the Cause section above can help identify the heaviest services).

  • Create a new workload cluster with a larger load balancer size and migrate workloads to that new cluster.
    • NSX-NCP overrides the default load balancer pool member limit in vSphere Supervisor using NSX-T.
      • Small Pool Member Limit: 2,000
      • Medium Pool Member Limit: 2,000
      • Large Pool Member Limit: 6,000

  • Create multiple namespaces with workload clusters to disperse the migrated workloads across multiple load balancers.
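
For any of these approaches, the current LoadBalancer services can be listed across namespaces to plan which workloads to move, for example:

  • kubectl get svc -A | grep LoadBalancer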


WARNING:
Editing the nsx-ncp-config configmap in a vSphere Supervisor environment is not supported.

Changes made to the nsx-ncp-config configmap will be reverted when the Supervisor control plane nodes are recreated, such as during a Supervisor cluster upgrade.
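
The current NCP settings can still be reviewed read-only. A minimal sketch, assuming the configmap resides in the nsx-ncp namespace referenced above:

  • kubectl get configmap nsx-ncp-config -n nsx-ncp -o yaml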

 

Load Balancer Member Count Check

The following steps can be run from a machine that can reach the NSX Manager to confirm the status of the affected load balancer (a jq-based variant combining Steps 1 and 2 is sketched after these steps):

  1. Retrieve the affected NSX load balancer's ID, replacing <load balancer> with the load balancer name from the error message:

    • curl -k -u admin:'<password>' "https://<NSX_MANAGER>/api/v1/loadbalancer/services/" | grep -A4 "<load balancer>"
    • This returns output similar to the following:
    • "resource_type" : "LbService",
      "id" : "<load balancer ID>",
      "display_name" : "<load balancer>",
  2. Run the following API call to retrieve usage information for the affected load balancer, using the <load balancer ID> from the previous step:
    • curl -k -u admin:'<PASSWORD>' "https://<NSX_MANAGER>/api/v1/loadbalancer/services/<load balancer ID>/usage"
    • The output will be similar to the following, where values vary by environment; here the load balancer shows severity RED due to high or 100% usage:
    • {
        "service_id" : "<load balancer id>",
        "service_size" : "<load balancer SIZE>",
        "virtual_server_capacity" : ##,
        "pool_capacity" : ##,
        "pool_member_capacity" : <load balancer size limit>,
        "current_virtual_server_count" : ##,
        "current_pool_count" : ###,
        "current_pool_member_count" : ###,
        "usage_percentage" : ##.#,
        "severity" : "RED"
      }
    • Note: NSX-NCP overrides the default load balancer pool member limit in vSphere Supervisor.
      • This override is not reflected in the NSX API output above, which shows only the NSX-T defaults.

  3. With the load balancer ID from Step 1, the following API call retrieves the realtime status of the load balancer, including its pool members:
    • curl -k -u admin:'<password>' "https://<NSX_MANAGER>/api/v1/loadbalancer/services/<load balancer ID>/status?source=realtime"
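
If jq is available, Steps 1 and 2 can be combined into a single sequence. This is a sketch under the same assumptions as above (admin credentials and a reachable NSX Manager); the contains() match on display_name is illustrative and may need adjusting to the exact name from the error message:

  • LB_ID=$(curl -sk -u admin:'<password>' "https://<NSX_MANAGER>/api/v1/loadbalancer/services" | jq -r '.results[] | select(.display_name | contains("<load balancer>")) | .id')
  • curl -sk -u admin:'<password>' "https://<NSX_MANAGER>/api/v1/loadbalancer/services/${LB_ID}/usage" | jq '{service_size, pool_member_capacity, current_pool_member_count, usage_percentage, severity}'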