ERRORS: "LBS_LIMIT_EXCEEDED" and "The maximum size of pool members for SMALL load balancer service form factor is 300"

Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated (TKGi)
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

Scenario:

You experience issues with existing or new K8s Services of type LoadBalancer.
Your TKGI foundation is running with NSX-T as its Container Network Interface.
Your K8s cluster was provisioned with a Small LoadBalancer (the default).
You have one or more Services of type LoadBalancer deployed.
You are unable to create new LoadBalancer Services.
You are unable to recreate previously running LoadBalancer Services.

Errors:

One or more of your LoadBalancer Services shows this annotation:

ncp/error.loadbalancer_endpoints: LBS_LIMIT_EXCEEDED

The NCP logs on your K8s master node report the following:

"The maximum size of pool members for SMALL load balancer service form factor is 300, current size of pool members is 301"

State:

One or more of your LoadBalancer Services shows as Pending.
In the NSX Manager, your cluster LoadBalancer status shows as DOWN.
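
To find the affected Services quickly, the output of `kubectl get svc --all-namespaces -o json` can be filtered for the annotation and the Pending state described above. The following is a minimal sketch, not a supported tool. The function name `find_affected_services` and the sample data are hypothetical, and the sketch assumes a Service is reported as Pending when `status.loadBalancer.ingress` is empty.

```python
NCP_ERROR_ANNOTATION = "ncp/error.loadbalancer_endpoints"


def find_affected_services(svc_list):
    """Return (namespace, name, ncp_error, pending) for suspect LoadBalancer Services.

    svc_list is the parsed JSON from: kubectl get svc --all-namespaces -o json
    """
    affected = []
    for svc in svc_list.get("items", []):
        # Only Services of type LoadBalancer are relevant to this issue.
        if svc.get("spec", {}).get("type") != "LoadBalancer":
            continue
        meta = svc.get("metadata", {})
        annotations = meta.get("annotations") or {}
        ncp_error = annotations.get(NCP_ERROR_ANNOTATION)
        # Assumption: no ingress entry means the Service is still Pending.
        ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress")
        pending = not ingress
        if ncp_error or pending:
            affected.append((meta.get("namespace"), meta.get("name"), ncp_error, pending))
    return affected


# Hypothetical sample shaped like kubectl's JSON output.
sample = {
    "items": [
        {
            "metadata": {
                "namespace": "default",
                "name": "web",
                "annotations": {NCP_ERROR_ANNOTATION: "LBS_LIMIT_EXCEEDED"},
            },
            "spec": {"type": "LoadBalancer"},
            "status": {"loadBalancer": {}},
        },
        {
            "metadata": {"namespace": "default", "name": "api"},
            "spec": {"type": "LoadBalancer"},
            "status": {"loadBalancer": {"ingress": [{"ip": "10.1.1.5"}]}},
        },
        {
            "metadata": {"namespace": "default", "name": "db"},
            "spec": {"type": "ClusterIP"},
            "status": {"loadBalancer": {}},
        },
    ]
}

for row in find_affected_services(sample):
    print(row)
```

In practice the input would come from something like `json.load(sys.stdin)` with the kubectl output piped in.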

Issue Details:

Use the NSX REST API to get the usage of the primary Load Balancer.
If usage_percentage shows 100.0, you have run into this issue.

Using the NSX REST API call to get LB Usage:

Locate the primary k8s cluster Load Balancer from the NSX Manager UI:
Identify the k8s cluster UUID

tkgi cluster CLUSTER_NAME | grep UUID

Example cluster UUID: 9ad567e6-3535-4fa3-a2d1-51804f7904ef
Log in to the NSX Manager UI
Go to Networking
Select the Load Balancing tab on the left
Select Load Balancers tab
Find the Name of the primary Load Balancer. The format is pks-<UUID>:

Example: pks-9ad567e6-3535-4fa3-a2d1-51804f7904ef
Checkmark the Load Balancer from the list of Names
Copy the Load Balancer ID from the right pane to use in the below API call.

Example: b6a5bffa-f6a9-447c-936f-4633f9eb43g1

Example REST API call from above:

curl -k -u admin:'<PASSWORD>' "https://<NSX_MANAGER>/api/v1/loadbalancer/services/b6a5bffa-f6a9-447c-936f-4633f9eb43g1/usage"

{
  "service_id" : "b6a5bffa-f6a9-447c-936f-4633f9eb43g1",
  "service_size" : "SMALL",
  "virtual_server_capacity" : 20,
  "pool_capacity" : 60,
  "pool_member_capacity" : 300,
  "current_virtual_server_count" : 16,
  "current_pool_count" : 14,
  "current_pool_member_count" : 300,
  "usage_percentage" : 100.0,
  "severity" : "RED"
}

For more details on this NSX API call, refer to:

Read the usage information of the given load balancer service

or

Read Load Balancer Service Usage
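
If you prefer to script the call rather than use curl, the same request can be sketched in Python with only the standard library. This is an illustrative sketch under stated assumptions: the helper names (`usage_url`, `build_usage_request`, `fetch_lb_usage`) are hypothetical, basic authentication is assumed as in the curl example, and certificate verification is disabled to mirror `curl -k`, which you may not want outside a lab.

```python
import base64
import json
import ssl
import urllib.request


def usage_url(nsx_manager, lb_service_id):
    """Build the usage endpoint URL used in the curl example."""
    return f"https://{nsx_manager}/api/v1/loadbalancer/services/{lb_service_id}/usage"


def build_usage_request(nsx_manager, lb_service_id, user, password):
    """Prepare a GET request with an HTTP basic auth header (like curl -u)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        usage_url(nsx_manager, lb_service_id),
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def fetch_lb_usage(nsx_manager, lb_service_id, user, password):
    """Send the request and return the parsed JSON usage document."""
    # Mirrors curl -k: certificate verification is skipped (assumption: lab use).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    req = build_usage_request(nsx_manager, lb_service_id, user, password)
    with urllib.request.urlopen(req, context=ctx, timeout=30) as resp:
        return json.load(resp)


# Example (not executed here, requires a reachable NSX Manager):
# usage = fetch_lb_usage("<NSX_MANAGER>", "<LB_SERVICE_ID>", "admin", "<PASSWORD>")
```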

LB Usage output details:

Notice usage_percentage is 100.0
The current_pool_member_count is >= pool_member_capacity. For a Small LB, pool_member_capacity is 300.
The severity shows as "RED".
Your current_virtual_server_count may be lower than virtual_server_capacity. The virtual server count is not relevant here. Refer to Cause.
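
The checks above can be expressed as a short script that compares each current count with its capacity. This is a minimal sketch: the function name `exhausted_limits` is hypothetical, and the sample dictionary simply repeats the field names and values from the example output in this article.

```python
def exhausted_limits(usage):
    """Return the names of the limits whose current count has reached capacity."""
    checks = {
        "virtual servers": ("current_virtual_server_count", "virtual_server_capacity"),
        "pools": ("current_pool_count", "pool_capacity"),
        "pool members": ("current_pool_member_count", "pool_member_capacity"),
    }
    return [
        name
        for name, (current_key, capacity_key) in checks.items()
        if usage[current_key] >= usage[capacity_key]
    ]


# Values copied from the example usage output in this article.
sample_usage = {
    "service_size": "SMALL",
    "virtual_server_capacity": 20,
    "pool_capacity": 60,
    "pool_member_capacity": 300,
    "current_virtual_server_count": 16,
    "current_pool_count": 14,
    "current_pool_member_count": 300,
    "usage_percentage": 100.0,
    "severity": "RED",
}

print(exhausted_limits(sample_usage))  # ['pool members']
```

With the example values only the pool member limit is reached, even though virtual servers and pools still have headroom.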

Environment

TKGI: 1.20.0 and lower
NSX-T: 4.1.2.1
NCP (NSX-T Container Plugin): 4.1.2.1

Cause

In this scenario the configured cluster LB has reached its pool member capacity.
Load Balancer autoscaling acts only on the number of virtual servers, not on the pool member count.
This is expected behavior and is by design, because relocating the services can require downtime.

Also:

TKGI does not currently support editing the cluster LB size for an existing cluster. (See External References)
You must modify existing workloads on the cluster OR create a new cluster with a larger LoadBalancer configuration to accommodate the workloads.
HOWEVER, it is possible you may have stale/orphaned NSX objects. Open a support case to have this looked into by an Engineer.
ALSO, there is a temporary workaround you may be able to apply to the existing cluster while you plan your infrastructure and workload changes.
Open a Support case for more information. Reference this article. An Engineer can assess your situation and the options if they apply in your scenario. (see Resolution)

Additional Cause Details:

When you deploy a Kubernetes cluster using Tanzu Kubernetes Grid Integrated Edition on NSX, an NSX Load Balancer is automatically provisioned. By default, the size of this load balancer is Small.
In an NSX-T load balancer, pool members are created to distribute traffic between them. Every pool member is an object containing a unique pool member IP + port combination.
The maximum total number of pool members across a Small LB is 300.
Refer to the NSX 4.1.2 Configuration Limits in the VMware Configuration Maximums documentation (see External References) for more details, such as:
- LoadBalancer sizes: Medium, Large, etc
- Virtual Servers per XXXX Load Balancer
- Pools per XXXX Load Balancer
- Pool Members per XXXX Load Balancer
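
To get a rough feel for how quickly the Small LB pool member limit can be reached, the arithmetic can be sketched as follows. This is an estimate only, not NCP's actual accounting: it assumes each Service's backends form one pool and that every unique pod IP + port pair in that pool counts as one member. The function name `estimate_pool_members`, the IP addresses, and the workload shape are hypothetical.

```python
SMALL_LB_POOL_MEMBER_CAPACITY = 300  # Small LB limit cited in this article


def estimate_pool_members(services):
    """Sum the unique (pod_ip, port) pairs of every Service.

    services maps a Service name to a list of (pod_ip, port) backends.
    """
    return sum(len(set(backends)) for backends in services.values())


# Hypothetical workload: 10 Services, 30 pods each, one port per Service.
services = {
    f"svc-{s}": [(f"10.0.{s}.{p}", 8080) for p in range(1, 31)]
    for s in range(10)
}

total = estimate_pool_members(services)
print(total, total >= SMALL_LB_POOL_MEMBER_CAPACITY)  # 300 True
```

Under these assumptions, scaling any one of these Services by a single pod would bring the count to 301, which matches the figure in the NCP log message quoted earlier.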

Resolution

Open a Support case and reference this article.
An Engineer can:
- Assess whether there are stale/orphaned NSX objects causing the issue; and assist in removing them.
- Discuss a temporary workaround you may be able to apply to your cluster while you update your infrastructure and workloads.
There is a product feature enhancement story to include autoscale functionality for pool members: (TCPF-312). (See External References)
- Contact your Account Team (not Product Support) for more details and any feature status. Mention this KB and the Feature ID above.

Additional Information

External References:

In TKGI, you cannot modify the LoadBalancer Size configuration on an existing cluster.
The default Load Balancer Size configuration is Small. With network profiles, "you can change the size of the load balancer deployed by NSX at the time of cluster creation".
Refer to:
- Resizing Load Balancers
- Size a Load Balancer and Load Balancer Sizing
Potential product enhancement ID: TCPF-312
NSX 4.1.2 Configuration Limits

Note:
Previous NSX-T product versions were referenced as NSX-T Data Center X.X.X (vs NSX X.X.X)

Example:

NSX-T Data Center 3.2.4 Configuration Limits

VMware Configuration Maximums

https://configmax.esp.vmware.com/home