VKS cluster login fails with 504 Bad Gateway error

Article ID: 421505


Products

VMware vSphere Kubernetes Service
VMware vCenter Server

Issue/Introduction

  • While trying to log in to a vSphere Kubernetes Service cluster via kubectl vsphere login, the error "504 Bad Gateway" is received (see the example commands after this list). 
  • Upon reviewing /var/log/vmware/vapi/endpoint/endpoint.log, the below entries are seen:

<date><time> | INFO  | or@24db9783{HTTP/1.1, (http/1.1)}{127.0.0.1:12346} | ConnectionLimit                |                                      | Connection Limit(550) reached for [ServerConnector@24db9783{HTTP/1.1, (http/1.1)}{127.0.0.1:12346}]
<date><time> | INFO  | or@24db9783{HTTP/1.1, (http/1.1)}{127.0.0.1:12346} | ConnectionLimit                |                                      | Connection Limit(550) reached for [ServerConnector@24db9783{HTTP/1.1, (http/1.1)}{127.0.0.1:12346}]

  • /var/log/vmware/vapi/endpoint/endpoint.log also shows that content library operations are failing:

<date><time> | ERROR | vAPI-I/O dispatcher-0     | SessionFacade                  | cb7d15###a2-66ee-99a##b-a194-######| Unexpected error occurred while executing the call with session <username>-c564##93-####-173b-403d-b1###2b-####@VSPHERE.LOCAL (internal id 1ee8b1cb) for method com.vmware.content.library.item.find.com.vmware.vapi.client.exception.TransportProtocolException: HTTP response with status code 504 (enable debug logging for details): stream timeout

  • There are Content libraries for which the underlying datastores are missing or have connectivity issues.
  • Node scale-up activities on guest clusters sometimes take 12+ hours to succeed.  
    • You may notice an intermittent issue where scaling up the number of worker nodes or creating a new worker pool (both of which require provisioning new VMs for nodes) results in the new Machine (Kubernetes Cluster API resource) being stuck in the `Pending` state. When this condition is left as-is, the new VM usually comes online overnight or after a delay of approximately 3-16 hours.
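
For example, a login attempt and a quick check for connection-limit entries might look like the following. The server address, namespace, cluster name, and username are placeholders for your environment; the log path is the one referenced above.

  # Login attempt that returns the 504 error (placeholder values)
  kubectl vsphere login --server=<supervisor-endpoint> \
      --tanzu-kubernetes-cluster-name=<cluster-name> \
      --tanzu-kubernetes-cluster-namespace=<namespace> \
      --vsphere-username=<user>@vsphere.local

  # On the vCenter Server, count how often the vapi-endpoint connection limit was reached
  grep -c "Connection Limit" /var/log/vmware/vapi/endpoint/endpoint.log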

Environment

  • VMware vCenter Server 8.x
  • VMware vSphere Kubernetes Service

Cause

  • When there are excessive API calls to the Content Library API through the vapi-endpoint service, the Content Library service does not respond in a timely manner and connections remain open, which results in depletion of vAPI connections (a quick check follows this list).
  • The Content Library service might also fail to respond in time due to other factors, such as issues with the datastores backing the configured content libraries. This prevents the execution of any content library operations that require activity threads.
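
To confirm that vAPI connections are being depleted rather than failing for another reason, a minimal check on the vCenter Server (assuming the local vapi-endpoint port 12346 shown in the log entries above) is to compare the number of established connections against the 550 connection limit:

  # Count established TCP connections involving the local vapi-endpoint port
  ss -tn | grep -c ':12346'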

Resolution

This is a known issue with the Content Library service, and Broadcom Engineering is working on a fix.

Workaround:

  • Validate and fix any underlying issues with the datastores backing the content libraries.
  • Increase thread pool size for Content Library:
    1. SSH to the vCenter Server.
    2. Take a backup of /etc/vmware-content-library/vdc.properties file:

      cp /etc/vmware-content-library/vdc.properties /etc/vmware-content-library/vdc.properties.bkp

    3. Edit /etc/vmware-content-library/vdc.properties using the vi editor.
    4. Press the Insert key and append the below property to the file:

      cls.activity.threadPool.size=48 

    5. Press the ESC key and type :wq! to save and exit the file.
    6. Restart the content library service:

      service-control --restart content-library 
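
    7. Optionally, confirm that the new property is in place and that the service came back up (the service name below is the same one used in the restart command above):

      # Confirm the new thread pool setting is present
      grep threadPool /etc/vmware-content-library/vdc.properties

      # Confirm the Content Library service is running
      service-control --status content-library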

Additional Information

See the following KB if there are issues with the datastore backing the content library: 

Datastore backing Content Library "does not exist, the storage backing might be removed, disconnected, or no longer accessible"
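
If the govc CLI (from the govmomi project) is available, it offers an optional, read-only way to list the configured content libraries and inspect their datastore backing before following the KB above; this is a convenience sketch and not part of the documented workaround. The same information is also visible in the vSphere Client under Content Libraries.

  # List configured content libraries
  govc library.ls

  # Show details (including the backing datastore) for a specific library
  govc library.info "<library-name>"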