Environments with a large number of guest clusters (100-200) are affected.
The login process sometimes takes 1-2 minutes and fails intermittently.
Error message when timing out:
DEBU[0007] Logging in to Tanzu Kubernetes cluster (#####) (#####)
DEBU[0067] Got response: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.25.2</center>
</body>
</html>
ERRO[0067] Login failed: invalid character '<' looking for beginning of value
ERRO[0067] Failed login to Tanzu Kubernetes cluster xxxxxx: invalid character '<' looking for beginning of value
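The "invalid character '<'" error indicates that the plugin received the nginx 504 HTML page where it expected a JSON response, so JSON parsing fails on the leading '<'. A minimal sketch of the same failure mode (in Python rather than the plugin's own code; the HTML body is abbreviated from the error output above):

```python
import json

# Body of the nginx 504 page returned instead of the expected JSON payload.
html_body = "<html>\n<head><title>504 Gateway Time-out</title></head>\n</html>"

try:
    json.loads(html_body)  # a JSON parser rejects the leading '<'
except json.JSONDecodeError as exc:
    print(f"Login failed: {exc}")
```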
Sometimes the login takes a long time:
[UserXX@v7p-bastion login]$ kubectl vsphere login \
--server=<SUPERVISOR_CONTROL_PLANE_IP> \
--vsphere-namespace=<VSPHERE_NAMESPACE> \
--tanzu-kubernetes-cluster-name=<TKG_CLUSTER_NAME> \
--insecure-skip-tls-verify \
-v=10
DEBU[2025-10-02 11:23:25.100] User passed verbosity level: 10
DEBU[2025-10-02 11:23:25.100] Setting verbosity level: 10
DEBU[2025-10-02 11:23:25.100] Setting request timeout:
DEBU[2025-10-02 11:23:25.100] login called as: /usr/bin/kubectl-vsphere login --server=<IP> --tanzu-kubernetes-cluster-namespace=infra-service --tanzu-kubernetes-cluster-name=testcluster -v=10
DEBU[2025-10-02 11:23:25.100] Creating wcp.Client for 172.xx.xx.xx
DEBU[2025-10-02 11:23:25.119] Got response:
Username: username@domain
INFO[2025-10-02 11:23:33.392] Using username@domain username.
DEBU[2025-10-02 11:23:33.392] KUBECTL_VSPHERE_PASSWORD environment variable is not set
Password:
<<stuck about 30s>>
DEBU[2025-10-02 11:24:07.685] Got response: [{"namespace":...]
DEBU[2025-10-02 11:24:07.796] Got response: {"session_id":...}
INFO[2025-10-02 11:24:07.798] User has existing context; will not override.
DEBU[2025-10-02 11:24:07.798] Logging in to Tanzu Kubernetes cluster (clustername) (XXXXX)
DEBU[2025-10-02 11:24:07.944] Got response: {"session_id":...}
INFO[2025-10-02 11:24:07.950] Successfully logged in to Tanzu Kubernetes cluster 172.xx.xx.xx
DEBU[2025-10-02 11:24:07.958] Trying to login to 10.xx.xx.xx
DEBU[2025-10-02 11:24:07.958] Creating wcp.Client for 10.xx.xx.xx
<<stuck about 30s>>
DEBU[2025-10-02 11:24:38.160] Got response: {"session_id":...}
Logged in successfully.
Affects Supervisor version v1.27.5 and above.
wcp-authproxy logs:
2025-06-11T13:51:01.085603988Z stderr F INFO:auth.filters:[140206217241472] User authenticated using basic token.
:
: (49 seconds elapsed before the cluster UID was returned)
2025-06-11T13:51:49.831196531Z stderr F DEBUG:vclib.token:[140206217241472] Cluster UID (28550edf-f0f8-4f81-98ff-4f3fc335ac04), server (10.XX.XX.XX)
The cluster UID is also fetched using API version v1alpha1:
DEBUG:apiserver.guestcluster:url: b'https://127.0.0.1:6443/apis/run.tanzu.vmware.com/v1alpha1/tanzukubernetesclusters'
The kube-apiserver logs show corresponding conversion webhook calls, like the following:
2025-06-11T13:51:01.615013367Z stderr F I0611 13:51:01.614789 1 trace.go:219] Trace[383739101]: "Call conversion webhook" custom-resource-definition:tanzukubernetesclusters.run.tanzu.vmware.com,desired-api-version:run.tanzu.vmware.com/v1alpha1,object-count:1,UID:bccba1c6-9df1-4ab2-be3a-5c679a5d962d (11-Jun-2025 13:51:01.142) (total time: 472ms):
2025-06-11T13:51:01.615047361Z stderr F Trace[383739101]: ---"Request completed" 471ms (13:51:01.614)
2025-06-11T13:51:01.615055556Z stderr F Trace[383739101]: [472.371375ms] [472.371375ms] END
2025-06-11T13:51:01.72069226Z stderr F I0611 13:51:01.720599 1 trace.go:219] Trace[1615218923]: "Call conversion webhook" custom-resource-definition:tanzukubernetesclusters.run.tanzu.vmware.com,desired-api-version:run.tanzu.vmware.com/v1alpha1,object-count:1,UID:03680026-b3c0-40eb-b929-e2ffd1b46eef (11-Jun-2025 13:51:01.616) (total time: 104ms):
The delay in response is caused by a combination of the following factors:
1. The number of TKCs managed by the Supervisor: the more TKC clusters, the longer the list call takes.
2. The authproxy lists TKCs using API version v1alpha1 instead of v1alpha3, which triggers the conversion webhook for each TKC and compounds the delay.
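The kube-apiserver traces above show each single-object conversion webhook call taking roughly 104-472 ms. A back-of-the-envelope sketch (per-call latencies assumed from those two traces) shows how this scales to a delay on the order of the observed 49 seconds at 100-200 TKCs:

```python
# Assumed per-object conversion webhook latency, in seconds, taken from the
# 104 ms and 472 ms "Call conversion webhook" traces above.
LATENCY_LOW, LATENCY_HIGH = 0.104, 0.472

for tkc_count in (100, 150, 200):
    low = tkc_count * LATENCY_LOW
    high = tkc_count * LATENCY_HIGH
    print(f"{tkc_count} TKCs: ~{low:.0f}-{high:.0f} s spent in webhook conversions per list")
```

At 100 TKCs the upper bound alone is about 47 seconds, consistent with the 49-second gap seen in the wcp-authproxy log.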
The issue will be fixed in a future release of the Supervisor.
Workaround:
Apply the following changes on all three Supervisor control plane VMs.
SSH to each Supervisor control plane node, then:
1. Edit the /var/lib/*/authproxy/apiserver/constants.py file and update the following:
TANZU_PREFIX = 'run.tanzu.vmware.com'
TANZU_RESOURCE_VERSION = 'v1alpha1'   ### <<<=== Change to v1alpha3
TANZU_KUBERNETES_CLUSTERS_RESOURCE = 'tanzukubernetesclusters'
Get the exact path to constants.py by using: find /var/lib -name constants.py | grep apiserver
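The edit in step 1 can also be scripted. The sketch below is a hypothetical helper (not a supported VMware tool) that rewrites the version line in the constants.py text:

```python
import re

def patch_resource_version(source: str) -> str:
    """Replace the v1alpha1 resource version with v1alpha3 in constants.py text."""
    return re.sub(
        r"(TANZU_RESOURCE_VERSION\s*=\s*')v1alpha1(')",
        r"\1v1alpha3\2",
        source,
    )

# Example run against the relevant line from constants.py:
before = "TANZU_RESOURCE_VERSION = 'v1alpha1'"
print(patch_resource_version(before))  # TANZU_RESOURCE_VERSION = 'v1alpha3'
```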
2. Edit /var/lib/*/authproxy/apiserver/response.py and make the following change:
From
endpoints = clusterObject['status']['clusterApiStatus']['apiEndpoints']
To
endpoints = clusterObject['status']['apiEndpoints']
Get the exact path to response.py by using: find /var/lib -name response.py | grep apiserver
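The response.py change is needed because the status schema differs between API versions: v1alpha1 nests the endpoints under status.clusterApiStatus.apiEndpoints, while v1alpha3 exposes them directly at status.apiEndpoints. A sketch with illustrative (made-up) status payloads:

```python
# Illustrative status payloads: the field nesting matches the two lookups
# shown above, but the endpoint values themselves are made up.
v1alpha1_obj = {
    "status": {"clusterApiStatus": {"apiEndpoints": [{"host": "10.0.0.1", "port": 6443}]}}
}
v1alpha3_obj = {
    "status": {"apiEndpoints": [{"host": "10.0.0.1", "port": 6443}]}
}

# Old lookup (v1alpha1 schema):
endpoints_old = v1alpha1_obj["status"]["clusterApiStatus"]["apiEndpoints"]
# New lookup (v1alpha3 schema), matching the edited line in response.py:
endpoints_new = v1alpha3_obj["status"]["apiEndpoints"]
print(endpoints_old == endpoints_new)  # same endpoints, different nesting
```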
3. Get the authproxy container id using: crictl ps | grep authproxy
Example:
crictl ps | grep authproxy
d815b7fd31e2d 8d70421b0c268 Less than a second ago Running wcp-authproxy 2 741d133b8d37b wcp-authproxy-42316a8afe01a361f440ab54f584ab3d
4. Stop the authproxy container: crictl stop <container_id>
Example:
crictl stop d815b7fd31e2d
5. Verify that the authproxy container has been restarted automatically: crictl ps | grep authproxy