kubectl login to a guest cluster takes a long time and times out intermittently with error "invalid character '<' looking for beginning of value"
Article ID: 401758

Products

VMware vSphere Kubernetes Service

Issue/Introduction

Environments with many guest clusters (100-200) are affected.

The login process sometimes takes 1-2 minutes and fails intermittently.

Error message when timing out:

DEBU[0007] Logging in to Tanzu Kubernetes cluster (#####) (#####)  
DEBU[0067] Got response: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.25.2</center>
</body>
</html>

ERRO[0067] Login failed: invalid character '<' looking for beginning of value
ERRO[0067] Failed login to Tanzu Kubernetes cluster xxxxxx: invalid character '<' looking for beginning of value

Login sometimes takes a long time:

[UserXX@v7p-bastion login]$ kubectl vsphere login \
   --server=<SUPERVISOR_CONTROL_PLANE_IP> \
   --vsphere-namespace=<VSPHERE_NAMESPACE> \
   --tanzu-kubernetes-cluster-name=<TKG_CLUSTER_NAME> \
   --insecure-skip-tls-verify \
   -v=10
DEBU[2025-10-02 11:23:25.100] User passed verbosity level: 10
DEBU[2025-10-02 11:23:25.100] Setting verbosity level: 10
DEBU[2025-10-02 11:23:25.100] Setting request timeout:
DEBU[2025-10-02 11:23:25.100] login called as: /usr/bin/kubectl-vsphere login --server=<IP> --tanzu-kubernetes-cluster-namespace=infra-service --tanzu-kubernetes-cluster-name=testcluster -v=10
DEBU[2025-10-02 11:23:25.100] Creating wcp.Client for 172.xx.xx.xx
DEBU[2025-10-02 11:23:25.119] Got response:

Username: username@domain
INFO[2025-10-02 11:23:33.392] Using username@domain   username.
DEBU[2025-10-02 11:23:33.392] KUBECTL_VSPHERE_PASSWORD environment variable is not set
Password:

<<stuck about 30s>>

DEBU[2025-10-02 11:24:07.685] Got response: [{"namespace":...]
DEBU[2025-10-02 11:24:07.796] Got response: {"session_id":...}
INFO[2025-10-02 11:24:07.798] User has existing context; will not override.
DEBU[2025-10-02 11:24:07.798] Logging in to Tanzu Kubernetes cluster (clustername) (XXXXX)
DEBU[2025-10-02 11:24:07.944] Got response: {"session_id":...}
INFO[2025-10-02 11:24:07.950] Successfully logged in to Tanzu Kubernetes cluster 172.xx.xx.xx
DEBU[2025-10-02 11:24:07.958] Trying to login to 10.xx.xx.xx
DEBU[2025-10-02 11:24:07.958] Creating wcp.Client for 10.xx.xx.xx

<<stuck about 30s>>

DEBU[2025-10-02 11:24:38.160] Got response: {"session_id":...}
Logged in successfully.

Environment

Supervisor version v1.27.5 and above

Cause

wcp-authproxy logs:

2025-06-11T13:51:01.085603988Z stderr F INFO:auth.filters:[140206217241472] User authenticated using basic token.

... (took 49 seconds to get the cluster UID)

2025-06-11T13:51:49.831196531Z stderr F DEBUG:vclib.token:[140206217241472] Cluster UID (28550edf-f0f8-4f81-98ff-4f3fc335ac04), server (10.XX.XX.XX)

Also, the UID is fetched using API version v1alpha1:

DEBUG:apiserver.guestcluster:url: b'https://127.0.0.1:6443/apis/run.tanzu.vmware.com/v1alpha1/tanzukubernetesclusters'

The kube-apiserver logs show the resulting conversion webhook calls:

2025-06-11T13:51:01.615013367Z stderr F I0611 13:51:01.614789 1 trace.go:219] Trace[383739101]: "Call conversion webhook" custom-resource-definition:tanzukubernetesclusters.run.tanzu.vmware.com,desired-api-version:run.tanzu.vmware.com/v1alpha1,object-count:1,UID:bccba1c6-9df1-4ab2-be3a-5c679a5d962d (11-Jun-2025 13:51:01.142) (total time: 472ms):
2025-06-11T13:51:01.615047361Z stderr F Trace[383739101]: ---"Request completed" 471ms (13:51:01.614)
2025-06-11T13:51:01.615055556Z stderr F Trace[383739101]: [472.371375ms] [472.371375ms] END
2025-06-11T13:51:01.72069226Z stderr F I0611 13:51:01.720599 1 trace.go:219] Trace[1615218923]: "Call conversion webhook" custom-resource-definition:tanzukubernetesclusters.run.tanzu.vmware.com,desired-api-version:run.tanzu.vmware.com/v1alpha1,object-count:1,UID:03680026-b3c0-40eb-b929-e2ffd1b46eef (11-Jun-2025 13:51:01.616) (total time: 104ms):

The delay in response is caused by a combination of the following factors:

1. The number of TanzuKubernetesCluster (TKC) resources managed by the Supervisor.
2. The authproxy lists TKCs using API version v1alpha1 instead of v1alpha3, which triggers the conversion webhook for every TKC and compounds the delay.
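The webhook overhead can be observed directly (a sketch; assumes kubectl access to the Supervisor and that v1alpha3 is a served version, as the workaround below implies; resource and group names are taken from the logs above):

```shell
# Listing via v1alpha3 avoids conversion and should return quickly:
time kubectl get tanzukubernetesclusters.v1alpha3.run.tanzu.vmware.com -A > /dev/null

# Listing via v1alpha1 calls the conversion webhook once per TKC,
# so the elapsed time grows with the number of clusters:
time kubectl get tanzukubernetesclusters.v1alpha1.run.tanzu.vmware.com -A > /dev/null
```

The difference between the two timings approximates the per-login conversion cost the authproxy pays on every listing.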

Resolution

This issue will be fixed in a future release of the Supervisor.

Workaround:

Apply the following changes on all three Supervisor control plane nodes.

SSH to each Supervisor control plane node, then:

1. Edit the /var/lib/*/authproxy/apiserver/constants.py file and update the following:

TANZU_PREFIX = 'run.tanzu.vmware.com'
TANZU_RESOURCE_VERSION = 'v1alpha1'           ### <<<=== Change to v1alpha3
TANZU_KUBERNETES_CLUSTERS_RESOURCE = 'tanzukubernetesclusters'

Get the exact path to constants.py by using: find /var/lib -name constants.py  | grep apiserver
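The edit above can also be scripted (a sketch; the .bak backup suffix is illustrative, and it assumes the find command returns a single match):

```shell
# Locate the authproxy constants.py (the /var/lib subdirectory name varies)
F=$(find /var/lib -name constants.py | grep apiserver)
# Keep a .bak copy, then switch the API version used to list TKCs
sed -i.bak "s/TANZU_RESOURCE_VERSION = 'v1alpha1'/TANZU_RESOURCE_VERSION = 'v1alpha3'/" "$F"
# Confirm the change took effect
grep TANZU_RESOURCE_VERSION "$F"
```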


2. Edit /var/lib/*/authproxy/apiserver/response.py and change the following:

From

endpoints = clusterObject['status']['clusterApiStatus']['apiEndpoints']

To

endpoints = clusterObject['status']['apiEndpoints']

Get the exact path to response.py by using: find /var/lib -name response.py  | grep apiserver
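This edit can be scripted the same way (a sketch; assumes the line appears exactly as shown above):

```shell
# Locate the authproxy response.py
F=$(find /var/lib -name response.py | grep apiserver)
# Replace the v1alpha1 status path with the v1alpha3 one shown above
sed -i.bak "s/\['status'\]\['clusterApiStatus'\]\['apiEndpoints'\]/['status']['apiEndpoints']/" "$F"
# Confirm the change took effect
grep apiEndpoints "$F"
```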


3. Get the authproxy container id using: crictl ps | grep authproxy

Example 

crictl ps | grep authproxy
d815b7fd31e2d       8d70421b0c268       Less than a second ago   Running             wcp-authproxy                 2                   741d133b8d37b       wcp-authproxy-42316a8afe01a361f440ab54f584ab3d


4. Stop the authproxy container: crictl stop <container_id>

Example:

crictl stop d815b7fd31e2d

 

5. Verify that the authproxy container was automatically restarted: crictl ps | grep authproxy
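Steps 3-5 can be combined into one small script (a sketch; assumes the container name matches wcp-authproxy as in the example output, and that kubelet restarts the container within a few seconds):

```shell
# Extract the id of the running authproxy container
CID=$(crictl ps | awk '/wcp-authproxy/ {print $1; exit}')
# Stop it; kubelet restarts it, picking up the edited Python files
crictl stop "$CID"
sleep 5
# A fresh container (only seconds old) should be listed again
crictl ps | grep authproxy
```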