vSphere Kubernetes Kubectl vSphere Login to Supervisor Cluster Context and any TKG Service Cluster Context Failing with "Failed to get available workloads: bad gateway" due to NTP Time Sync Issues


Article ID: 390909


Updated On:

Products

VMware vSphere Kubernetes Service
VMware vSphere 7.0 with Tanzu
vSphere with Tanzu
Tanzu Kubernetes Runtime
VMware vCenter Server
VMware vCenter Server 8.0
VMware vCenter Server 7.0

Issue/Introduction

  • All kubectl vsphere login attempts fail, regardless of the context specified or the user logging in.
    • Supervisor Cluster Context:
      • kubectl vsphere login --vsphere-username <SSO@username> --server=<Workload Management Supervisor Control Plane Node IP Address>
    • TKG Service Cluster Context:
      • kubectl vsphere login --server <Workload Management Supervisor Control Plane Node IP Address> --vsphere-username <SSO@username> --tanzu-kubernetes-cluster-namespace <namespace> --tanzu-kubernetes-cluster-name <clusterName> 
    • time="YYYY-MM-DDTHH:MM:SS-00:00" level=fatal msg="Failed to get available workloads, response from the server was invalid."


  • When the login attempt is repeated with a higher verbosity flag as shown below, the following bad gateway error message is returned after the response from the wcp.Client:
    • kubectl vsphere login --server <Workload Management Supervisor Control Plane Node IP Address> --vsphere-username <SSO@username> --tanzu-kubernetes-cluster-namespace <namespace> --tanzu-kubernetes-cluster-name <clusterName> -v10
    • time="YYYY-MM-DDTHH:MM:SS-00:00" level=debug msg="Creating wcp.Client for <Workload Management Supervisor Control Plane Node IP Address>."

      time="YYYY-MM-DDTHH:MM:SS-00:00" level=info msg="Does not appear to be a vCenter or ESXi address."

      time="YYYY-MM-DDTHH:MM:SS-00:00" level=debug msg="Got response: \n"

      time="YYYY-MM-DDTHH:MM:SS-00:00" level=info msg="Using <SSO@username> as username."

      time="YYYY-MM-DDTHH:MM:SS-00:00" level=debug msg="Env variable KUBECTL_VSPHERE_PASSWORD is present \n"

      time="YYYY-MM-DDTHH:MM:SS-00:00" level=debug msg="Error while getting list of workloads: bad gateway\nPlease contact your vSphere server administrator for assistance."


  • The wcpsvc logs from the vCenter Server Appliance (VCSA) show token expiry errors around the time of the login attempt(s), similar to the below (a log search sketch follows this list):
    • cat /var/log/vmware/wcp/wcpsvc.log
    • err: HTTP request failed: POST, url: https://<VCSA-FQDN/URL>:443/rest/vcenter/tokenservice/token-exchange, code:500, body: '{"type":"com.vmware.vcenter.tokenservice.invalid_grant","value":{"messages":[{"args":[],"default_message":"Invalid Subject token: tokenType=SAML2","id":"com.vmware.vcenter.tokenservice.exceptions.InvalidGrant"},{"args":[],"default_message":"Token expiration date: DAY MON DD HH:MM:SS GMT YYYY is in the past.","id":"com.vmware.identity.saml.InvalidTokenException"},{"args":[],"default_message":"Token expiration date: DAY MON DD HH:MM:SS GMT YYYY is in the past.","id":"com.vmware.vim.sso.client.exception.InvalidTimingException"}}}}'

    • However, the /var/log/vmware/vpxd/vpxd-svcs logs do not show any issues generating the SAML token.


  • While connected to the Supervisor cluster context:
    • The wcp-authproxy logs show a similar SAML Token error message around the time of the login attempt(s):
      • kubectl get pods -A | grep authproxy

        kubectl logs -n <authproxy namespace> <wcp-authproxy pod name>
      • InternalServerError - occurred on authorization: the SAML token was not exchanged, as it is expired, invalid or absent

  • Supervisor cluster certificates are not expired as per the following KB:
  • There is no AD account naming issue as per the below KB:
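
As referenced above, the sketch below narrows the relevant log entries on the VCSA and in the Supervisor cluster. The grep patterns are illustrative only; adjust them to the timestamps of the failed login attempts.

  # On the VCSA: look for token exchange failures in the wcpsvc log
  grep -i "tokenservice" /var/log/vmware/wcp/wcpsvc.log | grep -i "expiration"
  # From the Supervisor cluster context: check the wcp-authproxy pod for SAML errors
  kubectl get pods -A | grep authproxy
  kubectl logs -n <authproxy namespace> <wcp-authproxy pod name> | grep -i "saml"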

 

For more information on kubectl vsphere login and contexts with SSO, please see the documentation below:

TechDocs - kubectl vsphere login to the Supervisor Cluster Context

TechDocs - kubectl vsphere login to the TKG Service Cluster Context (also known as Guest Cluster or vSphere Kubernetes Cluster)

Environment

vSphere with Tanzu 7.0

vSphere with Tanzu 8.0

This issue can occur regardless of whether or not the clusters are managed by Tanzu Mission Control (TMC).

Cause

There is an NTP time sync, time skew, or time drift issue between the Supervisor Cluster and the vCenter/VCSA.

This can occur when the components that make up the environment (vCenter, ESXi hosts, Supervisor Cluster) use different time servers or have inconsistent time server configurations.

Time sync issues can occur when there is a difference of five minutes or more between components.

Resolution

Locate the component within the environment that is experiencing the time difference and re-sync its time.

This KB provides checks for each component: vCenter, ESXi host, Supervisor Cluster, and vSphere Kubernetes Cluster. A quick way to compare clocks across components is shown below.
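
A quick way to locate the offending component, before walking through each check below, is to capture the current UTC time from each system within a few seconds of each other and compare. This is a minimal sketch, assuming SSH access to the VCSA (with the bash shell enabled) and to a Supervisor control plane VM using the standard credential retrieval procedures; the addresses are placeholders for this environment.

  # Run these back to back and compare the epoch seconds shown in parentheses.
  date -u +"jump-host   %Y-%m-%dT%H:%M:%SZ (%s)"
  ssh root@<vCenter-FQDN> 'date -u +"vcsa        %Y-%m-%dT%H:%M:%SZ (%s)"'
  ssh root@<Supervisor-Control-Plane-IP> 'date -u +"supervisor  %Y-%m-%dT%H:%M:%SZ (%s)"'
  # Repeat against a vSphere Kubernetes Cluster node if one is reachable.
  # A difference of roughly 300 seconds (5 minutes) or more between any two
  # outputs identifies the component whose time needs to be re-synced.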

 

Check vCenter/VCSA Time Sync

  1. Log in to the vCenter Server Management Interface (VAMI) as root:
    • https://<vCenter-FQDN>:5480
  2. Click Time in the left navigation pane

  3. Verify that the time zone, time servers, and current server time are correct.
    • Note: This issue can occur if these time servers differ from the time servers configured on the ESXi host(s) in the environment.

  4. The current time on the VCSA can also be checked by running the below command while SSH'd into the VCSA with the bash shell enabled:
    • date
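
For a closer look from the same SSH session, the sketch below checks whether the appliance's NTP daemon is running and which servers it is configured to use. It assumes the appliance is set to NTP-based time sync rather than host time sync; service and file names can differ slightly between VCSA versions.

  date -u                      # current UTC time on the appliance
  systemctl status ntpd        # should be active (running) when NTP sync is configured
  cat /etc/ntp.conf            # lists the NTP servers the appliance is using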

 

Check ESXi Hosts Time Sync

  1. Log in to the vSphere web client

  2. Navigate under Inventory to the ESXi hosts used in the environment

  3. Click on Configure -> System -> Time Configuration on the ESXi host

  4. Verify that the Time Synchronization and Network Time Protocol sections are reporting correctly and do not show any issues.
    • Note: This issue can occur if the VAMI time servers are different than the time servers on the ESXi host(s) in the environment.
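
The same details can also be pulled from an SSH session on the ESXi host. The commands below use standard esxcli namespaces; output formatting can vary by ESXi version.

  esxcli system time get       # current system time on the host (UTC)
  esxcli system ntp get        # whether NTP is enabled and the configured servers
  esxcli hardware clock get    # hardware clock, for comparison against the system time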

 

Check Supervisor Cluster Time Sync

  1. SSH into one of the Supervisor control plane VMs as root:
  2. The current time on the Supervisor control plane VM can be checked by running the below command:
    • date
  3. Run the following commands to check the status of NTP on the Supervisor control plane VM:
    • timedatectl show
    • timedatectl status
    • timedatectl timesync-status
    • timedatectl show-timesync
  4. Further troubleshooting steps for NTP and DNS can be found in Common issues with a vSphere with Tanzu Cluster deployment stuck in Configuring state
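
As a hedged illustration of what to look for, a healthy control plane VM reports output along the following lines; exact field names can vary with the systemd version in the Supervisor release.

  timedatectl status
  #   System clock synchronized: yes     <- must be "yes"
  #   NTP service: active                <- the time sync service must be running
  timedatectl timesync-status
  #   Server: <NTP server address>       <- should match the NTP servers configured for Workload Management
  #   Offset: a healthy value is fractions of a second; an offset of minutes confirms the skew behind this issue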

 

Check vSphere Kubernetes Cluster Time Sync

  1. SSH to a node in the vSphere Kubernetes Cluster as vmware-system-user:
  2. Confirm the status of the time sync services:
    • chronyd checks:
      • systemctl status chronyd
      • cat /etc/chrony.conf
    • timesyncd checks:
      • systemctl status systemd-timesyncd
      • cat /etc/systemd/timesyncd.conf
  3. The node's time and NTP synchronization state can be checked by running the following command:
    • timedatectl status
  4. If a private NTP server is used, please reach out to VMware by Broadcom Technical Support referencing this KB article.
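
If systemctl shows chronyd as the active time service on the node, chrony's own client can report the configured sources and the current offset. This is a minimal sketch; it does not apply to nodes running systemd-timesyncd.

  chronyc sources -v     # lists NTP sources; the selected source is marked with '*'
  chronyc tracking       # estimated offset from the selected source; a large "System time" value confirms the skew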

Additional Information

Generating and populating a kubeconfig does not work in this scenario because the SAML token is considered expired before the user can log in.