kube-apiserver logs errors "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"

Article ID: 378252


Products

VMware vSphere with Tanzu

Issue/Introduction

System Pods log various "Unauthorized" error messages like the ones below:

kube-apiserver:

E0926 08:17:12.258930       1 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
E0926 08:17:12.260342       1 authentication.go:73] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
:

or

E0925 06:29:59.136187       1 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, service account token is not valid yet]"
E0925 06:29:59.310443       1 authentication.go:70] "Unable to authenticate the request" err="[invalid bearer token, service account token is not valid yet]"
:


kube-controller-manager:

2024-09-26T08:17:12.262161615Z stderr F E0926 08:17:12.261868       1 resource_quota_controller.go:417] failed to discover resources: the server has asked for the client to provide credentials
2024-09-26T08:17:12.262681883Z stderr F W0926 08:17:12.262532       1 garbagecollector.go:754] failed to discover preferred resources: the server has asked for the client to provide credentials


etcdserver:

2024-09-26T08:17:05.534198795Z stderr F {"level":"warn","ts":"2024-09-26T08:17:05.533987Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"masked-peer-id","clock-drift":"59m27.567744675s","rtt":"11.061855ms"}
2024-09-26T08:17:05.537472675Z stderr F {"level":"warn","ts":"2024-09-26T08:17:05.537303Z","caller":"rafthttp/probing_status.go:82","msg":"prober found high clock drift","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"masked-peer-id","clock-drift":"59m27.571619047s","rtt":"3.30468ms"}


metrics-server:

2024-09-26T08:18:39.457241Z stderr F E0926 08:18:39.455753       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="tkc1-cxfc5-q74z5"
2024-09-26T08:19:39.432411267Z stderr F E0926 08:19:39.430930       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="tkc1-cxfc5-q74z5"
2024-09-26T08:20:39.437397588Z stderr F E0926 08:20:39.437186       1 scraper.go:140] "Failed to scrape node" err="request failed, status: \"401 Unauthorized\"" node="tkc1-cxfc5-q74z5"


guest-cluster-auth-svc:

2024-09-26T08:17:12.253849165Z stderr F E0926 08:17:12.253415       1 token_review_endpoint.go:94] Invalid token: failed to validate JWT
2024-09-26T08:17:12.254337439Z stderr F E0926 08:17:12.254178       1 token_review_endpoint.go:94] Invalid token: failed to validate JWT
:


guest-cluster-cloud-provider:

:
W0926 08:58:56.828195       1 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.ConfigMap: Unauthorized
E0926 08:58:56.828270       1 reflector.go:140] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Unauthorized

 

Newly deployed or recreated Pods keep restarting or crashing and show <invalid> in the AGE or RESTARTS column, as below:

$ kubectl get pod -A | grep metrics
kube-system                    metrics-server-6955c5ffd6-65bbt                             0/1     Running   3 (10s ago)          <invalid>
kube-system                    metrics-server-cf65bc777-fvpr4                              1/1     Running   0                    <invalid>
:
vmware-system-csi              vsphere-csi-controller-f94c88948-6qjwq                      6/6     Running   4 <invalid>  15d
:

 

Environment

vSphere with Tanzu

VMware Tanzu Kubernetes Grid Multicloud

VMware Tanzu Kubernetes Grid Integrated

 

 

Cause

One of the control plane nodes had a heavily drifted clock, almost an hour off from the other nodes.

A few common reasons for time/clock drift are:

  • The time was set incorrectly by an operator.
  • A reboot caused the system clock to sync with an incorrect hardware clock.
  • NTP is unable to sync time with the NTP server due to connectivity issues (see the example check after this list).
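
For example, on a Tanzu Kubernetes cluster node (reachable as vmware-system-user, as described under Additional Information below), the following commands can confirm whether NTP synchronization is actually working. This is only a sketch and assumes the node uses systemd-timesyncd (and a recent systemd for the timesync-status subcommand); nodes running chrony would use "chronyc tracking" instead:

$ timedatectl status            # look for "System clock synchronized: yes"
$ timedatectl timesync-status   # shows the NTP server in use and the last poll
$ sudo journalctl -u systemd-timesyncd --no-pager | tail   # recent sync attempts and connectivity errors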

Resolution

  • Make sure the time on the vCenter Server and the ESXi hosts is synchronized.
  • Make sure the hardware clock is not far off from the system time on all the nodes. You can use the command "sudo hwclock --systohc" to sync the hardware clock with the current system clock (see the example commands after this list).
  • When time is configured manually, validate whether the clock is set to local time or UTC.
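
As a quick check on each node, compare the system clock and the hardware clock before forcing a resync. This is only a sketch; the exact hwclock output format varies slightly between distributions:

$ date -u                  # current system time in UTC
$ sudo hwclock --show      # current hardware (RTC) time
$ sudo hwclock --systohc   # once the system time is correct, write it back to the hardware clock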

Additional Information

You can run the simple script below from a Supervisor Control Plane VM (CPVM) to check the system time of all the nodes in a given TKC.

# for ip in `kubectl get vm -A -o wide | grep <tkc-name> | awk '{ print $6 }'` ; do ssh -o StrictHostKeyChecking=no -i <tkc.pem> vmware-system-user@${ip} "date"; done

where:

<tkc-name> is the name of the TKC

<tkc.pem> is the filename of the ssh-privatekey retrieved and base64-decoded from the <tkc-name>-ssh secret on the Supervisor cluster.
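
For reference, a typical way to retrieve and decode that key on the Supervisor CPVM is shown below; the namespace placeholder (the vSphere Namespace the TKC was created in) is illustrative and should be adjusted to your environment:

# kubectl get secret <tkc-name>-ssh -n <tkc-namespace> -o jsonpath='{.data.ssh-privatekey}' | base64 -d > tkc.pem
# chmod 600 tkc.pem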