While connected to the Supervisor cluster context, the following symptoms are present:
kubectl get machines -n <affected cluster's namespace>
stderr F {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.MSSZ","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0152e2700/etcd-my-cluster-control-plane-abc123","attempt":XX,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
While connected to the vSphere Kubernetes cluster context, the following symptoms are present:
kubectl get nodes
"Error trying to find VM: Unauthorized"
This issue can occur during scale-out operations for a cluster, which then blocks upgrading the cluster's Kubernetes version.
vSphere 7.0 with Tanzu
vSphere 8.0 with Tanzu
vSphere Kubernetes cluster running a TKR version lower than v1.29.x
This issue can occur regardless of whether the cluster is managed by Tanzu Mission Control (TMC).
The Supervisor access token mounted into the vSphere Kubernetes cluster's cloud provider has expired.
The cloud provider cannot refresh this token when the vSphere Kubernetes cluster's TKR version is lower than v1.29.x.
This issue is fixed in TKR versions v1.29.x and higher.
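As a diagnostic, the expiry of a mounted token can be inspected by decoding the `exp` claim of its JWT payload. The following is a minimal sketch; the commented-out retrieval step uses placeholder secret and namespace names, since the exact secret location is environment-specific:

```shell
#!/bin/sh
# Sketch: decode a JWT's "exp" claim to check whether a token has expired.

jwt_exp() {
  # The payload is the second dot-separated field of the JWT.
  payload=$(printf '%s' "$1" | cut -d. -f2)
  # Restore the base64 padding that JWT encoding strips.
  case $((${#payload} % 4)) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  # base64url uses '-' and '_' in place of '+' and '/'.
  printf '%s' "$payload" | tr '_-' '/+' | base64 -d |
    grep -o '"exp":[0-9]*' | cut -d: -f2
}

# Hypothetical usage (secret name and namespace are placeholders):
#   token=$(kubectl get secret <token secret> -n <namespace> \
#     -o jsonpath='{.data.token}' | base64 -d)
#   [ "$(jwt_exp "$token")" -lt "$(date +%s)" ] && echo "token expired"
```

If the decoded `exp` timestamp is in the past, the symptoms above are consistent with the expired-token cause described here.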
The services responsible for reconciling the nodes and VMs associated with the affected vSphere Kubernetes cluster will need to be restarted.
This includes the guest-cluster-cloud-provider pod within the vSphere Kubernetes cluster.
Note: The output of "kubectl get nodes" from within the vSphere Kubernetes cluster context corresponds to the names of the VMs and machines in the Supervisor cluster context, as well as the names of the VMs in the vCenter web UI. In this scenario, the NotReady nodes seen from within the vSphere Kubernetes cluster context do not have the expected matching VMs in the Supervisor cluster or the vCenter web UI.
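The correspondence described in the note can be checked mechanically by diffing node names from the vSphere Kubernetes cluster context against Machine names in the Supervisor namespace. A sketch, assuming both kubeconfig contexts are available (context and namespace arguments are placeholders):

```shell
#!/bin/sh
# Sketch: report nodes in the guest cluster that have no matching Machine
# object in the Supervisor cluster namespace.

find_orphan_nodes() {
  guest_ctx=$1; sup_ctx=$2; ns=$3
  nodes=$(kubectl --context "$guest_ctx" get nodes \
            -o jsonpath='{.items[*].metadata.name}')
  machines=$(kubectl --context "$sup_ctx" get machines -n "$ns" \
            -o jsonpath='{.items[*].metadata.name}')
  for n in $nodes; do
    case " $machines " in
      *" $n "*) ;;                               # node has a matching Machine
      *) echo "no Machine for node: $n" ;;       # candidate for this issue
    esac
  done
}
```

Any node reported here matches the NotReady-without-a-VM pattern described in the note.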
kubectl get vm -n <affected vSphere Kubernetes cluster namespace>
kubectl get deploy -A | grep capi
kubectl rollout restart deploy capi-kubeadm-control-plane-controller-manager -n <capi namespace>
kubectl rollout restart deploy capi-controller-manager -n <capi namespace>
kubectl get deploy -A | grep guest-cluster-cloud-provider
kubectl rollout restart deploy guest-cluster-cloud-provider -n <cloud provider namespace>
kubectl get nodes
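The restart steps above can be collected into one sketch. Deployment names are taken from this article; each namespace is looked up at run time (as in the `kubectl get deploy -A | grep ...` steps) rather than hard-coded, and `rollout status` waits for each restart to finish:

```shell
#!/bin/sh
# Sketch of the restart sequence above.

restart_deploy() {
  name=$1
  # Find the namespace holding the deployment (first match wins).
  ns=$(kubectl get deploy -A | awk -v d="$name" '$2 == d {print $1; exit}')
  if [ -z "$ns" ]; then
    echo "deployment $name not found" >&2
    return 1
  fi
  kubectl rollout restart deploy "$name" -n "$ns"
  kubectl rollout status deploy "$name" -n "$ns"   # wait for completion
}

# Usage, per the steps above:
#   In the Supervisor cluster context:
#     restart_deploy capi-kubeadm-control-plane-controller-manager
#     restart_deploy capi-controller-manager
#   In the vSphere Kubernetes cluster context:
#     restart_deploy guest-cluster-cloud-provider
```

After the restarts complete, re-run "kubectl get nodes" in the vSphere Kubernetes cluster context to confirm the nodes recover.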
If the issue persists after the above steps, open a ticket with VMware by Broadcom Support and reference this KB article.
This has been fixed in upstream CPI: the cloud provider now watches for updates to the secret containing the token and restarts the CPI pod when required.
All TKR versions 1.29 and higher will include the fix.