vSphere Kubernetes Cluster Upgrade Stuck with New Nodes in Provisioning state due to Guest Cluster Cloud Provider Token Expiry and VMs present in the Cluster but missing in the vCenter and Supervisor cluster


Article ID: 369279


Products

VMware vSphere with Tanzu

Issue/Introduction

While connected to the Supervisor cluster context, the following symptoms are present:

  • New machines will not transition from the Provisioning to the Running state because NodeName (status.nodeRef) is not set on the machine (see the example output after this list).
  • The node for the new machine will not have spec.providerID set on it.
    • kubectl get machines -n <affected cluster's namespace>
  • If control plane nodes of the vSphere Kubernetes cluster are affected, the capi-kubeadm-control-plane-controller-manager pods report error messages similar to the following:
    • stderr F {"level":"warn","ts":"YYYY-MM-DDTHH:MM:SS.MSSZ","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0152e2700/etcd-my-cluster-control-plane-abc123","attempt":XX,"error":"rpc error: code = Unavailable desc = etcdserver: request timed out"}
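
For reference, the output can look similar to the following from the Supervisor cluster context. The cluster name, machine names, and version below are illustrative placeholders only, and the column layout varies by Cluster API version:

    kubectl get machines -n <affected cluster's namespace>
    NAME                              CLUSTER      NODENAME   PROVIDERID   PHASE          AGE   VERSION
    my-cluster-control-plane-abc123   my-cluster                           Provisioning   45m   v1.26.5+vmware.1
    my-cluster-node-pool-1-def456     my-cluster                           Provisioning   45m   v1.26.5+vmware.1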

 

While connected to the vSphere Kubernetes cluster context, the following symptoms are present:

  • When viewing the output of "kubectl get nodes", some nodes show in NotReady state. These VMs are not present in the vCenter web UI or in the Supervisor cluster context.
    • Note: The node names in the output of "kubectl get nodes" from within the vSphere Kubernetes context correspond with the names of the VMs in the Supervisor cluster context and the names of the VMs in the vCenter web UI. In this scenario, the NotReady nodes from within the vSphere Kubernetes context do not have the expected matching VMs in the Supervisor cluster context or the vCenter web UI.
    • kubectl get nodes
  • The logs for the guest cluster cloud provider pod in the vSphere Kubernetes cluster report error messages similar to the following (see the example commands after this list):
    • "Error trying to find VM: Unauthorized"

 

This issue can occur during scale-out operations for a cluster, which then blocks upgrading the cluster's Kubernetes version.

Environment

vSphere 7.0 with Tanzu

vSphere 8.0 with Tanzu

vSphere Kubernetes cluster running a TKR version lower than v1.29.x

This issue can occur regardless of whether the cluster is managed by TMC (Tanzu Mission Control).
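
One way to confirm the TKR version a cluster is running (a sketch, assuming the cluster was created as a TanzuKubernetesCluster object and kubectl is connected to the Supervisor cluster context):

    # The output includes the Tanzu Kubernetes Release (TKR) version in use by the cluster.
    kubectl get tanzukubernetescluster -n <affected cluster's namespace>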

Cause

The supervisor access token mounted into the vSphere Kubernetes cluster's cloud provider is expired.

The cloud provider is not able to update the token if the vSphere Kubernetes cluster's TKR version is lower than v1.29.x.

This issue is fixed in TKR versions v1.29.x and higher.
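
If confirmation of the expiry is needed: Kubernetes service account tokens are JWTs, so the expiry can usually be read directly from the token. The snippet below is a minimal sketch, assuming the token value has been copied into a TOKEN shell variable (for example, from the secret or volume mounted into the guest-cluster-cloud-provider pod; the exact secret name and mount path vary by TKR version) and that base64 and jq are available:

    # Extract the JWT payload (second dot-separated field) and convert base64url to standard base64.
    PAYLOAD=$(printf '%s' "$TOKEN" | cut -d '.' -f2 | tr '_-' '/+')
    # Pad to a multiple of four characters so base64 can decode it.
    while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD="${PAYLOAD}="; done
    # Print the "exp" claim as a timestamp; if it is in the past, the token has expired.
    printf '%s' "$PAYLOAD" | base64 -d | jq -r '.exp | todate'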

Resolution

The services responsible for reconciling the nodes and VMs associated with the affected vSphere Kubernetes cluster will need to be restarted.

This includes the guest-cluster-cloud-provider pod within the vSphere Kubernetes cluster.

Note: The node names in the output of "kubectl get nodes" from within the vSphere Kubernetes context correspond with the names of the VMs and machines in the Supervisor cluster context as well as the names of the VMs in the vCenter web UI. In this scenario, the NotReady nodes from within the vSphere Kubernetes context do not have the expected matching VMs in the Supervisor cluster context or the vCenter web UI.

    1. Connect into the Supervisor cluster context
    2. Note down the names of virtual machines in the affected vSphere Kubernetes cluster's namespace:
      • kubectl get vm -n <affected vSphere Kubernetes cluster namespace>
    3. Restart the following deployments, which manage the reconciliation of vSphere Kubernetes cluster nodes:
      • kubectl get deploy -A | grep capi
      • kubectl rollout restart deploy capi-kubeadm-control-plane-controller-manager -n <capi namespace>
      • kubectl rollout restart deploy capi-controller-manager -n <capi namespace>
    4. Connect into the affected vSphere Kubernetes cluster context
    5. Restart the guest-cluster-cloud-provider deployment to regenerate its cloud provider token:
      • kubectl get deploy -A | grep guest-cluster-cloud-provider
      • kubectl rollout restart deploy guest-cluster-cloud-provider -n <cloud provider namespace>
    6. Confirm that the NotReady nodes which are missing matching VMs in the Supervisor cluster context and vCenter web UI are no longer present (a verification sketch follows this list):
      • Compare the output of the command below with the output from Step 2 above.
      • kubectl get nodes
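
The commands below are a minimal verification sketch for the steps above; the namespace placeholders are the ones identified in Steps 3 and 5:

    # Supervisor cluster context: wait for the Cluster API controllers to finish restarting.
    kubectl rollout status deploy capi-controller-manager -n <capi namespace>
    kubectl rollout status deploy capi-kubeadm-control-plane-controller-manager -n <capi namespace>

    # vSphere Kubernetes cluster context: wait for the cloud provider restart, then confirm that
    # "Unauthorized" errors are no longer being logged.
    kubectl rollout status deploy guest-cluster-cloud-provider -n <cloud provider namespace>
    kubectl logs deploy/guest-cluster-cloud-provider -n <cloud provider namespace> --since=10m | grep -i unauthorized

    # Confirm each remaining node has spec.providerID set and matches a VM noted in Step 2.
    kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDERID:.spec.providerID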

 

If the issue persists after performing the above steps, please open a ticket with VMware by Broadcom Support, referencing this KB article.

Additional Information

This has been fixed in upstream CPI, which now watches for updates to the secret that holds the token and restarts the CPI pod when required.
All TKR versions v1.29.x and higher include the fix.