After an upgrade or change to the vSphere Supervisor Workload Cluster, the cluster's context shows nodes in NotReady,SchedulingDisabled state.
These nodes do not appear in the vSphere web client or Supervisor cluster context. As a result, these nodes are considered orphaned or stale.
The NotReady,SchedulingDisabled state indicates that the system has cordoned the node and is attempting to drain the pods off of the node so that it can be deleted.
In the vSphere web client, there are no virtual machines matching the node names stuck in NotReady,SchedulingDisabled state.
While connected to the Supervisor cluster context, no machine objects matching the affected node names appear in the output of the following commands:
kubectl get vm,vspheremachine,machine -o wide -n <affected cluster namespace>
kubectl get vm,wcpmachine,machine -o wide -n <affected cluster namespace>
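To confirm that a stuck node has no backing machine object, the node names from the workload cluster can be compared against the machine names known to the Supervisor. A minimal sketch; the file names are hypothetical, and each list is assumed to have been saved from its respective cluster context:

```shell
# From the workload cluster context (hypothetical file name):
#   kubectl get nodes -o name | cut -d/ -f2 | sort > workload-nodes.txt
# From the Supervisor cluster context:
#   kubectl get machine -n <affected cluster namespace> -o name | cut -d/ -f2 | sort > supervisor-machines.txt
# Node names present in the workload cluster but absent from the Supervisor
# are the orphaned/stale candidates:
comm -23 workload-nodes.txt supervisor-machines.txt
```

`comm -23` prints only the lines unique to the first (sorted) file, i.e. nodes with no matching machine object.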
While connected to the affected Workload cluster context, the following symptoms are present:
kubectl get nodes
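When many nodes are listed, the stuck ones can be filtered out by the STATUS column; a sketch, assuming the default `kubectl get nodes` column layout (NAME STATUS ROLES AGE VERSION):

```shell
# Print only the names of nodes whose STATUS column reads NotReady,SchedulingDisabled.
kubectl get nodes --no-headers \
  | awk '$2 == "NotReady,SchedulingDisabled" { print $1 }'
```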
kubectl get volumeattachment -A -o wide | grep <NotReady,SchedulingDisabled node name>
kubectl get pods -n kube-system | grep kube-controller
kubectl logs -n kube-system <kube-controller-manager pod name>
node_lifecycle_controller.go:177] unable to delete node "<node-stuck-notready-schedulingdisabled>": nodes "<node-stuck-notready-schedulingdisabled>" is forbidden: User "system:serviceaccount:vmware-system-cloud-provider:cloud-provider-svc-account" cannot delete resource "nodes" in API group "" at the cluster scope
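If the controller has logged this error for several nodes, the affected node names can be extracted from the log output. A sketch, assuming the log has been saved to a file (the file name is hypothetical):

```shell
# Extract the unique node names that the node lifecycle controller failed to delete.
# kube-controller-manager.log is assumed to contain the output of:
#   kubectl logs -n kube-system <kube-controller-manager pod name>
grep 'unable to delete node' kube-controller-manager.log \
  | sed 's/.*unable to delete node "\([^"]*\)".*/\1/' \
  | sort -u
```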
vSphere Supervisor 7.0
vSphere Supervisor 8.0
vSphere Supervisor 9.0
This issue can occur regardless of whether the affected workload cluster is managed by Tanzu Mission Control (TMC).
The cluster lifecycle workflow did not properly finish cleaning up the node from within the affected vSphere Kubernetes cluster.
This is a known issue in TKR versions: 1.28.8, 1.29.4, 1.30.1, 1.31.1
This issue has been resolved in TKR and KR versions: 1.28.15, 1.29.12, 1.30.8, 1.31.4, 1.32.0 and above.
Because these nodes do not appear in the vSphere web client or the Supervisor cluster context, they are considered orphaned or stale and must be cleaned up manually.
Please reach out to VMware by Broadcom Technical Support for assistance in properly cleaning up these orphaned or stale nodes.