vSphere Supervisor Orphaned/Stale Workload Cluster Nodes Clean Up

Article ID: 388885


Products

VMware vSphere Kubernetes Service

Issue/Introduction

After an upgrade or change to the vSphere Supervisor Workload Cluster, the cluster's context shows nodes in NotReady,SchedulingDisabled state.

These nodes do not appear in the vSphere web client or Supervisor cluster context. As a result, these nodes are considered orphaned or stale.

NotReady,SchedulingDisabled state indicates that the system has cordoned the node and is attempting to drain the pods off of the node so that it can be deleted.
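
For illustration, a node stuck in this state appears as follows in kubectl get nodes output from the workload cluster context; the node names, ages, and versions below are hypothetical:

  kubectl get nodes

  NAME                   STATUS                        ROLES    AGE   VERSION
  workload-node-abc123   NotReady,SchedulingDisabled   <none>   30d   v1.28.8+vmware.1
  workload-node-def456   Ready                         <none>   30d   v1.28.8+vmware.1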

 

In the vSphere web client, there are no virtual machines matching the names of the nodes stuck in NotReady,SchedulingDisabled state.

 

While connected to the Supervisor cluster context, the following symptoms are present:

  • The nodes in NotReady,SchedulingDisabled state are not present in the Supervisor cluster context:

    • If the environment is on vSphere 8.X or higher:
      kubectl get vm,vspheremachine,machine -o wide -n <affected cluster namespace>
    • If the environment is on vSphere 7.X:
      kubectl get vm,wcpmachine,machine -o wide -n <affected cluster namespace>
  • Note: If VM and VSphereMachine/WCPMachine resources do exist for the node, that is a separate issue from the one described in this KB article. A quick cross-check is sketched below.
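
As a quick cross-check, the Supervisor-side listing can be filtered for the stuck node's name; for a truly orphaned node, the command below returns no output. The namespace and node name are placeholders, and on vSphere 7.x, wcpmachine replaces vspheremachine:

  kubectl get vm,vspheremachine,machine -o wide -n <affected cluster namespace> | grep <NotReady,SchedulingDisabled node name>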

 

While connected to the affected Workload cluster context, the following symptoms are present:

  • There are nodes in NotReady,SchedulingDisabled state:
    kubectl get nodes
  • There are no VolumeAttachments reporting ATTACHED true for the nodes in NotReady,SchedulingDisabled state:
    kubectl get volumeattachment -A -o wide | grep <NotReady,SchedulingDisabled node name>

    Note: If volumes are still attached to the node, the reason they are not detaching must be investigated before proceeding.

  • The kube-controller-manager pod logs may show error messages similar to the example below for the node(s) stuck in NotReady,SchedulingDisabled state (a combined check is sketched after this list):
    kubectl get pods -n kube-system | grep kube-controller
    
    kubectl logs -n kube-system <kube-controller-manager pod name>
    
    node_lifecycle_controller.go:177] unable to delete node "<node-stuck-notready-schedulingdisabled>": nodes "<node-stuck-notready-schedulingdisabled>" is forbidden: User "system:serviceaccount:vmware-system-cloud-provider:cloud-provider-svc-account" cannot delete resource "nodes" in API group "" at the cluster scope
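
As a combined check, the sketch below gathers these workload cluster symptoms for a single stuck node in one pass. It assumes a shell connected to the affected workload cluster context; the node name is a placeholder:

  # Placeholder: replace with the name of the stuck node.
  NODE="<NotReady,SchedulingDisabled node name>"

  # Confirm the node is still stuck in NotReady,SchedulingDisabled.
  kubectl get nodes | grep "$NODE"

  # Confirm no VolumeAttachments still reference the node (VolumeAttachments are cluster-scoped).
  kubectl get volumeattachment -o wide | grep "$NODE"

  # Search the kube-controller-manager logs for deletion errors mentioning the node.
  KCM_POD=$(kubectl get pods -n kube-system -o name | grep kube-controller-manager | head -n 1)
  kubectl logs -n kube-system "$KCM_POD" | grep "unable to delete node" | grep "$NODE"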

Environment

vSphere Supervisor 7.0

vSphere Supervisor 8.0

vSphere Supervisor 9.0

This issue can occur regardless of whether the affected workload cluster is managed by Tanzu Mission Control (TMC).

Cause

The cluster lifecycle workflow did not properly finish cleaning up the node from within the affected vSphere Kubernetes cluster.

This is a known issue in TKR versions: 1.28.8, 1.29.4, 1.30.1, 1.31.1

This issue has been resolved in TKR versions: 1.28.15, 1.29.12, 1.30.8, 1.31.4, 1.32.0 and above.

Resolution

This issue has been resolved in TKR and KR versions: 1.28.15, 1.29.12, 1.30.8, 1.31.4, 1.32.0 and above. 

If the nodes appear in neither the vSphere web client nor the Supervisor cluster context, they are considered orphaned or stale and must be cleaned up manually.

Please reach out to VMware by Broadcom Technical Support for assistance in properly cleaning up these orphaned or stale nodes.