Worker node stuck in Deleting state after successful drain – Waiting for volume detach

Article ID: 422044


Products

VMware vSphere Kubernetes Service

Issue/Introduction

After upgrading vCenter from 7.0 U3r → 8.0 U3a as part of the planned VCF 4.5.2 → 5.2.1 migration, the customer observed that one worker node (x) remained stuck in the Deleting phase even after a successful drain operation.

During this upgrade, the Supervisor Cluster was upgraded from v1.26.8 → v1.27.5, and the control API migrated from CAPW to CAPV, changing certain resource ownership and CSI handling behavior.
The node status showed:

DrainingSucceeded=True
InfrastructureReady=True
VolumeDetachSucceeded=False
Reason=WaitingForVolumeDetach
Message=Waiting for node volumes to be detached
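
These conditions are surfaced on the owning Machine object in the Supervisor cluster. A minimal way to inspect them, assuming kubectl is pointed at the Supervisor context and using placeholder names for the Machine and its namespace:

 #kubectl get machines -A
 #kubectl describe machine worker-node-x -n tkc-namespace

The describe output lists the VolumeDetachSucceeded condition with the WaitingForVolumeDetach reason shown above.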

Environment

  • Product: vSphere with Tanzu / Tanzu Kubernetes Grid (Supervisor)

  • Versions:

    • vCenter: 8.0 U3a (upgraded from 7.0 U3r)

    • VCF: 5.2.1 (upgraded from 4.5.2)

    • Supervisor Cluster: v1.27.5 (upgraded from v1.26.8)

    • CAPW → CAPV migration performed (as part of Supervisor API alignment)

  • CSI Driver: csi.vsphere.vmware.com

  • Cluster Type: Workload Cluster (Guest Cluster / TKC)

Cause

The root cause was missed Detach event handling in the CSI driver post-upgrade, which left an orphaned finalizer on the VolumeAttachment object and blocked cleanup of the node.
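
Before making any changes, this state can be confirmed on the affected VolumeAttachment: typically the object already carries a deletionTimestamp but is held back only by the external-attacher finalizer. A minimal check, using the placeholder object name csi-xxx from the resolution steps below:

 #kubectl get volumeattachment csi-xxx -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'

A deletionTimestamp in the past combined with the external-attacher/csi-vsphere-vmware-com finalizer indicates that cleanup is blocked on the missed detach event.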

Resolution


Manual Resolution Steps:

1. Verify in vCenter that the disk (UUID: yyyy) is not attached to any VM (a sketch for deriving this UUID from the stuck VolumeAttachment follows these steps).

2. Identify the stuck VolumeAttachment:

 #kubectl get volumeattachments -A -o wide | grep csi-xxx

3. Edit the object and remove the finalizer:
 #kubectl edit volumeattachments.storage.k8s.io csi-xxx

Lines similar to the following need to be deleted:
finalizers:
- external-attacher/csi-vsphere-vmware-com

4. Validate that the object was deleted automatically:
 #kubectl get volumeattachments -A | grep csi-xxx
 #kubectl get nodes 
 

Additional Information

  • Once the finalizer was removed, Kubernetes immediately garbage-collected the orphaned VolumeAttachment.

  • The stuck worker node x  was automatically deleted by the Cluster API controller.

  • The cluster reconciled successfully; no pending Machine or VolumeAttachment objects remained (a Supervisor-side check is sketched below).

  • Validation through kubectl get nodes confirmed that all nodes were in Ready state.
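
The Machine-level cleanup described above can also be confirmed from the Supervisor cluster; a brief sketch, run with kubectl pointed at the Supervisor context:

 #kubectl get machines -A

No Machine should remain in a Deleting phase once the orphaned VolumeAttachment has been garbage-collected.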