TKG Worker Node Stuck in deleting state due to Volume Detach failure
Article ID: 438053
Products
VMware vSphere Kubernetes Service
Issue/Introduction
A Tanzu Kubernetes Grid (TKG) worker node transitions to a NotReady or SchedulingDisabled state.
The associated Machine object remains stuck in the Deleting phase.
The Machine event status shows:
    Message: Waiting for node volumes to be detached
    Reason: WaitingForVolumeDetach
The csi-attacher container logs of the csi-controller pod show errors similar to the following:
    Error processing "csi-####-####-####": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-####-####-####" not found
In the vCenter UI, the affected virtual machine still has multiple VMDK files physically attached, and Kubernetes shows multiple application pods stuck in a Terminating state on the affected node.
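The symptoms above can be confirmed from the command line. The following is a minimal sketch, not an official procedure: the helper names `deleting_machines` and `detach_errors` are hypothetical, and the namespace `vmware-system-csi` and deployment name `vsphere-csi-controller` in the usage comments are assumptions that may differ in your environment.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for confirming the symptoms; both only filter text on stdin.

# Keep the header row plus any row of `kubectl get machines -A` that has a
# column equal to "Deleting" (the Machine phase).
deleting_machines() {
  awk 'NR == 1 { print; next }
       { for (i = 1; i <= NF; i++) if ($i == "Deleting") { print; break } }'
}

# Keep only csi-attacher log lines that report a failed detach.
detach_errors() {
  grep -E 'failed to detach|could not mark as detached'
}

# Usage (run against the context that holds the Machine objects):
#   kubectl get machines -A | deleting_machines
#   kubectl describe machine <MACHINE_NAME> -n <NAMESPACE>
# Usage (namespace and deployment name are assumptions):
#   kubectl logs -n vmware-system-csi deployment/vsphere-csi-controller -c csi-attacher | detach_errors
```

The `kubectl describe machine` output should include the WaitingForVolumeDetach reason if the Machine is blocked on volume detachment.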
Environment
VMware vSphere Kubernetes Service
VMware Supervisor
vCenter Server 8.x
Cause
A state desynchronization occurs between the Kubernetes VolumeAttachment API objects and the vSphere Cloud Native Storage (CNS) layer.
This happens when application pods or VolumeAttachment objects are removed before the CSI driver can successfully complete the physical disk detachment from the VM hardware.
Pods stuck in a Terminating state hold mount points open, creating "phantom" attachments that block Cluster API (CAPI) finalizers from deleting the Machine object.
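To see which VolumeAttachment objects Kubernetes still records for the affected node, and compare them with the disks shown on the VM in vCenter, a sketch along these lines may help. The helper name `attachments_on_node` is hypothetical, and the column layout is defined by the `custom-columns` expression in the usage comment rather than by kubectl's default output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the header row plus rows whose third column (NODE,
# as laid out by the custom-columns expression below) equals the given node name.
attachments_on_node() {
  awk -v node="$1" 'NR == 1 || $3 == node'
}

# Usage (run inside the workload cluster):
#   kubectl get volumeattachments \
#     -o custom-columns='NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached' \
#     | attachments_on_node <NODE_NAME>
```

If the VM in vCenter still has VMDKs attached that have no matching row here, that is consistent with the desynchronization described above.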
Resolution
Identify the application pods that are stuck in the Terminating state on the affected worker node.
Force delete the terminating pods to release the persistent volume mount points by running the following command:
    kubectl delete pod <POD_NAME> -n <NAMESPACE> --force --grace-period=0
If the pods are managed by a Deployment, ReplicaSet, or StatefulSet, they will be recreated on other worker nodes.
Monitor the vCenter UI and verify that all associated VMDKs are successfully detached from the affected worker node VM.
Once the volumes are physically detached, the stalled vSphere CSI reconciliation loop can complete, the stale Machine object is removed, and the MachineDeployment automatically creates a healthy replacement worker node.
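When many pods are stuck, the identify and force-delete steps above can be sketched as a small filter. This is an illustrative helper, not part of the product: `force_delete_cmds` is a hypothetical name, and it only prints the delete commands so they can be reviewed before anything is executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pods -A -o wide --no-headers` output on
# stdin and print one force-delete command per pod whose STATUS column
# (field 4) is Terminating. Nothing is deleted by this function itself.
force_delete_cmds() {
  awk '$4 == "Terminating" {
    printf "kubectl delete pod %s -n %s --force --grace-period=0\n", $2, $1
  }'
}

# Usage (run inside the workload cluster):
#   kubectl get pods -A -o wide --no-headers \
#     --field-selector spec.nodeName=<NODE_NAME> | force_delete_cmds
# After reviewing the printed commands, append `| sh` to run them.
# Then watch the Machine objects until the stale one disappears:
#   kubectl get machines -n <NAMESPACE> -w
```

Printing the commands first is a deliberate choice: a forced deletion skips graceful termination, so it is worth confirming the pod list before running it.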