TKG Worker Node Stuck in deleting state due to Volume Detach failure

Article ID: 438053

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • A Tanzu Kubernetes Grid (TKG) worker node transitions to a NotReady or SchedulingDisabled state.
  • The associated Machine object remains stuck in the Deleting phase.
  • The Machine event status shows:
    Message: Waiting for node volumes to be detached
    Reason: WaitingForVolumeDetach

  • In the csi-attacher container logs of the csi-controller pod, errors similar to the following are observed:
    Error processing "csi-####-####-####": failed to detach: could not mark as detached: volumeattachments.storage.k8s.io "csi-####-####-####" not found

  • In the vCenter UI, the affected virtual machine still has multiple VMDK files physically attached, and Kubernetes shows multiple application pods on the affected node stuck in a Terminating state.
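
The symptoms above can be confirmed from the command line. The following is a minimal sketch, not an official procedure. The function names are hypothetical, and the CSI namespace (vmware-system-csi) and deployment name (vsphere-csi-controller) are assumptions that should be checked against the environment. Machine objects are queried in the context that owns them, which is typically the Supervisor.

```shell
# Hypothetical helper functions for confirming the symptoms.
# ASSUMPTION: the CSI controller runs as deployment "vsphere-csi-controller"
# in namespace "vmware-system-csi"; override via CSI_NAMESPACE / CSI_DEPLOYMENT.

# List the Machine objects in a namespace together with their phase.
show_machines() {
  kubectl get machines -n "$1" -o wide
}

# Print the status and events of one Machine; look for
# "Reason: WaitingForVolumeDetach" in the output.
describe_machine() {
  kubectl describe machine "$2" -n "$1"
}

# Search the most recent csi-attacher log lines for detach failures.
find_detach_errors() {
  kubectl logs -n "${CSI_NAMESPACE:-vmware-system-csi}" \
    "deployment/${CSI_DEPLOYMENT:-vsphere-csi-controller}" \
    -c csi-attacher --tail=500 | grep -i "failed to detach"
}
```

For example, `describe_machine <NAMESPACE> <MACHINE_NAME>` prints the Machine status, and `find_detach_errors` prints only the log lines that mention a failed detach.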

Environment

  • VMware vSphere Kubernetes Service
  • VMware Supervisor
  • vCenter Server 8.x

Cause

  • A state desynchronization occurs between the Kubernetes VolumeAttachment API objects and the vSphere Cloud Native Storage (CNS) layer.
  • This happens when application pods or VolumeAttachment objects are removed before the CSI driver can successfully complete the physical disk detachment from the VM hardware.
  • Pods stuck in a Terminating state hold mount points open, creating "phantom" attachments that block Cluster API (CAPI) finalizers from deleting the Machine object.
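
One way to see the desynchronization described above is to compare what Kubernetes still records for the node with what vCenter shows. The sketch below lists the VolumeAttachment objects that reference a given node. The function name is hypothetical, and the approach is an illustration rather than a documented diagnostic step.

```shell
# Hypothetical helper: list VolumeAttachment objects that reference a node.
# Output columns: attachment name, node name, PersistentVolume, attached status.
list_node_volumeattachments() {
  kubectl get volumeattachments --no-headers \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached \
    | awk -v n="$1" '$2 == n'
}
```

Running `list_node_volumeattachments <NODE_NAME>` in the workload cluster and finding fewer entries than the number of VMDKs attached to the VM in vCenter would be consistent with the mismatch described in this section.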

Resolution

  1. Identify the application pods that are stuck in the Terminating state on the affected worker node.
  2. Force delete the terminating pods to release the persistent volume mount points by running the following command:
    kubectl delete pod <POD_NAME> -n <NAMESPACE> --force --grace-period=0

  3. If the pods are managed by a Deployment, ReplicaSet, or StatefulSet, they will be redeployed on other worker nodes.
  4. Monitor the vCenter UI and verify that all associated VMDKs are successfully detached from the affected worker node VM.
  5. Once the volumes are physically detached, the stalled vSphere CSI reconciliation can complete, the stale Machine object is removed, and the MachineDeployment automatically creates a healthy replacement worker node.
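
Steps 1 and 2 can be scripted when many pods are affected. The following is a minimal sketch with hypothetical function names. It assumes the STATUS column of the default `kubectl get pods` output reads Terminating for the affected pods, and it should be run against the workload cluster. Review the list before deleting anything.

```shell
# Hypothetical helpers for steps 1 and 2.

# Step 1: print "<namespace> <pod>" for every pod on the node whose
# STATUS column is Terminating.
list_terminating_pods() {
  kubectl get pods --all-namespaces --no-headers \
    --field-selector "spec.nodeName=$1" \
    | awk '$4 == "Terminating" { print $1, $2 }'
}

# Step 2: force delete each Terminating pod found on the node.
force_delete_terminating_pods() {
  list_terminating_pods "$1" | while read -r ns pod; do
    # stdin is redirected so the command cannot consume the pod list.
    kubectl delete pod "$pod" -n "$ns" --force --grace-period=0 </dev/null
  done
}
```

Run `list_terminating_pods <NODE_NAME>` first to check which pods would be affected, then `force_delete_terminating_pods <NODE_NAME>`. Afterwards, `kubectl get machines -n <NAMESPACE> -w` can be used to watch the stale Machine being removed and its replacement being created.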